Willemse I., Nyelisani P. Statistical Methods and Calculation Skills


Statistical Methods and Calculation Skills
Fourth Edition

As with its predecessors, this fourth edition of Statistical Methods and Calculation Skills aims to
equip students with the skills to apply statistical analysis and quantitative techniques in the
research and working environments, and to make effective decisions.

Key features:
• A theoretical framework for statistical problem-solving
• A practical step-by-step approach to applying methods and calculations
• A complete list of outcomes in each unit
• Worked examples with detailed explanations
• Guided activities and a range of self-test questions.

Part A – statistical methods – covers the collection and presentation of data; descriptive
and inferential methods of analysis; index numbers; regression and correlation analysis;
time series; probability and probability distributions; statistical estimation; and hypothesis
testing. Calculation skills are revised in Part B, which deals with elementary calculations
such as exponents, decimals, scientific notation, logarithms and rounding. Students with no
mathematics background can learn how to do basic calculations before going on to statistical
applications. For some courses, calculations such as interest, future values of investments,
graphs and ratios form part of the core module and are also covered here.

The book includes examples and activities from the fields of business, food and biotechnology,
engineering, medicine and environmental studies.

About the authors:


Isabel Willemse was a Statistics lecturer in the Department of Statistics at the University of
Johannesburg and is now retired.
Peter Nyelisani teaches Statistics and is deputy head of the Department of Statistics at the
University of Johannesburg.

www.jutaacademic.co.za

Statistical Methods
and Calculation Skills

Isabel Willemse
Peter Nyelisani



Statistical Methods and Calculation Skills

First published 2015


First print published 2001
Second edition 2003
Third edition 2009
Fourth edition 2015

Juta and Company (Pty) Ltd


First Floor
Sunclare Building
21 Dreyer Street
Claremont
7708

PO Box 14373, Lansdowne 7779, Cape Town, South Africa

© 2015 Juta & Company (Pty) Ltd

ISBN 978 1 48510 276 2 (Print)


ISBN 978 1 48510 486 5 (WebPDF)

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or any information storage or retrieval system,
without prior permission in writing from the publisher. Subject to any applicable licensing terms and conditions
in the case of electronically supplied publications, a person may engage in fair dealing with a copy of this
publication for his or her personal or private use, or his or her research or private study. See section 12(1)(a) of the
Copyright Act 98 of 1978.

Project Manager: Willemien Jansen


Editor: Rod Prodgers
Proofreader: Michelle Savage
Cover designer: MR Design
Typesetter: Trace Digital Services

The author and the publisher believe on the strength of due diligence exercised that this work does not contain
any material that is the subject of copyright held by another person. In the alternative, they believe that any protected
pre-existing material that may be comprised in it has been used with appropriate authority or has been
used in circumstances that make such use permissible under the law.
Contents

PART ONE: STATISTICAL METHODS

Unit 1: Introduction
1.1 Problem-solving steps
1.2 Definition
1.3 The language of statistics
1.4 Measurement
1.5 Role of the computer in statistics
TEST YOURSELF 1

Unit 2: Collection of data
2.1 Sources of data: where to get the data
2.2 Primary data sources
2.3 Questionnaire design
2.4 Selecting a sample
2.5 Non-random or non-probability sampling methods
2.6 Random sampling
TEST YOURSELF 2

Unit 3: Summarising data using tables and graphs
3.1 Summarising qualitative data in tables and graphs
3.2 Summarising quantitative data in tables
3.3 Summarising quantitative data using graphs
3.4 Using software
TEST YOURSELF 3

Unit 4: Summarising data using numerical descriptors
4.1 Measures of central tendency
4.2 Measures of dispersion
4.3 Measures of shape
4.4 Interpreting centre and variability
4.5 Measures of relative standing
4.6 Measuring dispersion using measures of relative standing
TEST YOURSELF 4

Unit 5: Index numbers
5.1 Construction of a simple index number
5.2 Construction of composite (or aggregate) index numbers
5.3 Additional topics on index numbers
TEST YOURSELF 5

Unit 6: Summarising bivariate data: simple regression and correlation analysis
6.1 Response variable (y) and explanatory variable (x)
6.2 Scatter diagram
6.3 Correlation analysis (r)
6.4 Regression analysis
6.5 Spearman rank correlation coefficient (r_s)
TEST YOURSELF 6

Unit 7: Time series
7.1 Components of a time series
7.2 Historigram
7.3 Time-series decomposition
TEST YOURSELF 7

Unit 8: Probability: basic concepts
8.1 Language of probability
8.2 Approaches to assigning probabilities
8.3 Properties of probabilities
8.4 Forming new events
8.5 Probability rules for compound events
8.6 Counting the possibilities
TEST YOURSELF 8

Unit 9: Probability distributions
9.1 Discrete probability distributions
9.2 Probability distributions for continuous random variables
TEST YOURSELF 9

Unit 10: Statistical inference: estimation
10.1 Statistics and parameters
10.2 Sampling distribution of the means
10.3 Estimating population parameters
10.4 Sample size (n)
TEST YOURSELF 10

Unit 11: Hypothesis testing
11.1 A single sample classical hypothesis test
11.2 Hypothesis testing using the P value approach
11.3 Testing the difference among means and proportions
11.4 Tests using the chi-square distribution (χ²)
TEST YOURSELF 11

PART TWO: CALCULATION SKILLS

Unit 12: Elementary calculations
12.1 The electronic calculator
12.2 The number system
12.3 Common notation
12.4 Basic operations
12.5 Signed numbers
12.6 Exponents (powers) (x^y)
12.7 Square roots (√)
12.8 Logarithms (log)
12.9 Factorial notation (!)
12.10 Sigma notation (Σ)
12.11 Fractions
12.12 Decimal numbers
12.13 Scientific notation
12.14 Rounding off decimals
12.15 Significant digits
12.16 The metric system

Unit 13: Percentages and ratios
13.1 Percentage calculations
13.2 Ratio (proportion) calculation
13.3 Business applications
TEST YOURSELF 13

Unit 14: Equations and graph construction
14.1 Graph construction
14.2 Solution of equations
TEST YOURSELF 14

Unit 15: Interest calculations
15.1 Basic concepts
15.2 Simple interest
15.3 Compound interest
15.4 Nominal and effective rates of interest
15.5 Annuities
TEST YOURSELF 15

Appendix 1: The standard normal distribution
Appendix 2: The t-distribution
Appendix 3: The chi-square distribution
Appendix 4: Random numbers


PART 1
Statistical Methods

UNIT 1: Introduction

This unit deals with the role of statistics in the data analysis process. Concepts
that are basic to the study of statistics are discussed.

After completion of this unit you will be able to:


• recognise the role of statistics in life
• understand the language of statistics
• select suitable measuring scales for different types of data
• understand the role of computers in statistics.

We live in an era where we are faced with increasing amounts of information,
also referred to as data. Every time you read a magazine or newspaper, or
listen to a news bulletin or advertisement, you encounter statistics. People
quote numbers or statistics to support whatever it is they wish you to believe.
Therefore, to perform many tasks efficiently in today’s world you need to have a
basic understanding of statistical methods.
The subject field of Statistics covers a problem-solving process that seeks
answers to questions through data. By itself, data cannot tell you much. When
collected and used properly, data and the statistics calculated from it can help you
to understand situations in order to evaluate your options and make informed
decisions.
To be an informed consumer of information, you must be able to:
• extract information from tables and graphs
• follow numerical arguments
• understand the basics of how data should be gathered, summarised and
analysed to draw statistical conclusions.


1.1  Problem-solving steps


Solving a statistical problem typically comprises the following steps:
1. Identify the problem and ask the question you hope to answer.
2. Collect the information (or data) needed to answer the problem: Identify
an appropriate data source and decide how to measure it. Decide whether
an existing data source is adequate or whether new data must be collected.
Determine if you will use an entire population or a representative sample. If
using a sample, decide on a viable sampling method.
3. Analyse the data: Organise and summarise the data into tables and graphs,
which are effective ways to present data. Numerical summaries allow
increased understanding by making use of single values to represent the
data. This initial analysis provides insight into important characteristics of
the data and gives guidance in selecting appropriate methods for further
analysis.
4. Interpret the results in order to draw conclusions, make recommendations
and assess the risk of an incorrect decision about the original problem under
investigation. With sampling, the process usually involves generalising from
a small group – or sample – of individuals or objects that were studied to a
much larger group or population.

Example 1.1

As part of a weekly quality check to assess the calibration of a filling machine,
the quality control manager randomly selects 50 bottles of beer that were filled
on a specific day.
1. Ask a question: Is the calibration of the filling machine still within acceptable
standards?
2. Collect the appropriate data: Randomly select 50 bottles on a specified day
and measure the contents of each bottle. Record the measurements to the
nearest millilitre.
3. Analyse the data: Summarise the data in a table and draw a graph, such as
a scatter plot, to show the sample data as well as a line graph on the same
plot to indicate the desired fill. The average fill of the sample bottles can also
be calculated together with the standard deviation and other descriptive
summary statistics.
4. Interpret the results and draw conclusions. For example: Compare the scatter
plot with the required standard line graph to get a visual impression of any
deviations. The sample average can also be compared with the required
average to assess the calibration of the filling machine. You can extend the
results from the sample of 50 bottles to all bottles filled during that week.
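
The numerical part of step 3 in Example 1.1 could be sketched in Python as follows. This is only an illustration: the fill volumes and the 340 ml target are hypothetical values that do not come from the example, and a real check would use all 50 measurements.

```python
from statistics import mean, stdev

# Hypothetical fill volumes in millilitres (assumed data, not from the text).
fills_ml = [339, 341, 340, 338, 342, 340, 339, 341, 340, 340]
target_ml = 340  # assumed desired fill; the example does not state one

sample_mean = mean(fills_ml)          # average fill of the sampled bottles
sample_sd = stdev(fills_ml)           # sample standard deviation (n - 1 divisor)
deviation = sample_mean - target_ml   # positive means overfilling on average

print(f"mean = {sample_mean:.1f} ml, sd = {sample_sd:.2f} ml, "
      f"deviation from target = {deviation:+.1f} ml")
```

A small deviation together with a small standard deviation would suggest that the machine is still filling close to the desired level.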

Key components of statistical thinking:


• Use data whenever possible to guide the analysis.
• Look for connections and relationships.
• Understand why data values differ from one another.
1.2 Definition
Statistics is the scientific discipline that provides methods to help us make sense
of data by:
• collecting data in a methodical way
• analysing data using methods to organise and summarise data with tables,
graphs and numbers
• interpreting data to draw conclusions or to answer questions.
The field of statistics can be subdivided into descriptive statistics and inferential
statistics.
Descriptive statistics includes the collection and summarising of data to
give an overview of the information collected.
Inferential statistics entails a process of making an estimate, prediction or
decision about a population based on sample data.
Because a population is almost always very large, a sample is drawn from
the population of interest and summarised using descriptive techniques.
These results are then used in inferential statistics to make decisions about the
population. Such conclusions will not always be correct and it is therefore
necessary to measure the reliability of the conclusions using the confidence
level and the significance level. The confidence level measures the proportion
of times that an estimating procedure will be correct over the long run. When
the purpose of statistical inference is to draw conclusions about the population,
the significance level measures how frequently the conclusion will be wrong. A
2% significance level means that, in the long run, this type of conclusion will be
wrong 2% of the time.
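
The "long run" meaning of a confidence level can be illustrated with a small simulation. The sketch below assumes a hypothetical normal population with known mean and standard deviation, and uses the usual 95% interval for a mean with known standard deviation (sample mean ± 1.96 standard errors); all the numbers are assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

MU, SIGMA = 50.0, 10.0   # hypothetical population mean and standard deviation
N, TRIALS = 30, 2000     # sample size and number of repeated samples
Z = 1.96                 # standard normal value for a 95% confidence level

margin = Z * SIGMA / N ** 0.5   # half-width of each interval
hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    centre = mean(sample)
    # Count the interval as correct if it contains the true population mean.
    if centre - margin <= MU <= centre + margin:
        hits += 1

coverage = hits / TRIALS
print(f"Intervals containing the true mean: {coverage:.1%}")
```

Over many repeated samples the proportion of correct intervals should settle close to 95%, which is what the confidence level describes.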


1.3 The language of statistics


• An investigation or experiment is any process of observation or
measurement.
• Elements are the people or objects about which information is collected.
• A population is the entire group about which you want information. If the
population contains a countable number of items, it is said to be finite, and
when the number of items is unlimited, it is said to be infinite. A study of
the entire population is known as a census. A parameter is a numerical
measure that describes the population. It is calculated using all the data of
the population, such as an average. It is usually indicated by a letter from the
Greek alphabet (e.g. μ, σ, π).
• To gain information about the population, a portion of the population data
can be examined. This portion of data is called a sample. The sample must
be representative of the population. A representative sample is one in which
the relevant characteristics of the sample elements are generally the same
as the characteristics of the population elements. A statistic is a numerical
measure that describes a sample. It is usually indicated by a letter from the
Roman alphabet (e.g. x̄, s, n, p).
• A variable is a characteristic of interest about each element of a population
or sample. It is the topic about which data is collected, such as the age of
first-year students at a university or the mass of each first-year student. Not
all students are the same age or weigh the same; this will vary from student
to student. That means there is a variation in the weights and ages. If there
were no variability in the weights or ages, statistical inference would not be
necessary. The observed values of the variable are the data you will use in a
statistical investigation.
• Variables can be classified as quantitative or qualitative.
• Qualitative or categorical variables provide information that is non-
numerical, such as marital status, type of job, gender, etc. Qualitative
information can sometimes be coded to make it appear quantitative, but will
have no meaning on a number line.
• Quantitative variables provide numerical measurements of the elements
of a study. Arithmetic operations such as addition and subtraction can be
performed on the values of a quantitative variable.
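
The difference between a parameter and a statistic, as defined in the list above, can be sketched in Python. The population of student ages below is randomly generated and purely hypothetical; the names `mu` and `x_bar` are chosen here to mirror the Greek and Roman letter convention.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 1 000 first-year students.
population = [random.randint(17, 25) for _ in range(1000)]
mu = mean(population)                   # parameter: uses every element

sample = random.sample(population, 50)  # simple random sample of 50 students
x_bar = mean(sample)                    # statistic: uses only the sample

print(f"population mean = {mu:.2f}, sample mean = {x_bar:.2f}")
```

The two values will usually be close but not identical, which is why conclusions based on a sample carry some uncertainty.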

We can further classify quantitative variables as discrete or continuous.


• Discrete variables are countable and can assume a countable number of
values, such as the number of potatoes on a plant. Fractional values can also
occur, but must have distance between them, for example interest rates and
stock prices.
• If you have to measure or weigh to get the value of the variable, it is continuous.
It has an infinite number of possible values that are not countable. For example
mass, length, time taken to complete a task, age, etc. can be measured to any
desired accuracy or number of decimal places within a given range.

Example 1.2

Distinguish between qualitative and quantitative variables:


1. Gender: it is a qualitative variable because it allows a researcher to categorise
the individual as male or female. No arithmetic operations can be performed
with this data.
2. Temperature: it is a quantitative variable because it is numeric and arithmetic
operations such as addition and subtraction provide meaningful results.
3. Postal code: it is qualitative because it indicates a location. Although the
code is in numbers, addition and subtraction of the codes do not provide
meaningful results.
4. Number of drinks at a party for a couple of friends: it is quantitative because
it provides numbers which can be used in arithmetic operations.

Example 1.3

Distinguish between discrete and continuous variables:


1. The number of heads obtained after flipping a coin five times: discrete,
because we can count the number of heads obtained.
2. The number of cars that arrive at a KFC drive-through between 10h00 and
12h00: discrete, because we can count the number of cars.
3. The distances that different model cars with the same tank capacity can
drive in city driving conditions: continuous, because we have to measure the
distances.
4. Temperature: continuous, because we have to measure temperature.

1.4 Measurement
Measurement is the process we use to assign a value to the observations or
elements of a variable. This set of values for a given variable is known as data.


We distinguish between variables that are measured in numbers and those
that are not. These two types of variables are called quantitative variables and
qualitative variables. Some questions, such as ‘How old are you?’, are answered
with a number. Answers to questions such as ‘What is your gender?’ do not
require a number.
There are four levels or scales of measurement, each with its own
characteristics, and from weakest to strongest they are: nominal, ordinal,
interval and ratio. The analysis you carry out depends on the type of scale used
to measure the characteristics of the variable.

1.4.1 Nominal scale


This level, also known as a categorical level, applies to data that consists of
names, labels and categories in no specific order. Numbers can be used as
symbols to represent certain characteristics of the object or person, such as
gender or marital status. For example, your student number may identify you, or
when counting males and females the male group can be assigned the code
01 and the females the code 02. These numbers cannot be added, subtracted,
multiplied or divided. These nominally scaled numbers serve only as a label
for the group and the measurement consists of placing the data in the correct
group. No arithmetic operations can be performed on such numbers other than
counting the groups and the number of elements falling into each group.
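
As a minimal sketch of the only operation that is meaningful on nominal data, the Python snippet below counts how many elements fall into each group. The list of coded responses is hypothetical; the codes 01 and 02 follow the example in the text and are stored as text precisely because no arithmetic should be done on them.

```python
from collections import Counter

# Hypothetical coded responses: "01" = male, "02" = female.
codes = ["01", "02", "02", "01", "02", "02", "01"]
labels = {"01": "male", "02": "female"}

# Place each observation in its group and count the group sizes.
counts = Counter(labels[code] for code in codes)

for group, frequency in counts.items():
    print(f"{group}: {frequency}")
```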

1.4.2 Ordinal scale


The categories into which objects are grouped are ranked in some order using
numbers or symbols. Items can be classified not only as to whether they share
some characteristic with another item but also whether they have more or
less of this characteristic. Differences between data values either cannot be
determined or are meaningless, e.g. income levels such as low, medium or high.
The permissible analysis methods for ordinal data include techniques generally
associated with the order of the observations.
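
One technique based on order is finding the middle (median) category. The sketch below, which uses made-up responses, ranks the income levels low, medium and high and picks the middle observation after sorting.

```python
# Ordered categories from lowest to highest; only the order carries information.
levels = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(levels)}

# Hypothetical income levels reported by seven respondents.
responses = ["medium", "low", "high", "medium", "low", "medium", "high"]

# Sort by rank, then take the middle observation of the odd-sized list.
ordered = sorted(responses, key=rank.__getitem__)
median_level = ordered[len(ordered) // 2]
print(ordered)
print(median_level)  # medium
```

The positions 0, 1 and 2 are used only for sorting. Subtracting them would suggest that the gap between low and medium equals the gap between medium and high, which an ordinal scale does not guarantee.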

1.4.3 Interval scale


This scale applies to data that can be arranged in order. In addition, differences
between data values are meaningful, but ratios of data are not. Temperature is
a classic example of an interval scale: the increase on the Celsius scale between
10 and 20 is the same as the increase between 30 and 40. However, heat cannot
be measured in absolute terms (0 °C does not mean no heat) and it is not possible
to say that 40 °C is twice as hot as 20 °C. Interval-level data does not have an
absolute zero starting point. This sometimes causes difficulties in interpreting
interval-scale data. Arithmetic operations can be performed on the difference
between the numbers, not the numbers themselves.
The following are examples of data at the interval level of measurement:
• calendar dates
• time
• shoe sizes
• sea levels
• Celsius scale temperatures.

1.4.4 Ratio scale


The ratio level of measurement applies to data that can be arranged in order. Both
differences between data values and ratios of data values are meaningful because
a true zero exists. Arithmetic operations can be performed on the numeric
values themselves. Money is an example of the ratio scale of measurement: the
zero point is meaningful – that is, at zero you have none; and R10 is twice as
much as R5.
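
The difference between the interval and ratio scales can be checked with a short calculation. The sketch below assumes the standard conversion from Celsius to Kelvin (add 273.15), which is not part of this unit. It shows that the ratio 40/20 on the Celsius scale is misleading, whereas the ratio of two amounts of money is meaningful.

```python
def celsius_to_kelvin(celsius):
    """Convert to the Kelvin scale, which has a true zero."""
    return celsius + 273.15

# Differences are meaningful on an interval scale: both increases are 10 degrees.
assert (20 - 10) == (40 - 30)

# The ratio of Celsius values depends on the arbitrary zero point.
naive_ratio = 40 / 20
true_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Money has a true zero, so the ratio can be used directly.
money_ratio = 10 / 5

print(naive_ratio, round(true_ratio, 3), money_ratio)
```

Measured from a true zero, 40 °C is only about 1.07 times 20 °C, not twice as much, while R10 really is twice R5.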

Activity 1.1

Categorise these measurements relating to fishing according to level:


1. species of fish in the Vaal Dam
2. cost of rod and reel
3. time of return home
4. rating of fishing area: poor, fair, good
5. number of fish caught
6. temperature of the water.

Activity 1.2

The student council at a university with 10 000 students is interested in the
proportion of students who favour a change in the admission requirements at
the university. Two hundred students are interviewed to determine their attitude
toward this proposed change. Of the 200, 64 (or 32%) are in favour of a change.
The student council announced that less than 35% of all the students are in
favour of a change.
a) What is the question to be answered in this investigation?
b) What is the variable of interest?
c) Classify the variable in terms of the type of data to be collected and the
measurement scale of the data.
d) What is the population of interest?
e) What group of students constitutes the sample in this problem?
f) What is the sample statistic?
g) What is the population parameter?
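
The 32% quoted in the activity is simple arithmetic on the sample counts. As a rough sketch, the figures from the activity can be checked as follows (the variable names are illustrative only).

```python
population_size = 10_000  # students at the university
sample_size = 200         # students interviewed
in_favour = 64            # interviewed students in favour of a change

# Proportion of the interviewed students who favour a change.
sample_proportion = in_favour / sample_size

# Fraction of all students who were interviewed.
sampling_fraction = sample_size / population_size

print(f"{sample_proportion:.0%}")  # 32%
print(f"{sampling_fraction:.0%}")  # 2%
```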

1.5 Role of the computer in statistics


In all aspects of business life we are likely to encounter increasing quantities
of data. Computers and new information technologies literally put data at our
fingertips; for example, stock levels in a warehouse some distance away or share
prices in Japan can be established in minutes.
The Internet can provide access to data across continents at low cost. The
challenge is to organise and analyse this information in such a way that managers
can make sense of it by utilising statistical and quantitative techniques. Facilities
such as spreadsheets or statistical and mathematical software packages make
analysis techniques readily available to everyone. The effective use of such
computer software requires that you are able to interpret the output that can
be generated, not only in a strictly quantitative way but also in assessing its
potential to help in business decision-making.
Computers also provide the opportunity to experiment with and explore data
in ways that would not otherwise be possible.
A computer may be efficiently used in any processing operation that has one
or more of the following characteristics:
• large volume of input
• repetition of projects
• greater speed desired in processing
• greater accuracy
• processing complexities that require electronic help.
A computer can also help you develop your ideas about how to organise the
information by using a ‘try and refine’ approach, which would take too long to
carry out manually.

TEST YOURSELF 1
1. A survey of 100 people is conducted and questions are asked relating to the
following characteristics:
• marital status
• salary
• occupation
• number of hours of television watched per week.
What type of data and measurement scales are applicable?
2. The personnel manager of a business is studying employee morale and uses
a questionnaire to collect data. A typical question on the questionnaire: ‘I
feel that I am performing a valuable service for society when I do my job
well.’ Circle the letter that most closely represents your agreement with the
statement:

Strongly agree    Agree    Undecided    Disagree    Strongly disagree
      A             B          C            D               E

For the data generated by this question, state:


a) the elements to be observed
b) the variable being measured
c) whether the data is quantitative or qualitative
d) the measurement scale that should be used to record the variable.
3. ‘Every week a clerk in a hypermarket records the number of transactions
that occurred that week at each of the checkout tills.’
‘Once an hour a random sample of 100 battery chargers is selected from an
assembly line and the number of defective chargers is recorded.’
For the above two statements:
a) What elements are being observed?
b) Define the variable.
c) What type of data is being used?
d) What is the measurement scale of each data set?
e) For each data set, is the data collected from the population or a sample?
4. Say whether each of the following variables is quantitative or qualitative
and indicate the measurement scale that is appropriate for each:
a) age of a respondent to a consumer survey
b) sex of a respondent to a consumer survey
c) thickness of the gelatine coating of a vitamin E capsule
d) make of motor car owned by a sample of 50 drivers
e) percentage of people in favour of the death penalty in each of the
provinces
f) concentration of a contaminant (µ per cc) in a water sample
g) amount by which a 1 kg package of beef mince decreases in mass
because of moisture lost before purchase
h) length of a 1-year-old mole snake.
5. Based on a study of 2 050 children between 2 and 4 years of age, researchers
concluded that there was an association between iron deficiency and the
length of time that a child is bottle-fed. Describe the sample and the population
of interest for this study. Define the variables and type of data that were used.
6. The leader of a rural community is interested in the proportion of property
owners who support the construction of a sewerage system. Because it is
too difficult to reach all 7 000 property owners, a survey of 500 randomly
selected owners is undertaken. Of the 500 owners, 420 indicated that they
support the sewerage system. The leader of the community announces that
more than 90% of the owners support the construction of a sewerage system.
a) Describe the population and sample for this problem.
b) Define the variable of interest.
c) Describe the type of data that will be needed and the measurement scale
of the variable.
d) Does the announcement of the leader comprise descriptive or inferential
statistics?
e) Is the 90% a statistic or a parameter?
f) Is the 420 out of 500 owners (84%) a statistic or a parameter?
7. All South Africans are involved in at least one form of gardening. This result
shows that gardening is one of the most popular leisure activities. Classify
this study as either descriptive or inferential.
8. A random sample of 200 academic staff members was taken at a university.
Each was asked the following questions:
• What is your rank (lecturer, senior lecturer, professor)?
• What is your annual salary?
• In which faculty (Business, Engineering, Arts) are you employed?
• How many years have you been employed?
Identify the type of data as quantitative or qualitative. If quantitative,
classify as either discrete or continuous. Indicate the measurement scale in
each category.
9. For each of the following examples, determine the type of data to be collected
and the measurement scale:
a) the month of highest sales for each supermarket in a sample
b) the weekly closing price of gold throughout the year
c) country of origin
d) a taste tester’s ranking (best, worst, etc) of four brands of tomato sauce
for a panel of 10 testers
e) the size of soft drink (small, medium, large) ordered by a sample of Big
Burger customers
f) the marks achieved by the students in a Statistics exam in which there
were five questions, each worth 10 marks
g) the grades received by students in a Statistics course (A, B, C, D, E)
h) the answer to the question ‘Do you have season tickets for Ellis Park?’
i) the number on a rugby player’s jersey
j) number of unpopped kernels in a bag of microwave popcorn.
10. For each of the following case studies identify the sample and population:
a) An allergy institution contacted 2 079 teenagers between 13 and 17
years old who live in South Africa, and asked whether or not they used
prescribed medication for any mental disorders such as depression or
anxiety.
b) A farmer wanted to estimate the mass of his soybean crop. He randomly
picked 100 plants and weighed the soybeans on each plant.
c) A quality control manager randomly selects 50 bottles of soft drink
that were filled on a specific day to assess the calibration of the filling
machine.
11. Chemical and manufacturing plants sometimes discharge toxic waste
materials into nearby rivers and streams. These toxins can adversely affect
the plants and animals inhabiting the river and river banks. Researchers
conducted a study of fish in the rivers in the Gauteng area. A total of 124
fish were captured and the following variables were measured for each:
• river where each fish was captured
• species
• length (cm)
• mass (g)
• concentration of toxins (µ per cc).
Classify each of these variables as quantitative or qualitative. If quantitative,
indicate whether it is discrete or continuous. Indicate the measurement
scale of each category.
12. A Mail & Guardian poll of a sample of South Africans revealed that ‘85%
of those surveyed would choose organically grown produce over produce
grown using chemical fertilisers, pesticides and herbicides’. Is the statement
an inferential or descriptive statement? Explain your answer.
13. The owner of a large fleet of taxis is busy with his budget for the next year’s
operations. A major cost is petrol. To estimate the petrol costs he needs the
total distance his taxis will travel in that year, the average cost of petrol
and the average petrol consumption of his taxis. The first two figures are
known by the owner, but to obtain the last one he selected 50 of his taxis
and measured the consumption of each.
a) What is the population of interest?
b) What is the parameter the owner needs?
c) What is the sample?
d) What are the statistics?

UNIT 2 Collection of data

This unit deals with how and where to obtain data that can be used to make
informed decisions. Data collection is the process of collecting, counting and
recording information.
The quality of the final results depends on the quality of the raw material
collected. Researchers have adopted the acronym GIGO – garbage in, garbage
out – to emphasise this fact.

After completion of this unit you will be able to:


• distinguish between primary and secondary data sources
• examine various sources of primary data
• appreciate the art of questionnaire design
• distinguish between probability and non-probability samples
• select a sample
• distinguish between different methods of data collection.

Methods of data collection depend upon:


• the nature of the problem
• the time available
• the money available
• data sources
• the degree of accuracy desired.

2.1 Sources of data: where to get the data
A statistical study may require the collection of new data from scratch, referred
to as primary data, or be able to use already existing data, known as secondary
data. It is also possible to use a combination of both sources.

Secondary data is already available in processed form from sources such as a database, the
Internet, libraries or records kept within your company, and has been collected
for some purpose other than the one you intend to use it for. Data is often collected
through the use of secondary sources because it is available at low cost, but you
need to be sure that you are not using unsuitable data just because it is easily
available. Secondary data can be obtained internally or externally.

Internal data comes from within the organisation for its own use, for example
from accounting records, payrolls, inventories, sales records, etc.

External data is collected from sources outside the organisation, such as
trade publications, consumer price indexes, newspapers, libraries, universities,
official statistics and databases supplied by the Department of Statistics and
other government departments, a Nielsen report on shopping behaviour, stock
exchange reports, data on the unemployment rate supplied by the Department of
Labour, data on HIV/AIDS provided by the Department of Health, or websites on
the Internet.

Primary data is information that researchers collect themselves for their own purpose.
The distinguishing feature of this data is that it will be both reliable and relevant
to your purpose. However, primary data can take a long time to collect and may
be expensive. Sources of primary data include experiments, observation, group
discussions and the use of questionnaires under controlled conditions.
There are multiple methods and tools that can be used to collect data, but you
must decide which method(s) will best answer your research questions.
The four main methods of collecting data are:
• face to face
• by phone
• by post
• via the Internet.
There are advantages and disadvantages to using each of these methods. One
might be better suited to a particular survey than another.

2.2 Primary data sources


You can obtain primary data by:
• conducting an investigation or experiment
• observation
• focus groups
• conducting surveys using questions.

2.2.1 Conducting an experiment


In conducting an experiment you deliberately impose some treatment on
individuals or objects in order to observe the responses. The purpose of an
experiment is to study whether the treatment causes a change in the response.

Example 2.1

To determine if there is any relationship between the hours of TV viewing and
the channel viewed, we selected a random sample of students and told each one
which TV channel to watch over a weekend. Each student recorded the number
of hours of TV watched.

2.2.2 Observations
In an observational survey, collecting data relies on watching or listening very
carefully, and then counting or measuring events as they happen without any
interaction with the individuals or objects. You draw up an observation sheet
and keep count of the observations in a tally table, using a straight vertical line
for each item counted up to 4 (| | | |). The fifth event is recorded as a line drawn
across the first four lines, so that you can easily count the total in multiples
of 5. The variables of interest are not controlled.

Example 2.2

The metro police wanted to determine whether motorists using a certain road
wore seatbelts. They observed whether drivers used seatbelts and counted how
many wore seatbelts and how many did not.

The number of motorists wearing seatbelts between 7 am and 8 am on
27 February 2013

Observation                  Tally (||||/ = 5)            Count
Wearing seatbelts            ||||/ ||||/ ||||/ ||||/ ||   22
Not wearing seatbelts        ||||/ ||||/ ||||/            15
Total number of motorists                                 37
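
The tally procedure can also be sketched in a few lines of Python. The observation list below is not the real observation sheet: it is rebuilt from the totals in Example 2.2, and the notation ||||/ for a completed group of five is an assumption made for plain-text display.

```python
from collections import Counter

def tally_marks(count):
    """Return tally marks, writing each completed group of five as '||||/'."""
    groups, remainder = divmod(count, 5)
    marks = ["||||/"] * groups
    if remainder:
        marks.append("|" * remainder)
    return " ".join(marks)

# Hypothetical observation sheet rebuilt from the totals in Example 2.2.
observations = ["wearing"] * 22 + ["not wearing"] * 15

# Count how often each category was observed and print one tally row per category.
counts = Counter(observations)
for category, count in counts.items():
    print(f"{category:<12} {tally_marks(count):<28} {count}")
print("total", sum(counts.values()))
```

Grouping in fives means the total can be read off as 5 times the number of complete groups plus the loose marks, for example 4 × 5 + 2 = 22.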

2.2.3 Focus groups


A small sample of the target group is selected to learn how respondents talk
about the topic of interest. Group discussions are useful to explore a topic and
stimulate new ideas and creative concepts, providing a broader understanding
of why the target group may behave or think in a particular way, and assist in
determining the reasons for attitudes and beliefs. This may facilitate the design
of questionnaires or other research tools.

2.2.4 Survey by means of asking questions


Surveys are a good way of gathering a large amount of data. You need to decide
what questions will be asked by designing a questionnaire and choosing how
these questions will be put to the people. Record the answers in written form.
These answers form the basis of your statistical analysis.
Some of the most commonly used methods to collect data when conducting a
survey using a well-designed questionnaire are:
• personal interview
• telephone interview
• mail questionnaire
• electronic questionnaire, using the Internet.
During a personal interview the data is obtained verbally and face to face.
Interviewers select candidates randomly from appropriate places, such as a
university campus or shopping centre. This method is popular with companies
conducting market research about specific products. The interviewer must tell
the respondent beforehand how long the interview will take, otherwise the
randomly selected respondent may try to avoid the interview. Interviewers must
be trained to ask questions and record responses, which makes this method more
costly and time-consuming. An advantage of this method is that you can obtain
in-depth responses from respondents, not only by listening to the answer but
also by interpreting their body language. The interviewer can clarify difficult
questions and show visual displays or products to the respondent to provide
better communication and motivation to participate in the survey.
Telephone interviews involve the presentation of the questionnaire by
telephone. Telephone surveys are less costly than personal interviews and can
be conducted over wider geographical areas. People are more open in their
opinions as there is no face-to-face contact. One of the major drawbacks is that
some people in the sample will not have phones or will not be home when you
call them.

In a mail questionnaire respondents are asked to complete and return a
questionnaire which they receive in the mail, in a newspaper or magazine, or
attached to a product. This method can cover a large sample since it is relatively
cheap to administer. Respondents can remain anonymous if they desire and will
therefore be more open and honest in their opinion. Disadvantages of this method
include a low response rate, inappropriate answers to questions, no allowance
for any observations and illiteracy of some people included in the sample.
A new and fast-growing method is the use of Internet-based questionnaires.
An e-mail containing a clickable link is sent out and respondents are asked to
click on the link to take them to a secure website to fill in a questionnaire. This
method is quick and inexpensive but often less detailed. Disadvantages are the
same as for the mail questionnaire, with the added drawback of excluding people
who do not have a computer or are computer illiterate.

Activity 2.1

Rate the survey methods as either 1 – most appropriate, 2 – less appropriate or
3 – least appropriate, under the following circumstances:

Circumstance                     Telephone    Personal     Mail
                                 interview    interview    questionnaire
Large geographical area
Small sample
Difficult questions
Keeping the cost low
Body language
If speed is a factor
Response rate
Illiteracy
Training of interviewers
Confidentiality
Market research for a product

2.3 Questionnaire design


The basis of statistical analysis will be the data obtained in response to questions.
Careful attention must therefore be given to the design of the survey. You now
need to decide on the questions that will be used, and how these questions will
be asked. To test for validity and reliability it is necessary to run a pilot test of the
survey on a sample of your target group to ensure that it is measuring what it
is intended to measure.
A questionnaire needs both a logical structure and well-thought-out questions.
The structure of the questionnaire should ensure that there is a logical flow from
question to question. Any radical jumps between topics will tend to disorientate
the respondent and will influence the answers given.

2.3.1 A questionnaire can be divided into different sections


The sections could include the following:
• administrative: date, name, address, etc
• classification: race, sex, age, marital status, occupation, etc
• subject-matter of inquiry (the questions).

2.3.2 Question wording


When formulating the questions, make every effort to ensure that the wording
meets the following criteria:
• All questions should be appropriate to the research topic.
• Each question should be short and easy to understand.
• Questions should be unbiased (do not lead the respondent to give a particular
answer).
• Questions should not be phrased emotively. Place questions that may evoke
an emotional response near the end of the questionnaire since they may
influence responses that follow.
• Questions should not be offensive or embarrass the respondent.
• Wherever possible, a choice of answers should be given (closed questions).
Make sure that every possible answer is covered. When this is not possible,
adequate space should be given for answers.
• Confidentiality should be assured.

2.3.3 Types of questions we can ask


1. Closed questions give the respondent a series of possible answers from which
one must be chosen. This approach makes it easy to record the required
information and reduces interviewer bias. Examples are:
• yes/no answers
• tick boxes
• numbered responses
• word responses.
2. Open-ended questions will allow respondents to give their own opinions in
their own words and to express any thoughts that they feel are appropriate
to the question. As a result, depending on the nature of the question and
the interest of the respondent, answers may vary a great deal in length and
detail.

Activity 2.2

Identify whether the following are open-ended or closed questions:


1. How do you feel about violence in your neighbourhood?
2. Do you regularly watch soccer on TV? Yes or no.
3. How often do you watch soccer on TV?
• never
• sometimes, but not every week
• one game every week
• two or more games per week.
4. What will you do to improve attendance at your school’s sporting events?
5. How reliable is your calculator?
6. How would you rate the reliability of your calculator?
• Superior
• Very good
• Good
• Poor

2.4 Selecting a sample


Whichever method we choose, we can collect the data from either a population
or a sample.
A census is conducted if we collect data on all the elements of the population.
In the National Census, each household in South Africa
receives a census form to complete, providing information about everybody in
that household. This takes place on a predetermined reference date at least once
every ten years and is carried out by Statistics South Africa.
A sample is taken from a sampling frame, which is a complete list of people or
objects comprising the population.

2.4.1 Advantages of sampling


• Costs are reduced.
• Collection time is reduced.
• Overall accuracy is improved.
• For several types of populations (for example, infinite populations) or testing
procedures that entail the destruction of the item being tested (such as tests
determining the life of a light bulb or the length of time a match will burn),
sampling is the only method of data collection.

2.4.2 Sampling laws


By studying the behaviour of a sample you can get a good idea of the behaviour
of the population from which the sample was drawn. If you summarise and
evaluate the sample data, you can estimate and draw conclusions about the
population parameters based on the sample results or statistics.
To ensure that the sample is representative of the population and that valid
inferences about the population can be drawn from the sample, sampling must
be based on two general laws:
• The Law of Statistical Regularity holds that a reasonably large number
of items selected at random from a large group of items will, on average,
have characteristics representative of the population. It is important that the
selection of the sample is random so that every item in the population has an
equal chance of selection. The size of the sample should be large enough to
minimise the influence of abnormal items on the average.
• The Law of Inertia of Large Numbers holds that large groups of data show
more stability than small ones.
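
Both laws can be explored with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a made-up list of 10 000 values between 0 and 100, and stability is measured as the standard deviation of the means of repeated random samples.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10 000 values spread evenly between 0 and 100.
population = [random.uniform(0, 100) for _ in range(10_000)]
population_mean = statistics.mean(population)

def spread_of_sample_means(sample_size, repetitions=200):
    """Standard deviation of the means of repeated simple random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

# Compare how much the sample mean varies for small and large samples.
small_spread = spread_of_sample_means(10)
large_spread = spread_of_sample_means(1_000)
print(round(population_mean, 1), round(small_spread, 2), round(large_spread, 2))
```

The means of the samples of 1 000 items vary far less from one sample to the next than the means of the samples of 10 items, which is what the Law of Inertia of Large Numbers describes.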

2.4.3 Sampling error


We cannot expect that the sample results will be the same as the population
results (if known). This difference between the sample statistic and the actual
population parameter is known as sampling error. The smaller the sampling
error, the more accurate the estimate for the population parameter.

Factors that have an influence on sampling error


• The sample size: the larger the sample, the more similar the sample statistics
will be to the population parameter.
• The amount of variation among the values in the population: suppose you
want to investigate the amount of pocket money children receive every month.
If these amounts are more or less the same, the variability in the population
is small and a small sample will be sufficient. If the amounts differ a lot, the
variability is greater and a larger sample is needed.
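
The effect of variation in the population on sampling error can be sketched with made-up pocket money figures. In the snippet below, both hypothetical populations are centred on R200 per month, but one is tightly grouped and the other widely spread. The sampling error is taken as the absolute difference between a sample mean and the population mean.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Two hypothetical populations of monthly pocket money (in rand).
similar_amounts = [random.uniform(190, 210) for _ in range(5_000)]
varied_amounts = [random.uniform(0, 400) for _ in range(5_000)]

def typical_sampling_error(population, sample_size, repetitions=300):
    """Average absolute difference between sample mean and population mean."""
    parameter = statistics.mean(population)
    errors = [
        abs(statistics.mean(random.sample(population, sample_size)) - parameter)
        for _ in range(repetitions)
    ]
    return statistics.mean(errors)

# Same sample size for both populations, so only the variation differs.
error_similar = typical_sampling_error(similar_amounts, 25)
error_varied = typical_sampling_error(varied_amounts, 25)
print(round(error_similar, 2), round(error_varied, 2))
```

With the same sample size of 25, the widely spread population gives a much larger typical sampling error, so it would need a larger sample to reach the same accuracy.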

2.4.4 Sample size (n)


In later study units a formula will be applied to determine sample size. For
now, we will briefly look at the factors that influence sample size. The random
selection process allows us to be confident that the resulting sample adequately
reflects the population, even when the sample consists of only a small fraction
of the population.

2.4.5 Sample design


The design of a sample describes the method used to select the sample from the
population.
Sampling design can be divided into two broad categories: those where
elements are selected by some random method and those where the elements
are non-randomly selected.

Types of samples

Non-random sample             Random sample
Convenience sample            Simple random sample
Snowball sample               Cluster sample
Voluntary response sample     Stratified sample
Judgement sample              Systematic sample

2.5 Non-random or non-probability sampling methods
If the sample items are selected using personal convenience, expert judgement
or any type of conscious researcher selection, the sample selection is not done by
chance or any probability-based selection method and is called a non-random
sample. Samples like these often produce unrepresentative data and are not
desirable for use in inferential statistics. Non-random sampling techniques are
sometimes justified for defined purposes, such as to investigate knowledge or
attitudes about something specific. Some techniques that follow non-random
selection of data include convenience sampling, judgement sampling,
voluntary response sampling and snowball sampling.

2.5.1 Convenience sampling


The researcher chooses elements that are readily available, nearby or willing to
participate. It is convenient for the researcher to select the first few sample items
quickly. When both time and money are limited, convenience samples are widely
used. Some examples of these samples are:
• man-in-the-street interviews
• lunch-hour interviews
• interviewing close friends or family
• door-to-door interviews.

2.5.2 Judgement sampling


These samples consist of items deliberately chosen from the population based
on the experience and judgement of the researcher. This method usually results
in making systematic errors in one direction. These systematic errors lead to
what are called biases. For example, four of the most influential economists were
asked to estimate next year’s rate of inflation.

2.5.3 Voluntary response sampling


These samples consist of people who volunteer by responding to a broad appeal,
such as online polls or newspaper questionnaires. People who take the trouble
to respond to an open invitation are usually not representative of any clearly
defined population, because only people with strong opinions are likely to
respond.

2.5.4 Snowball sampling


Sample elements are selected based on referrals from other survey respondents.
The researcher identifies a person who fits the profile wanted for the study. The
researcher then asks this person for the names and locations of others who also fit
this profile. Through these referrals, sample elements can be identified cheaply and
efficiently, which is particularly useful when survey subjects are difficult to locate.

2.6 Random sampling
A random sample is one in which the items chosen are based on chance – the
procedure must be such that every element of the population has the same chance
(or probability) of being selected into the sample. Following such a method is
considered ‘fair’ and free from bias, therefore allowing sample statistics to be
generalised to the whole population from which the sample was taken. Some
basic random sampling techniques are simple random sampling, systematic
random sampling, stratified sampling and cluster sampling.

2.6.1 Simple random sampling


This technique is the basis for the other random sampling techniques. Each
unit of the sampling frame is numbered from 1 to N (where N is the size of the
population), or an ID number is assigned to each element in the population. Keep
in mind that if the sampling frame is too large, this method will be impractical.

Two of the major techniques for drawing a simple random sample are:


• The ‘goldfish bowl’ or ‘lottery’ technique, which is similar to drawing names
from a hat. This method works well with a small sample. Place a numbered
card for each element in the population in a bowl, mix them thoroughly and
select as many cards as needed in the sample. This method is used often in
lottery draws or where the population is small.
• Table of random numbers. Random number tables consist of rows and
columns in which the numbers 0–9 appear. A random number generator is a
computer program that generates these numbers. Any series of numbers read
across or down the table is considered random.
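
A random number generator can play the role of the goldfish bowl. The sketch below is a minimal illustration, assuming a sampling frame numbered from 1 to N with N = 100 and a sample of 10. It uses `random.sample` from Python's standard library, which draws without replacement.

```python
import random

population_size = 100  # N, assumed for illustration
sample_size = 10       # n, assumed for illustration

# Number each unit of the sampling frame from 1 to N.
sampling_frame = list(range(1, population_size + 1))

# Drawing without replacement is like pulling numbered cards from a bowl.
rng = random.Random(2014)  # seeded only to make the sketch repeatable
sample = rng.sample(sampling_frame, sample_size)
print(sorted(sample))
```

Because the draw is without replacement, no unit can appear in the sample twice.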

Example 2.3

Assume that you have 100 employees in a company and you wish to interview
a random sample of 10. Assign every employee a two-digit number from 00 to 99,
so that each element in the population has a unique ID, and then you can
use two digits of each number from the random number table. The first step in
selecting a sample is to decide where in the random table you should start. Use
the random table given below. You can choose to use the first two digits, the
middle two digits or the last two digits. You can even choose which columns
to use. You can make this decision by using the ‘goldfish-bowl’ technique or by
closing your eyes and pointing to a spot in the table. Suppose you have decided
to start in the first column with the first two digits, and the population consists

25

Statistics_Method_BOOK.indb 25 2014/12/18 3:01 PM


Statistical Methods and Calculations Skills

of numbers from 00 to 99. If you reach the bottom of the last column on the
right and are still short of your desired number, go back to the beginning and
start reading the third and fourth digits of each number. According to the table,
employee numbers 70, 23, 20, 22, 53, 39, 48, 64, 12 and 45 will be in the
sample of 10.

Note that if a number occurs more than once, you skip it. You can’t use any
population ID twice because there is a unique ID assigned to each element in the
population.
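The reading procedure in Example 2.3 can be sketched in Python. This is only an illustration: the function name `read_ids` and its parameters are our own choices, and the list holds the first column of the random number table printed in this unit. The last line shows the computer equivalent, which draws 10 distinct IDs directly.

```python
import random

# First column of the random number table in the text, read from top to bottom.
FIRST_COLUMN = ["7081", "2318", "2099", "2293", "5359", "3971", "4833",
                "6492", "1227", "4505", "9515", "9889", "5737"]

def read_ids(entries, sample_size, start=0, width=2):
    """Take `width` digits from each entry, skipping any ID already chosen."""
    chosen = []
    for entry in entries:
        employee_id = int(entry[start:start + width])
        if employee_id not in chosen:      # a repeated ID is skipped
            chosen.append(employee_id)
        if len(chosen) == sample_size:
            break
    return chosen

table_sample = read_ids(FIRST_COLUMN, sample_size=10)
print(table_sample)  # [70, 23, 20, 22, 53, 39, 48, 64, 12, 45]

# Computer equivalent: draw 10 distinct IDs from 00 to 99 without replacement.
computer_sample = random.sample(range(100), k=10)
```

Changing `start` to 2 reads the third and fourth digits of each entry instead of the first two.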

Activity 2.3

Each student at the university has a mailbox on campus. The mailboxes are
numbered from 0000 to 9000. Use the random number table and select 10
mailbox numbers in your sample. Compare your results with some of the results
obtained by other students in the class and comment on your findings.

Random number table

7081 8887 2876 1705 4260 5065 5528 8241 5997
2318 0139 6986 4900 2408 2027 1676 4382 3370
2099 3526 7912 3824 5108 1033 7363 0183 8479
2293 4424 9209 5979 5022 4849 1960 1771 7961
5359 3108 7453 9978 3538 8963 9562 5437 6806
3971 9260 0760 1284 1020 0961 2666 0255 5957
4833 6395 4528 0665 5386 3539 5918 9165 2088
6492 9493 1058 9069 7725 0094 9513 2735 2915
1227 1585 3239 0593 4703 4737 5851 2551 2824
4505 9108 0031 9578 0077 9836 5817 3221 1174
9515 4576 4486 8388 1343 4507 0031 2209 1921
9889 6933 2616 3883 9008 3389 3672 6952 5839
5737 6911 3388 3682 7271 1110 7272 5674 1650


2.6.2 Stratified sampling


Divide the population into non-overlapping groups, or strata, whose members
share the same characteristics. Select a simple random sample from each
stratum. Make sure that each of these groups is represented proportionally in
the sample. For example, if a researcher needs to estimate the average mass of a
large group of people, he or she first divides the group into two strata – male
and female – and then selects a proportional simple random sample from each
stratum.
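Proportional representation can be sketched as follows. The stratum sizes (600 males and 400 females) and the sample size of 50 are hypothetical values chosen for illustration, and `proportional_allocation` is our own helper name. Each stratum receives a share of the sample equal to its share of the population.

```python
import random

def proportional_allocation(stratum_sizes, sample_size):
    """Give each stratum a share of the sample equal to its share of the population."""
    total = sum(stratum_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}

# Hypothetical strata, for illustration only.
strata = {"male": 600, "female": 400}
allocation = proportional_allocation(strata, sample_size=50)
print(allocation)  # {'male': 30, 'female': 20}

# Draw a simple random sample of the allocated size within each stratum
# (members of a stratum are numbered 0, 1, 2, ... here).
samples = {name: random.sample(range(size), k=allocation[name])
           for name, size in strata.items()}
```

With less convenient numbers the rounded shares may not add up exactly to the required sample size, in which case one stratum is adjusted by hand.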

2.6.3 Systematic sampling


Select the starting number (a value between 1 and k) at random and each
successive number systematically from an orderly list of the sampling frame to
obtain the sample. Every kth item is selected to produce a sample of size n from a
population of size N. The value of k can be determined by the following formula:
$$k = \frac{N}{n}$$
For example, a quality controller selects every 100th smart phone of a specific
brand from an assembly line and conducts a quality test.
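A minimal sketch of systematic sampling, assuming a hypothetical population of N = 1 000 items numbered 1 to N and a sample of n = 10, so that k = 100 as in the smart phone example. The function name `systematic_sample` is our own.

```python
import random

def systematic_sample(population_size, sample_size):
    """Return the positions (1-based) of every kth item after a random start."""
    k = population_size // sample_size     # sampling interval k = N / n
    start = random.randint(1, k)           # random starting number between 1 and k
    return [start + i * k for i in range(sample_size)]

positions = systematic_sample(population_size=1000, sample_size=10)
print(positions)  # ten positions spaced 100 apart; the start differs from run to run
```

Only the starting number is random; once it is chosen, every other item in the sample is fixed.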

2.6.4 Cluster sampling


Some populations have non-overlapping areas or groups which within
themselves represent all of the views of the general population, for example a
town, university or a file of invoices. If this is the case, it will be much more
convenient and cost-effective to select one or more of these clusters at random
and then carry out a census within the selected cluster(s). Sometimes the
clusters are too large and a second set of clusters is taken from the originally
chosen clusters. This technique is called multi-stage sampling.
A large geographical area is often divided into more manageable provinces or
clusters. Select a few provinces and then select a few towns from each province.
Out of each town select a few blocks, and out of each block select individual
families at random.
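The multi-stage idea can be sketched in Python. Everything in this sketch is hypothetical: the sampling frame is an invented nested structure of provinces, towns and families, the block stage is left out to keep it short, and the function name `multi_stage_sample` is our own.

```python
import random

# Hypothetical sampling frame: province -> town -> list of family IDs.
frame = {
    f"province_{p}": {
        f"town_{p}_{t}": [f"family_{p}_{t}_{f}" for f in range(20)]
        for t in range(5)
    }
    for p in range(4)
}

def multi_stage_sample(frame, n_provinces, n_towns, n_families, seed=None):
    """Select provinces, then towns within them, then families within the towns."""
    rng = random.Random(seed)
    chosen = []
    for province in rng.sample(sorted(frame), n_provinces):       # stage 1
        towns = frame[province]
        for town in rng.sample(sorted(towns), n_towns):           # stage 2
            chosen.extend(rng.sample(towns[town], n_families))    # stage 3
    return chosen

families = multi_stage_sample(frame, n_provinces=2, n_towns=2, n_families=3, seed=1)
print(len(families))  # 12 families: 2 provinces x 2 towns x 3 families
```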

TEST YOURSELF 2

1. ‘How much do you trust information about health that you find on the
Internet?’ You want to ask a sample of 10 students chosen from your class the
question. Describe how you will select your sample using a random method.
2. You want to select a random sample of 25 of the approximately 371 active
telephone area codes covering South Africa. Explain the method you will
use and select your sample.


3. At a party there are 30 students over the age of 21 and 15 students under
age 21. You want to select a representative sample of five to interview about
attitudes toward alcohol. Explain your method and select your sample.
4. Based on satellite images, a forest area in KwaZulu-Natal is divided into 14
types. The area of each type is divided into large sectors. Choose sectors
of each type at random (18 in total, as shown in the table below) and count
the tree species in a 20 × 25 m rectangle randomly placed within each sector
selected. Explain the method you will use and select the sectors.

Forest type   Total sectors   Sample size
A             36              4
B             72              7
C             31              3
D             42              4

5. You want to choose four addresses at random from a list of 120 addresses.
Use the systematic method and describe how you will obtain your sample.
6. The New Firearm Policy Survey asked respondents’ opinions about
government regulation of firearms. If you are the researcher, and you want
to follow the telephone interview method using the multi-stage cluster
sampling method, how will you go about selecting your sample?
7. In the 1940s the public was greatly concerned about polio. In an attempt
to prevent this disease Jonas Salk of the University of Pittsburgh developed
a polio vaccine. To test the vaccine 1 000 000 children received the Salk
vaccine and another 1 000 000 a placebo, in this case an injection of salt
dissolved in water. Neither the children nor the doctors performing the
diagnoses knew which children belonged to which group, but an evaluation
centre did. The centre found that the incidence of polio was far lower
among children inoculated with the Salk vaccine. From that information
the researchers concluded that the vaccine would be effective in preventing
polio for all school children and made it available for general use.
Is this investigation an observational study or a designed experiment? Justify
your answer. Is the conclusion of the researchers descriptive or inferential?
8. An inspector of the Department of Health obtains all vitamin pills produced in
an hour at the Herbal Supply Company. She thoroughly mixes them and then
scoops a sample of 10 pills that are to be tested for the exact amount of vitamin
content. Does this sampling design result in a random sample? Explain.



UNIT 3 Summarising data using tables and graphs
In this unit we will look at ways to describe data by summarising and displaying
it using tables and graphs so that the salient features of the data set are more
easily understood.

After completion of this unit you will be able to:


• recognise the difference between grouped and ungrouped data
• construct a frequency distribution
• draw graphs based on qualitative and discrete data
• draw graphs based on continuous data
• recognise the usefulness of visual aids in presenting data.

When data is collected the initial result is usually a list of the observations for
each variable. This is referred to as raw data. Raw data has not been processed
and provides little information. Statistics give us some tools or techniques to
organise and summarise the raw data into tables and graphs. Data in this format
is easy to understand because it focuses on the key characteristics only.

The steps to follow in summarising data in tables and graphs


1. Order the data into a logical sequence.
2. Summarise data by arranging it in the form of a table known as a frequency
distribution. A table is a statistical tool used to present data in vertical
columns and horizontal rows according to some classification.
3. Present it in an attractive way, using graphs or diagrams. The choice of
presentation depends on the type and complexity of the data and the
requirements of the user.

A graph shows the relationship between two variables: one will be the x-variable
on the horizontal axis and the other the y-variable on the vertical axis. A graph
does not replace a table, but complements it by showing the data’s general
structure more clearly and revealing trends or relationships that might be
overlooked in a table. It is more likely to get the attention of the casual observer.

3.1 Summarising qualitative data in tables and graphs

3.1.1 Frequency distributions


A frequency distribution is a table that records each category, value or interval
of values that a variable might have and the number of times (frequency) that
each one occurs in the data set.
If you are interested in the proportion (fraction or percentage) of times a
value or category occurs, calculate the relative frequency. If the table includes
relative frequencies, it is referred to as a relative frequency distribution.
$$\text{Relative frequency} = \frac{\text{frequency}}{\text{total frequency}}$$

Steps
1. Draw a column in which each row lists one of the categories for the variable
of interest.
2. Draw a second column to list the corresponding number of times that the
category occurs (frequency f).
3. Add up the frequency column to make sure that the total is the same as the
number of observations.
4. The order for the categories in the frequency table is not important, unless
there is a logical order in the given data set.
5. Interpret the table results.

Example 3.1

Toni’s Supermarket has received many complaints about the condition of
long-life milk in cardboard boxes. Customers are refusing to buy boxes that are
damaged because they don’t know whether the contents are still intact. Since
Toni is fairly sure that the damage is not occurring when the boxes are put on
the shelves, he decides to check the cases as they arrive from the distributor. He
takes a random sample of boxes of milk as they arrive and examines them for
various defects. The sample of 28 boxes provides the following data:


Unsealed    No defect   Dented      Crushed     No defect   Crushed     Unsealed
No defect   Dented      No defect   No defect   Dented      No defect   No defect
Dented      No defect   No defect   Crushed     Crushed     Crushed     No defect
Crushed     No defect   No defect   Dented      No defect   Crushed     No defect

1. Create a frequency table for the data and determine if Toni’s concerns are
justified.
2. Change the frequency distribution to a relative (%) frequency distribution by
dividing each frequency by the total frequencies.

Category    Frequency (f)   %f
Unsealed    2               7
Crushed     7               25
Dented      5               18
No defect   14              50
Total       28              100

Conclusions
1. The table shows that half of the boxes that arrived were damaged, which is
definitely a matter for concern. Only two of the boxes were unsealed and can
be considered as unsafe for use.
2. 50% of the boxes are damaged with half of the damaged boxes crushed.
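The tallying in Example 3.1 can be checked with a short Python sketch. The list below reproduces the 28 observations in the order given; the variable names are our own, and the percentages are rounded to whole numbers as in the table.

```python
from collections import Counter

# The 28 observations from Example 3.1, row by row.
observations = [
    "Unsealed", "No defect", "Dented", "Crushed", "No defect", "Crushed", "Unsealed",
    "No defect", "Dented", "No defect", "No defect", "Dented", "No defect", "No defect",
    "Dented", "No defect", "No defect", "Crushed", "Crushed", "Crushed", "No defect",
    "Crushed", "No defect", "No defect", "Dented", "No defect", "Crushed", "No defect",
]

frequency = Counter(observations)       # number of times each category occurs
total = sum(frequency.values())         # should equal the number of observations

# Relative frequency as a percentage: frequency / total frequency x 100.
percentage = {category: round(100 * f / total) for category, f in frequency.items()}

for category, f in frequency.most_common():
    print(f"{category:<10} {f:>3} {percentage[category]:>4}%")
```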

Activity 3.1

A biokinetics instructor wants to study the different types of rehabilitation
required by her patients. She selects a simple random sample of her patients
and records the body part requiring rehabilitation. The following results are
obtained:

Hand    Back       Ankle   Shoulder   Back   Back
Back    Shoulder   Back    Wrist      Knee   Knee
Neck    Ankle      Hip     Knee       Back   Neck
Wrist   Shoulder   Back    Back       Back   Shoulder
Knee    Back       Back    Knee       Hand   Wrist

Construct a frequency distribution and a relative frequency distribution to
describe the data. Give a short interpretation of your results.

3.1.2 Cross-tabulation
Data resulting from observations made on two different related categorical
variables (bivariate) can be summarised using a table known as a two-way
frequency table or contingency table.
The word ‘contingency’ is used because the table is used to determine if there
is an association between the variables.

Steps
1. This table displays the one variable (x) in the rows and the other variable (y)
in the columns.
2. Each row and column combination in the table is called a cell.
3. The number of times each (x, y) combination occurs in the data set is
recorded and these numbers are entered in the corresponding cells of the
table. These are known as the observed cell counts.
4. Add the observed cell counts in each row and also in each column of the
table to obtain the marginal totals.
5. The grand total is the total of all the observed cell counts in the table.
All the row marginal totals will add up to the grand total. All the column
marginal totals will also add up to the grand total.
6. We use a contingency table if we want to compare two different populations
on the basis of a single categorical variable, or when two categorical variables
are observed in a single sample. For example, data could be collected at a
university to compare students, staff and management on the basis of their
means of transport to campus (taxi, bus, car, train, motorcycle, bicycle or
on foot). This will result in a (3 × 7) two-way frequency table with row
categories of Student, Staff and Management, and column categories
corresponding to the seven possible modes of transport. The observed cell
counts could then be used to gain insight into differences and similarities in
means of transport in the three groups.
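The steps above can be sketched in Python. The seven (group, transport) observations are hypothetical and serve only to show how the observed cell counts, marginal totals and grand total are obtained; the variable names are our own.

```python
from collections import Counter

# Hypothetical (group, transport) observations, for illustration only.
observations = [
    ("Student", "bus"), ("Student", "taxi"), ("Student", "bus"),
    ("Staff", "car"), ("Staff", "bus"),
    ("Management", "car"), ("Management", "car"),
]

cell_counts = Counter(observations)                          # observed cell counts
row_totals = Counter(group for group, _ in observations)     # row marginal totals
column_totals = Counter(mode for _, mode in observations)    # column marginal totals
grand_total = len(observations)

print(cell_counts[("Student", "bus")])   # 2
print(row_totals["Student"])             # 3
print(column_totals["car"])              # 3
```

Both sets of marginal totals add up to the grand total, which is a useful check on the table.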


Activity 3.2

Many people believe that organic foods are healthier than conventionally grown
fruit and vegetables. The local health department carries out an investigation on
a sample of 10 000 food items as part of its regulatory monitoring of foods for
pesticide residues. The following table displays the frequencies of foods for all
possible category combinations of the two variables: food type and pesticide status.

                Pesticides
Food type       Present   Not present   Total
Organic              28            99     127
Conventional      9 085           788   9 873
Total             9 113           887  10 000

Briefly comment on these results.

Activity 3.3

One hundred students majoring in Sciences were classified according to gender
and year of study. Ten were first-year women, 20 were senior women, 40 were
first-year men and 30 were senior men. Arrange the data in a contingency table.
Briefly comment on your result.

3.1.3 Bar graph for a single data set


A bar graph is a quick and easy way of showing variation in or between variables.
It is made up of a series of bars arranged either vertically or horizontally.
One of the axes is used to represent the categories in the frequency table and
the other axis is used to represent the frequencies or relative frequencies. A
single bar is drawn for each category.
Assignment of axes is a matter of preference, but for the purpose of uniformity
we will use the horizontal x-axis to represent the categories.

Steps
1. Communicate only a single idea or variable.
2. Draw a pair of axes, x and y.
3. Label the axes and give the graph a title.
4. At evenly spaced intervals on the x-axis put tick marks and label them with
the categories from the frequency table.
5. Scale the y-axis so that it can accommodate the category with the highest
frequency or relative frequency. Whenever you use a change of scale in a
graph, indicate it by using a squiggle or //.
6. At each category on the x-axis draw a bar with its length equal to the
frequency or relative frequency for the category it represents.
• The bars must all have the same width.
• Make the bars reasonably wide so that they can be clearly seen.
• The gaps between the bars must have the same width – the bars should
not touch each other.
7. This type of graph not only illustrates a general trend but also allows a
quick and accurate comparison of one period with another or illustrates a
situation at a particular point in time.
8. If you arrange the bars in descending order, the graph is called a Pareto Chart.
By arranging bars in order of frequencies, attention is drawn to the more
important categories.

Example 3.2

In a recent study done on a random sample of 75 teenage boys the following
data was collected:

Fruit servings per day   Number of boys   % of boys
0                        20               27
1                        15               20
2                        15               20
3                        12               16
4                        8                11
5                        5                6

1. Display the data in a bar graph.
2. If the number of recommended servings per day is at least three, what
percentage of the boys ate fewer than three servings per day?

[Figure: bar graph of the number of boys (vertical axis) against the number of fruit servings per day (horizontal axis)]

Conclusion: 67% of the boys ate fewer than the recommended number of servings.
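The conclusion can be verified with a few lines of Python, which also print a rough text version of the bar graph. This is a sketch for checking the arithmetic only; the variable names are our own.

```python
# Fruit servings per day -> number of boys (from the table in Example 3.2).
servings = {0: 20, 1: 15, 2: 15, 3: 12, 4: 8, 5: 5}

total = sum(servings.values())                               # 75 boys
below = sum(f for s, f in servings.items() if s < 3)         # boys eating fewer than three
percent_below = round(100 * below / total)

# A rough text bar graph: one '#' per boy.
for s, f in servings.items():
    print(f"{s} | {'#' * f} {f}")

print(f"{percent_below}% ate fewer than three servings")     # 67%
```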

Activity 3.4

Draw a simple bar chart showing the ages of employees and draw conclusions
from your results.

Age   Number of employees
20    11
21    4
22    8
23    6
24    5

3.1.4 Comparative bar graphs


To compare two or more data sets, bars are grouped together in each category
(multiple bar graphs) or stacked for each category. Use the relative frequency
rather than the frequency on the vertical axis to enable you to make meaningful
comparisons even if the sample sizes are not the same. The use of a key will help
distinguish between the categories.

Multiple bar graph

Steps
1. Draw a pair of axes, x and y.
2. Label the axes and give the graph a title.
3. At evenly spaced intervals on the x-axis put tick marks and label them with
the categories from the frequency table.
4. Determine the relative frequency of each category if needed.
5. Scale the y-axis so that it can accommodate the category with the highest
frequency or relative frequency.
6. At each category on the x-axis, group the bars for the different data sets
together and draw rectangles with heights equal to the relative frequency
for the data set each represents.
7. Use a key or label to distinguish between the different data sets.
8. Interpret your graph.

Example 3.3

The contingency table below summarises the responses of two different groups
to their perceived risk of smoking. Portray the data using a multiple bar graph to
determine whether smokers and former smokers perceive the risks of smoking
differently.

Risk of smoking      Smokers        Former smokers
                     f      %f      f      %f
Very harmful         145    65      204    81
Not too harmful      79     35      47     19
Total                224    100     251    100

[Figure: multiple bar graph of %f (vertical axis, 0 to 100) for the categories ‘Very harmful’ and ‘Not too harmful’, with separate bars for smokers and former smokers]

The graph shows that the proportion of former smokers who believe that
smoking is very harmful is larger than the proportion of smokers who believe
that smoking is very harmful. In other words, smokers are less likely to believe
that smoking is very harmful than former smokers.
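The %f columns of Example 3.3 are obtained by dividing each frequency by the total of its own group. A minimal Python sketch of that calculation follows; the dictionary layout and names are our own.

```python
# Frequencies from Example 3.3, arranged by group (column of the table).
counts = {
    "Smokers": {"Very harmful": 145, "Not too harmful": 79},
    "Former smokers": {"Very harmful": 204, "Not too harmful": 47},
}

# Relative frequency (%) within each group: frequency / group total x 100.
percentages = {}
for group, responses in counts.items():
    group_total = sum(responses.values())
    percentages[group] = {response: round(100 * f / group_total)
                          for response, f in responses.items()}

print(percentages["Smokers"])          # {'Very harmful': 65, 'Not too harmful': 35}
print(percentages["Former smokers"])   # {'Very harmful': 81, 'Not too harmful': 19}
```

Using percentages rather than raw frequencies makes the two groups comparable even though their sizes (224 and 251) differ.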


Activity 3.5

Draw a multiple bar chart showing the ages of male and female employees and
comment on your result.

Age   Male   Female
20    3      8
21    1      3
22    4      4
23    2      4
24    1      4

Segmented or stacked bar chart


This bar chart is particularly useful if you want to emphasise the relative
proportions of each component that makes up the category.

Steps
1. Draw a single bar for each category, with the height of the bar representing
the total of each category.
2. Subdivide each bar to show the components that make up each category.
3. Identify the components involved by colouring or fill effects, accompanied
by an explanatory key to show what each colour or fill effect represents.
4. Interpret your results.
5. If the components are converted to percentages of the total of each
category, the bars are divided in proportion to these percentages. The scale
is a percentage scale and the height of each bar is then 100%. This is known
as a percentage component bar graph.

Example 3.4

The contingency table below summarises the responses of two different groups to
their perceived risk of smoking. Portray the data using a percentage component
bar graph to determine whether smokers and former smokers perceive the risks
of smoking differently.


Risk of smoking      Smokers        Former smokers   Total
                     f      %f      f      %f
Very harmful         145    42      204    58        349
Not too harmful      79     63      47     37        126

[Figure: percentage component bar graph of %f (vertical axis, 0% to 100%) for ‘Very harmful’ and ‘Not too harmful’, each bar divided into smokers and former smokers]

In comparing the two columns on the percentage component bar chart we
can conclude that the proportion of smokers who believe that smoking is very
harmful is smaller than the proportion of former smokers who believe that it is
very harmful. A larger proportion of smokers believe that smoking is not too
harmful than the proportion of former smokers who believe that it is not too
harmful.
[Figure: stacked bar chart of frequency f (vertical axis, 0 to 400) for ‘Very harmful’ and ‘Not too harmful’, each bar divided into smokers and former smokers]

From the stacked bar chart you can conclude that there are more smokers and
former smokers who believe that smoking is very harmful than those who believe
it is not too harmful.
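In a percentage component bar graph each bar is scaled to 100%, so the components are expressed as percentages of the total of their own category. The sketch below reproduces the %f values in the table of Example 3.4; the names are our own.

```python
# Frequencies from Example 3.4, arranged by category (one bar per category).
counts = {
    "Very harmful": {"Smokers": 145, "Former smokers": 204},
    "Not too harmful": {"Smokers": 79, "Former smokers": 47},
}

component_percentages = {}
for response, groups in counts.items():
    bar_total = sum(groups.values())          # height of the stacked bar (349 or 126)
    component_percentages[response] = {
        group: round(100 * f / bar_total) for group, f in groups.items()
    }

print(component_percentages["Very harmful"])      # {'Smokers': 42, 'Former smokers': 58}
print(component_percentages["Not too harmful"])   # {'Smokers': 63, 'Former smokers': 37}
```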

Activity 3.6

Draw a stacked bar chart showing the ages of male and female employees and
comment on your result.


Age   Male   Female
20    3      8
21    1      3
22    4      4
23    2      4
24    1      4

3.1.5 Pie chart


A circle graph or pie chart represents the data set in the form of a circle divided
into ‘slices’ representing the possible categories. This allows a quick overall view
of the relative sizes of the categories, but offers little potential for comparison.
Pie charts are most effective for relatively simple representations and summarising
data sets when there are not too many categories.

Steps
1. Draw a circle to represent the entire data set.
2. Keep the categories to 10 or fewer.
3. For each category calculate the ‘slice’ size.
4. A circle has 360° and ‘slice’ sizes are calculated as a proportion of 360°.
‘Slices’ are drawn by making use of a protractor.
5. Put any labelling outside the circle.
6. Look for categories that form large and small proportions of the data set
when interpreting the chart.

Example 3.5

A random sample of 2 000 shoppers was asked why they were visiting a shopping
centre on a specific day.

Purpose      Number of shoppers   Relative frequency   Angle (°)
Groceries    790                  0.395                142
Clothing     570                  0.285                103
DIY          580                  0.29                 104
Other        60                   0.03                 11
Total        2 000                1                    360

[Figure: a pie chart showing the main purpose of shopping, with slices for Groceries, Clothing, DIY and Other]

The largest group of shoppers on that specific day (about 40%) wanted to buy
groceries. Almost equal proportions wanted to buy clothing or DIY items and
only a few people were there for other purposes.
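The slice sizes in Example 3.5 follow from taking each category's share of 360°. A short Python sketch of the calculation, with names of our own choosing:

```python
# Number of shoppers per purpose (from Example 3.5).
shoppers = {"Groceries": 790, "Clothing": 570, "DIY": 580, "Other": 60}

total = sum(shoppers.values())                      # 2 000 shoppers

# Slice size in degrees: (frequency / total) x 360, rounded to a whole degree.
angles = {purpose: round(360 * f / total) for purpose, f in shoppers.items()}

print(angles)                 # {'Groceries': 142, 'Clothing': 103, 'DIY': 104, 'Other': 11}
print(sum(angles.values()))   # 360
```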

Activity 3.7

Here is how you might divide up your day:

Travelling   Working   Eating   Sleeping   Other   Social life
10%          30%       10%      28%        7%      15%

Draw a pie chart to portray the data and comment on the results.

3.1.6 Pictograms
Pictograms are small symbols or simplified pictures that represent data.

Steps
1. Give the pictogram a title.
2. Choose a simple symbol or picture that is easy to draw.
3. The quantity that each symbol represents should be given.
4. It is important that the symbols are all the same size. It is possible to use half
a picture to represent half the quantity.
5. Draw the symbols neatly and professionally.


[Pictogram: the number of telephone calls received in January, February and March (1 unit = 100 calls)]

3.2  Summarising quantitative data in tables

3.2.1 The ordered array of data


If there are not too many observations, we can use the collected data in its raw
form, known as ungrouped data.
A first step in organising ungrouped data is to arrange the data in an array,
that is, to sort the data in numerical order from smallest to largest. By looking
at an ordered array you can get a feel for the spread of the data. Data must
be in order for a variety of statistical procedures, such as finding the median,
percentiles or quartiles.

Example 3.6

Arrange the following data in an array:


4 80 50 10 5
Array: 4 5 10 50 80

Activity 3.8

Arrange the following numbers in an array:


67 23 56 45 56 41 34 33 0 18 23

3.2.2 Dot plot


This method can be used for relatively small data sets (usually not more than 20
observations) and portrays individual observations.

Steps
1. Construct a single horizontal axis and label it with the name of the variable.
2. Mark the axis with an appropriate measurement scale to fit the smallest as
well as the largest value in the data set.
3. For each observation, place a dot above its value on the number line.
4. If there are two or more observations with the same value, stack the dots
vertically.
5. The number of dots above a value on the number line represents the
frequency of occurrence of that value.

Example 3.7

The purpose of a study is to investigate how much sugar and how much sodium
(the main ingredient of salt) is in breakfast cereals. The following table lists 15
popular cereals and the amounts of sodium and sugar contained in a single
serving of 180 ml.

Cereal   Sodium (mg)   Sugar (g)      Cereal   Sodium (mg)   Sugar (g)
A        290           2              I        250           10
B        200           3              J        125           14
C        230           3              K        220           3
D        125           13             L        0             7
E        260           5              M        220           12
F        200           11             N        170           3
G        210           12             O        140           10
H        140           10

Construct a dot plot for the sodium values of the breakfast cereals.

[Figure: dot plot of the sodium values (mg) on a horizontal axis running from 0 to 300]

What does the dot plot tell us about the data?

The dot plot gives us an overview of all the data. We see that the sodium values
fall between 0 and 290 mg, with most cereals falling between 125 and 250 mg.
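A rough text version of the dot plot can be produced in Python. This sketch lists the values down the page rather than along a horizontal axis, and it shows only the values that occur; the names are our own.

```python
from collections import Counter

# Sodium (mg) per serving for cereals A to O, from the table in Example 3.7.
sodium = [290, 200, 230, 125, 260, 200, 210, 140, 250, 125, 220, 0, 220, 170, 140]

dots = Counter(sodium)     # number of dots to stack above each value

for value in sorted(dots):
    print(f"{value:>3} mg | {'•' * dots[value]}")

print(min(sodium), max(sodium))   # 0 290
```

Values such as 125, 140, 200 and 220 mg each receive two stacked dots because two cereals share that sodium content.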


Activity 3.9

1. Construct a dot plot for the sugar values of the breakfast cereals.
2. What does the dot plot tell us about the data?

3.2.3 Stem-and-leaf plot


This graph portrays the individual observations and provides a fast procedure
for arranging data in order and showing the shape simultaneously. Use it for
data sets with a small to moderate number of observations.
An advantage of this method is that all the information in the original data
list is shown and, if necessary, we could reconstruct the original list of values.
The stem-and-leaf plot represents data by separating each value into two
parts: the stem and the leaf.
The stems are the leading digit or digits and are displayed in a vertical position
on the left-hand side of a vertical line. Usually the stem consists of all the digits
except for the final one, which is the leaf or trailing digit. To display the value 76
in this format, the 7 will be the stem and the 6 will be the leaf.

Stem Leaf
7 6
Units of measure: Stem: tens
Leaf: ones

Steps
1. Select one or more leading digit(s) for the stem values. You can choose the
digits to serve as the stem, but keep them constant for all the stems.
2. Find the smallest number and the largest number in the distribution of
numbers. These will give the first stem and the last stem.
3. List all possible stems in increasing order to the left of the line.
4. The trailing digit(s) become the leaves.
5. Record the leaf for every observation beside the corresponding stem value.
6. Place the leaves with the same stem on the same row as the stem.
7. Arrange the leaves in each row from lowest to highest to form a
stem-and-leaf plot.
8. Use a label to indicate the units for stems and leaves in the display.
9. Count the number of leaves per row and enter the answer in a column next
to the display. That is the frequency of each row.
10. The display conveys information about:
• a representative or typical value in the data set
• the extent to which the data values are spread out
• the presence of any gaps in the data
• the extent of symmetry in the distribution of values
• the number and location of peaks
• the presence of unusual values (outliers) in the data set.
11. When the stem has many leaves it does not clearly portray where the data
falls. In this case it is useful to split each stem in two: putting leaves from 0
to 4 on the first stem and from 5 to 9 on the second stem.
12. To make a stem-and-leaf plot more compact we can remove the last digit. For
example, 0.311, 370 and 125 will become 0.31, 37 and 12. Just remember
to indicate the correct unit for the leaves: for instance, in the case of 125, if
the 5 falls away, the stem will be 1 with unit hundred and the leaf will be 2
with unit ten.

Example 3.8

Construct a stem-and-leaf plot for the test marks obtained by a sample of 20
students.

78 82 96 74 52 68 82 78 74 76
88 62 66 76 76 84 95 91 58 86
1. The smallest number is 52 and the largest number is 96. Use the first digit
(the tens) in each number as the stem and the last digit (the units) as the leaf.

Stem Leaf
5
6
7
8
9


2. Place each leaf on its stem by placing the trailing digit of each data value on
the right side of the vertical line opposite its corresponding leading digit (stem).
The first value is 78 with 7 the stem and 8 the leaf. Thus, we place 8 opposite
the stem 7.

Stem Leaf
5 28
6 826
7 8484666
8 22846
9 651

3. Order the trailing digits (leaves) in each row from lowest to highest to form a
stem-and-leaf plot.

Stem Leaf
5 28
6 268
7 4466688
8 22468
9 156

4. To focus on the shape indicated by the stem-and-leaf plot use a rectangle to
contain the leaves of each stem and rotate the page onto its side. A picture
similar to a histogram is seen.
The general shape is almost symmetrical around the seventies and the
majority of the students obtained marks of 70 and above.
5. Count the number of leaves per row and enter the answer in a column next
to the display. That is the frequency of each row.

Stem   Leaf      Frequency
5      28        2
6      268       3
7      4466688   7
8      22468     5
9      156       3
                 20
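The construction in Example 3.8 can be sketched in Python. The tens digit is used as the stem and the units digit as the leaf, as in the example; the variable names are our own.

```python
from collections import defaultdict

# Test marks of the 20 students in Example 3.8.
marks = [78, 82, 96, 74, 52, 68, 82, 78, 74, 76,
         88, 62, 66, 76, 76, 84, 95, 91, 58, 86]

stems = defaultdict(list)
for mark in sorted(marks):            # sorting first puts the leaves in order
    stem, leaf = divmod(mark, 10)     # stem = tens digit, leaf = units digit
    stems[stem].append(leaf)

# List every possible stem between the smallest and the largest.
for stem in range(min(stems), max(stems) + 1):
    leaves = stems.get(stem, [])
    print(f"{stem} | {''.join(map(str, leaves)):<8} {len(leaves)}")
```

The last number printed on each row is the frequency of that stem, and the frequencies add up to the 20 observations.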


Activity 3.10

The following is an array of the daily litres of used sunflower oil bought by a
biodiesel plant. Construct a stem-and-leaf plot for the data.
58 63 69 69 70 71 71 72 72 72
73 73 74 75 77 79 80 62 84 84
85 88 91 91 91 94 96 97 99 100

3.2.4 Frequency distribution tables


The frequency table condenses raw data into a more manageable form that will
increase our ability to detect pattern and meaning. This is done by keeping count
of how many times a particular value occurs, known as the frequency.

Ungrouped frequency distribution


To demonstrate the concept of a frequency distribution we use a set of quantitative
data and arrange it in an ungrouped frequency distribution – ‘ungrouped’
because each x-value in the distribution stands alone.

Example 3.9

15 6 14 15 4 15 17 6 18 15
An array of the x-values and the number of times each one occurs (frequency f ):

Value (x)   Frequency (f)
4           1
6           2
14          1
15          4
17          1
18          1

The value 15 occurred four times; therefore it has a frequency of 4.
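For numerical data the same tally gives an ungrouped frequency distribution once the distinct values are listed in order. A brief Python sketch using the data of Example 3.9, with names of our own choosing:

```python
from collections import Counter

values = [15, 6, 14, 15, 4, 15, 17, 6, 18, 15]   # data from Example 3.9

frequency = Counter(values)

# List each distinct x-value in increasing order with its frequency f.
distribution = [(x, frequency[x]) for x in sorted(frequency)]
print(distribution)   # [(4, 1), (6, 2), (14, 1), (15, 4), (17, 1), (18, 1)]

# Check: the frequencies must add up to the number of observations.
print(sum(f for _, f in distribution) == len(values))   # True
```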

Activity 3.11

1. From Example 3.9 above:
a) Which values occurred only once?
b) Which value occurs most often?
c) How many values are in the distribution? Count the number of values in
the given data set and compare it with the total of all the frequencies.
2. Form an ungrouped frequency distribution of the following data and
comment on the frequency of each value:
1 2 1 4 0 2 0 1 4 1 6

Grouped frequency distribution


If the number and range of observed values are relatively large, you will have a
fairly lengthy list of data, which is not easy to interpret. It is then necessary to
summarise the data in a grouped frequency distribution by grouping adjacent
x-values into intervals, known as classes. In summarising the values like this
we lose the detail of individual values, but the data become much easier to read
and understand.
A grouped frequency distribution is a summary of numerical data grouped into several
non-overlapping class intervals, showing the number of observations (frequency) in
each interval.
Data organised into a frequency distribution using class intervals is called
grouped data.
Although there are no absolute rules for constructing a frequency table, you
can apply some guidelines to help you.

Construction of a frequency distribution

Steps
1. Determine the range of the given ungrouped or ‘raw’ data. The range (R) is
the difference between the largest and smallest values in the data set.
2. Determine the number of class intervals (K). Frequency tables should
contain between five and 20 classes. As a guideline, the number of classes
(K) should be approximately equal to the square root of the sample size, n.

$K = \sqrt{\text{number of observations}} = \sqrt{n}$
Round the answer up to the next whole number.
3. Determine the width (c) of the class interval, which is the range divided by
the number of classes.


$c = \frac{R}{K}$
This answer should be rounded to a whole number (either up or down) or to
the same number of decimals as the raw data.
4. Test: the number of intervals multiplied by the width must always be larger
than the range.
(K × c > R)
5. Choose the lower and upper class boundaries of each interval to indicate
the smallest and largest data values that will fall into each class. The classes
must span the entire data set and must not overlap.
Begin by choosing a number for the lower boundary of the first class. Choose
either the lowest data value or a convenient value that is a little smaller. Add
the class width (c) to this value to get the second lower class boundary. Add the
class width to the second lower class boundary to get the third, and so on. List
the lower class boundaries in a vertical column. The upper class boundary of
the first interval is the same as the lower class boundary of the second interval.
The last class ends at a value more than the highest number in the range.
6. Sort the raw data into the classes by making use of the tally method. The tally
method is a method of counting data that falls into each interval. Examine
each data value and determine which class contains the data value. Make a
tally mark or vertical stroke beside that class. For ease of counting, each fifth
tally mark of a class is drawn across the prior four marks (shown here as ||||/)
rather than as a fifth vertical stroke (|||||). Observations that fall exactly on the lower class boundary stay
in that interval; observations that fall exactly on the upper class boundary
go into the next higher class interval. A class contains all observations
from the lower boundary of the class up to but not including the
upper boundary.
7. Count the number of tallies (observations) in each class to obtain the
frequency (f ) for each class.
8. The sum of the frequencies for all class intervals must equal the number of
original data values.
9. It is possible to come to some conclusions, such as: in which class do you find
the majority of the values or the least number of values?

Notes
1. The number of classes should be small enough to provide an effective
summary, but large enough to display the relevant characteristics of the data.


2. Class boundaries must be selected in such a way that the smallest value is
included in the first interval and the largest value in the last interval.
3. Avoid overlapping of intervals so that an observation falls in one class only.
4. The width of all classes should be equal.
5. Open-ended class intervals should be avoided, although they may be useful
when a few values are extremely large or small in comparison with the rest
of the values.
6. Class intervals with a frequency of 0 should be avoided.

Example 3.10

Research shows that a possible cancer-causing substance, acrylamide (AA),
forms in high-carbohydrate foods cooked at high temperatures and that the
AA levels can vary widely even within the same brand of food. The researchers
analysed Big Mac’s potato fries sampled from different franchises and found the
following AA levels:

366 155 326 187 245 270 319 223 212 190
193 247 255 235 300 311 180 333 289 245
328 201 260 259 263 313 151 322 270 299

Construct a frequency distribution for the AA levels.


• Range: R = 366 − 151 = 215
• Number of intervals: $K = \sqrt{30} = 5.48 \approx 6$
• Width of interval: c = 215 ÷ 6 = 35.83 ≈ 36
If you choose the width as 37, the test gives K × c = 6 × 37 = 222.
If you choose the width as 36, the test gives K × c = 6 × 36 = 216, which is greater
than the range of 215 and much closer to it than 222.

Class intervals   Tally        Frequency (f)
151 – <187        |||          3
187 – <223        ||||/        5
223 – <259        ||||/ |      6
259 – <295        ||||/ |      6
295 – <331        ||||/ |||    8
331 – <367        ||           2
Total                          30

(||||/ denotes four strokes with the fifth stroke drawn across them.)


The sample from 30 franchises has been counted into six classes, with a
width of 36 each. For example, 151 up to just under 187 is the first class
interval, the two numbers 151 and 187 are the class boundaries and 3 (the
number of franchises) is the frequency of that class. This means that in three
of the franchises the AA levels in the French fries were between 151 and just
under 187.
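The construction steps can be sketched in code as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `grouped_frequency`, its optional `start` and `width` parameters, and the default of rounding R ÷ K to a whole number are all choices made for this sketch, which suits whole-number data such as the AA levels.

```python
import math

def grouped_frequency(data, start=None, width=None):
    """Build (classes, frequencies) following the construction steps."""
    r = max(data) - min(data)                      # step 1: range R
    k = math.ceil(math.sqrt(len(data)))            # step 2: K = sqrt(n), rounded up
    c = round(r / k) if width is None else width   # step 3: class width c
    low = min(data) if start is None else start
    # Steps 4 and 5: the K classes must span the entire data set.
    if low > min(data) or low + k * c <= max(data):
        raise ValueError("classes do not span the data; adjust start or width")
    freqs = [0] * k
    for x in data:                                 # steps 6 and 7: count per class
        # Lower boundary is included, upper boundary is excluded.
        freqs[int((x - low) // c)] += 1
    classes = [(low + i * c, low + (i + 1) * c) for i in range(k)]
    return classes, freqs

# AA levels from Example 3.10.
aa_levels = [366, 155, 326, 187, 245, 270, 319, 223, 212, 190,
             193, 247, 255, 235, 300, 311, 180, 333, 289, 245,
             328, 201, 260, 259, 263, 313, 151, 322, 270, 299]

classes, freqs = grouped_frequency(aa_levels)
assert sum(freqs) == len(aa_levels)                # step 8: totals must agree
for (lo, hi), f in zip(classes, freqs):
    print(f"{lo} - <{hi}: {f}")
```

With the default settings the sketch reproduces the six classes of width 36 and the frequencies 3, 5, 6, 6, 8 and 2 shown in the example.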

Note: The Greek capital letter sigma ($\Sigma$) stands for ‘sum the appropriate values’.
Thus we write $x_1 + x_2 + x_3 + x_4 + \dots + x_n$ as

$$\sum_{i=1}^{n} x_i$$

This means the sum of all the x values from the first to the nth. This index system
must be used whenever only part of the available information is to be used. In statistics,
however, we usually use all the available information and the notation will be
adjusted by doing away with the index system:

$$\sum_{i=1}^{n} x_i = \Sigma x$$

Activity 3.12

Look again at the data in example 3.10:


1. The number of franchises with AA levels between 259 and 295 is . . .
2. The frequency for the class with AA levels between 187 and 223 is . . .
3. The upper boundary of the first class is . . .
4. The lower boundary of the third class is . . .
5. The total number of observations in the data set is . . .

Activity 3.13

A study was recently carried out to determine the amount of time that non-
secretarial office staff spend using computer terminals. The study involved
50 staff and the times spent using computers, in hours per week, were as
follows:


1.2 4.8 10.3 7.0 13.1 16.0 12.7 0.5 5.1 2.2
8.2 0.7 9.0 7.8 2.2 1.8 5.2 14.1 5.5 13.6
12.2 12.5 12.8 13.5 2.5 5.0 15.5 2.5 3.9 6.5
4.2 8.8 7.5 14.4 10.8 16.5 2.8 9.5 17.0 10.5
12.5 10.5 16.0 14.9 0.3 11.6 12.8 17.7 18.0 22.0

Construct a frequency distribution for the data.

Relative frequency distribution


When the proportion of observations in each class interval instead of the
actual number of observations is recorded, the distribution is known as a
relative frequency distribution. Relative frequency distributions are useful
for comparing two data sets, especially when the sample sizes or measurement
scales differ substantially. A relative frequency of a class is the observed
frequency of the class divided by the total number of observations in the data
set. If a percentage is required, multiply the result by 100.

Class midpoint or class mark


The class mark or midpoint (x) divides a class interval into two equal parts
and is obtained by adding the upper and lower boundaries of each class interval
and dividing the result by two. This middle value represents the class interval in
calculations.

Cumulative frequency distribution


Knowledge of the number of observations that lie below or above a certain value
is often desired. A cumulative ‘less than’ frequency for a class is the sum of
the frequencies for that class and all previous classes. We read it as the total of
all the frequencies less than the upper boundary of each interval. A cumulative
relative frequency distribution is a ratio calculated by dividing a cumulative
frequency of a class by the total number of observations in the data set. If a
percentage is required, multiply the result by 100.

Example 3.11

The following is a frequency table showing the AA levels in the potato fries from
a sample of Big Mac’s outlets:


Class intervals   Frequency (f)   %f    x     cum < f   % cum < f
151 – <187        3               10    169   3         10
187 – <223        5               17    205   8         27
223 – <259        6               20    241   14        47
259 – <295        6               20    277   20        67
295 – <331        8               27    313   28        93
331 – <367        2               7     349   30        100
Total             30              100

Interpreting interval 2: 17% of the outlets have AA levels in the potato fries
of between 187 and 223. Eight of the outlets have AA levels of less than 223
representing 27% of the outlets.
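The extra columns of such a table follow directly from the class boundaries and frequencies. The sketch below is illustrative only; the variable names are assumptions, and the percentages are rounded to whole numbers when printed.

```python
from itertools import accumulate

# Class boundaries and frequencies from the AA levels table.
classes = [(151, 187), (187, 223), (223, 259), (259, 295), (295, 331), (331, 367)]
freqs = [3, 5, 6, 6, 8, 2]
n = sum(freqs)                                     # total number of observations

midpoints = [(lo + hi) / 2 for lo, hi in classes]  # class mark x
rel_pct = [100 * f / n for f in freqs]             # relative frequency, %f
cum = list(accumulate(freqs))                      # cumulative 'less than' frequency
cum_pct = [100 * c / n for c in cum]               # cumulative relative frequency, %

for (lo, hi), f, p, x, c, cp in zip(classes, freqs, rel_pct, midpoints, cum, cum_pct):
    print(f"{lo} - <{hi}  f={f}  %f={p:.0f}  x={x:.0f}  cum={c}  %cum={cp:.0f}")
```

Because each percentage is rounded separately, the printed %f values need not add up to exactly 100.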

Activity 3.14

Use your frequency table from activity 3.13 and construct a relative frequency
distribution, a cumulative frequency distribution, a relative cumulative
frequency distribution and the class midpoints.

3.3  Summarising quantitative data using graphs

3.3.1 The histogram and relative histogram


A histogram is a continuous series of rectangles of equal width but different
heights drawn to display class frequencies.

Steps
1. Mark the class boundaries on the x-axis. The class intervals are equal in
width; therefore the points must be equidistant from one another.
2. Use either f or % f on the y-axis. A proper scale showing the true zero must
be used on the y-axis in order not to misrepresent the character of the data.
3. Whenever the zero point on the horizontal axis is not in its usual position
at the intersection of the horizontal and vertical axis, the symbol // or some
similar symbol should be used to indicate that.
4. Draw a rectangle for each class directly above the corresponding interval.
The height of each rectangle is the frequency (or relative frequency) of the
corresponding class.


5. There are no gaps between the bars of the histogram.

To interpret the histogram you must look for:


• the overall pattern and obvious deviations from this pattern (the overall
pattern can be described by its shape, centre and spread)
• the location and number of peaks
• the presence of gaps and outliers.

Possible shapes of the histogram:


A distribution is symmetric if the right-hand side is a mirror image of the
left-hand side.

A distribution is skewed to the right if the ‘tail’ (larger values) extends much
farther out to the right.

A distribution is skewed to the left if the ‘tail’ (smaller values) extends much
farther out to the left.


A distribution is uniform if the frequency of each class is the same, so that the
bars of the histogram all have the same height.

Example 3.12

Draw a histogram showing the AA levels in the potato fries from a sample of Big
Mac’s outlets.

AA levels     Frequency (f)   %f    x     cum < f   cum < %f
151 – <187    3               10    169   3         10
187 – <223    5               17    205   8         27
223 – <259    6               20    241   14        47
259 – <295    6               20    277   20        67
295 – <331    8               27    313   28        93
331 – <367    2               7     349   30        100
Total         30              100

[Figure: histogram with the frequency (f) on the y-axis and the class boundaries
showing acrylamide levels (151, 187, 223, 259, 295, 331, 367) on the x-axis]
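For a quick look at the shape without drawing, a rough text version of the histogram can be sketched as below. It is an illustration only: the function name `text_histogram` is an assumption, and the bars run horizontally (one `#` per observation), so the picture is the histogram turned on its side.

```python
# Class boundaries and frequencies from the AA levels table.
boundaries = [151, 187, 223, 259, 295, 331, 367]
freqs = [3, 5, 6, 6, 8, 2]

def text_histogram(boundaries, freqs):
    """Return one text line per class; bar length equals the class frequency."""
    lines = []
    for lo, hi, f in zip(boundaries, boundaries[1:], freqs):
        lines.append(f"{lo:>3} - <{hi:<3} | {'#' * f} {f}")
    return lines

lines = text_histogram(boundaries, freqs)
print("\n".join(lines))
```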


Activity 3.15

Use your frequency table from activity 3.13 to construct a histogram.


Note: For an ungrouped frequency distribution, the data classes consist of a
single value. To draw the histogram we place the middle of each histogram bar
over the single value represented by the class.

3.3.2 Polygon and relative polygon


The polygon is a line graph that can also be used to portray the shape of the
distribution.

Steps
1. The frequency distribution must show the class midpoint (x) of each class.
2. Mark the class midpoints on the x-axis.
3. Mark the frequencies on the y-axis using a proper scale and preferably
starting at the zero point. The scale must include values large enough to
include the largest frequency.
4. Plot each midpoint together with its corresponding frequency.
5. Connect the successive dots with straight lines to form the polygon.
6. Frequency polygons begin and end on the horizontal axis with a frequency
of zero. On the left end, plot a point one class width to the left of the first
midpoint with a frequency of zero. On the right end, plot a point one class
width to the right of the last midpoint with a frequency of zero.

A polygon that uses the relative frequencies of the intervals rather than the
actual number of points is called a relative polygon. It has the same shape as the
frequency polygon, but uses a percentage scale on the y-axis.

Example 3.13

Below is a frequency table showing the AA levels in the potato fries from a sample
of Big Mac’s outlets. Draw the polygon for this distribution.


Intervals     Frequency (f)   x     cum < f
151 – <187    3               169   3
187 – <223    5               205   8
223 – <259    6               241   14
259 – <295    6               277   20
295 – <331    8               313   28
331 – <367    2               349   30
Total         30

[Figure: frequency polygon with the number of franchises on the y-axis and the
acrylamide levels (midpoints 169, 205, 241, 277, 313, 349) on the x-axis]
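The points that make up the polygon, including the two closing points on the horizontal axis, can be listed with a short sketch. The function name `polygon_points` is an assumption, and equal class widths are assumed, as in the example.

```python
# Class midpoints and frequencies from the AA levels table.
midpoints = [169, 205, 241, 277, 313, 349]
freqs = [3, 5, 6, 6, 8, 2]

def polygon_points(midpoints, freqs):
    """Return the (x, y) points of a frequency polygon, closed on the x-axis."""
    width = midpoints[1] - midpoints[0]            # equal class widths assumed
    xs = [midpoints[0] - width] + list(midpoints) + [midpoints[-1] + width]
    ys = [0] + list(freqs) + [0]                   # start and end at frequency zero
    return list(zip(xs, ys))

points = polygon_points(midpoints, freqs)
for x, y in points:
    print(x, y)
```

Joining these points in order with straight lines gives the polygon; the first and last points lie one class width outside the first and last midpoints.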

Activity 3.16

Use your frequency table from activity 3.13 to construct a polygon.

3.3.3 Ogive (cumulative curve) and relative ogive


An ogive is a smooth curve that can be used to estimate graphically the number
of observations below or above a set level. Therefore an ogive requires cumulative
class frequencies.

Steps
1. The frequency distribution must show class boundaries and cumulative
frequencies.
2. The frequency scale on the y-axis must extend to the total of the frequencies.


3. Mark the class boundaries on the x-axis.


4. For each class, plot the upper boundary together with the cumulative ‘less
than’ class frequency.
5. The ‘less than’ ogive begins on the horizontal axis (cum f = 0) at the lower
class boundary of the first class.
6. Draw a smooth curve through the points. The ‘less than’ curve slopes
upwards and to the right.
7. If the cumulative frequencies are expressed as percentages of the total, a
relative ogive can be drawn.

Example 3.14

Below is a frequency table showing the AA levels in the potato fries from a sample
of Big Mac’s outlets. Draw an ogive for this distribution.

Classes       Frequency (f)   x     cum f
151 – <187    3               169   3
187 – <223    5               205   8
223 – <259    6               241   14
259 – <295    6               277   20
295 – <331    8               313   28
331 – <367    2               349   30
Total         30

[Figure: ‘less than’ ogive with the number of outlets (0 to 30) on the y-axis and
the acrylamide levels (class boundaries 151 to 367) on the x-axis]


Conclusion: Approximately eight of the franchises have AA levels of less than
223. That means that the other 22 outlets have levels of 223 or above.
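Reading a value off the ogive can be imitated numerically. The sketch below is an approximation under an explicit assumption: it joins the plotted points with straight lines (linear interpolation) instead of a hand-drawn smooth curve, so readings between class boundaries are estimates only. The function name `estimate_below` is illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies from the AA levels table.
boundaries = [151, 187, 223, 259, 295, 331, 367]
freqs = [3, 5, 6, 6, 8, 2]

# Ogive points: (boundary, cumulative 'less than' frequency), starting at zero.
ogive = list(zip(boundaries, [0] + list(accumulate(freqs))))

def estimate_below(level, points):
    """Estimate how many observations lie below `level` (linear interpolation)."""
    if level <= points[0][0]:
        return 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if level <= x1:
            return y0 + (y1 - y0) * (level - x0) / (x1 - x0)
    return float(points[-1][1])                    # level lies beyond the last boundary

below_223 = estimate_below(223, ogive)
print(below_223, sum(freqs) - below_223)           # outlets below 223, and the rest
```

At a class boundary such as 223 the estimate equals the cumulative frequency exactly; between boundaries it assumes that the observations are spread evenly within the class.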

Activity 3.17

Use your frequency table from activity 3.13 to construct an ogive.

3.4 Using software
There are a number of useful software packages available for data presentation
and most of them are simple and easy to use.
Computers can help you develop your ideas about how to organise the
information by using a ‘try and refine’ approach, which would take too long to
carry out manually. For example, if you decide to break the information down
in a certain way and the results are not what you need, it is a simple matter to
create new ways and experiment again.
Computer software can produce accurate and professional graphs and charts
from data, but these are only as useful as the data and instructions used to make
them.

TEST YOURSELF 3

1. A questionnaire about how people get news resulted in the following
information from a sample of 25:
N = newspaper; T = television; R = radio; M = magazine

N T N R N T N R N
R T M R M M N M
M N R T R R T M

Summarise the results in a frequency table and construct a pie chart.
Interpret your results.
2. The blood of 25 children was tested to determine their blood types. Construct
two simple bar charts to display the data using the frequency in one and the
%f in the other.


Blood type Frequency Percentage


A 5
B 7
O 9
AB 4

a) According to the data, which blood type is most common?


b) According to the data, which blood type is least common?
3. In South Africa a background check is done on any applicant who applies
for a firearm licence. The given table categorises the licence applicants who
were denied because of a criminal history for the period 2011 and 2012.

Criminal history Frequency 2011 Frequency 2012


Domestic violence 64 800 69 100
Felony offence 254 880 215 000
Drug-related offence 30 240 34 029
Other 82 080 73 124

a) Construct a simple bar chart to portray the data for each year.
b) Construct a stacked bar chart to portray the data.
c) Construct a multiple bar chart to portray the data.
d) Comment on your results.
4. A recent newspaper article ‘The need to be connected’ described the results
of a survey of 1  000 adults who were asked about how various essential
technologies, including personal computers, cell phones and DVD players,
influenced their daily lives. The table summarises the responses:

Response PC Cell phone DVD player

Cannot live without 47% 42% 18%


Would miss, but could do without 27% 26% 35%
Could definitely live without 26% 32% 47%

Construct a comparative bar graph to portray the responses for the different
technologies.
5. In the manufacture of printed circuit boards, finished boards are subjected
to a final inspection before they are distributed to customers. The type of
defect for each board rejected at this final inspection during a randomly
selected day is listed together with the frequency of occurrence:


Type of defect              Defects during        Defects during
                            morning inspection    afternoon inspection
Etching                     5                     7
Lamination                  10                    7
Plating separation          9                     8
Poor electrodes coverage    36                    40
Low copper plating          122                   130

Construct a comparative bar graph to portray the inspections for the


different times of the day.
6. Illustrate the following data by means of a multiple and a stacked bar chart.

Johnson and Co. Analysis of costs (R’000)


Year Wages Raw materials Overheads
2005 15 10 3
2006 14 12 3
2007 10 11 4
2008 11 8 4

7. Use a pie chart to illustrate the following data.

Waxman and Co.


Analysis of employment hours lost
Hours lost through illness 5 100
Hours lost through holidays 2 150
Hours lost through industrial disputes 8 750
Total 16 000

8. A survey of 3 000 adults asked ‘How accurate are the weather forecasts in
your area?’ The responses are summarised in the given table:

Extremely accurate 5%
Very accurate 26%
Sometimes accurate 55%
Not too accurate 9%
Not at all accurate 4%
Not sure 1%


a) Construct a pie graph to portray the data.


b) Construct a bar graph to portray the data.
c) Comment on your answers.
9. The volumes of water (in litres) consumed by 24 elephants in one day are
listed below:

66 90 68 94 86 96 70 138 90 120 92 102


82 120 132 82 64 80 88 78 92 66 106 106
Summarise the data set using a stem-and-leaf plot and comment on your
results.
10. The number of persons who volunteered to donate blood at a shopping
centre was recorded for each of 20 successive Saturdays. The data are
shown below:

250 325 333 368 301 386 295 308 320 315
310 332 270 334 356 315 334 370 274 260
Construct a dot plot, stem-and-leaf plot and a frequency distribution for the
data.
11. An ecologist wishes to investigate the level of mercury pollution in a stream
in the Dullstroom area. He catches 25 trout and measures the concentration
of mercury (measured in parts per million) in each fish:

2.2 1.4 1.7 3.4 2.7 2.6 3.0 3.6 3.5 2.6 1.9 3.0 3.8
2.2 2.9 1.8 3.0 3.4 2.8 3.3 3.1 3.2 2.3 2.4 3.7
Construct a dot plot and a stem-and-leaf plot from the data. (Hint: split the
stems in two.)
12. The following observations represent the lifetimes (hours) of a certain type
of energy-saver lamp. Construct a dot plot and stem-and-leaf plot for the
data.

612 1 016 1 022 1 003 1 201 883 898 1 029 1 088 1 135
623 666 744 983 1 029 1 058 1 085 1 122 970 964
13. The following observations were measurements on coating thickness for a
sample of low-viscosity paint. Construct a dot plot and stem-and-leaf plot
for the data.

0.83 0.88 0.88 1.04 1.09 1.12 1.29


1.31 1.48 1.49 1.59 1.62 1.65 1.71


14. The following observations are carbon monoxide levels (ppm) in air samples
obtained from a certain region in Gauteng. Construct a stem-and-leaf plot
for the data.

9.3 10.7 8.5 9.6 12.2 16.6 9.2 10.5


7.9 13.2 11.0 8.8 13.7 12.1 9.8
15. The Food and Health Department conducted a study on the calorie content
of different types of beer. The calorie contents (calories per 100 ml) for 25
different brands of beer are listed below:

43 29 29 31 35 39 42 41 31 27 31 33 32
34 28 22 30 40 33 19 30 32 31 28 23
Construct a stem-and-leaf plot and comment on the calorie content of beer.

Summarise the data in the following data sets (questions 16–25) using a
frequency distribution and portray the data using frequency distribution graphs.
16. The treatment times (in minutes) for a sample of 50 patients at a health
clinic are as follows:

10 12 22 21 20 24 20 35 31 24
24 45 12 26 17 29 19 7 27 29
17 18 16 13 2 16 12 15 22 11
11 15 41 29 16 21 24 14 24 16
8 33 18 21 12 13 15 21 10 33

The health clinic advertises that 90% of all its patients have a treatment
time of 40 minutes or less. Does the sample data support this claim? (Hint:
Use the cumulative relative ogive to answer the question.)
17. The amounts of protein (g) for a variety of fast-food sandwiches are reported
here:

29 33 23 25 27 40 30 15 35 35 20 18 26
38 27 27 43 57 44 19 35 22 26 22 14 42
35 12 24 24 20 26 12 21 29 34 23 31 15

18. The following are the numbers of kilometres (in thousands) driven during
the year by 110 food inspectors:


40 29 35 33 88 24 38 28 20 21
43 31 18 67 29 76 26 30 23 18
49 44 97 40 48 15 37 43 36 22
55 54 41 34 35 24 38 47 66 34
65 60 32 56 68 38 42 62 55 42
73 31 31 30 36 61 45 52 50 90
30 50 75 20 34 71 51 48 45 84
36 27 52 39 44 51 11 35 41 73
32 65 40 32 81 42 42 53 45 61
10 41 46 84 28 39 47 63 50 52
26 93 36 38 44 58 52 41 55 48

19. A study on the effects of television on the behaviour of adolescents uses,
as part of the study, the number of hours per day that the television set is
turned on in a household. The following results were obtained in a sample
of 30 households:

6.9 4.6 4.3 5.0 6.0 5.3 4.6 3.9 6.0 3.9
6.3 4.2 6.0 5.6 4.2 4.6 6.0 4.3 3.6 6.0
6.0 5.8 3.9 5.7 6.0 3.9 3.7 3.9 3.7 3.9

20. The following data represent tons of maize harvested each year for 40 years
from Section 20 on an agricultural experiment farm in the Delmas area:

2.71 2.82 1.35 2.20 1.47 2.39 0.59 0.46 1.31 2.50
1.80 0.89 1.64 1.62 1.39 2.19 1.18 1.26 2.04 2.33
1.32 2.60 2.07 0.94 1.42 1.19 2.34 0.77 0.89 1.44
1.62 2.15 0.95 2.02 1.67 1.99 1.48 0.70 0.98 2.00

21. Many people consider the number of calories in an ice-cream bar more
important than cost. To investigate the calorie content, a sample of 26 bars
gave the following results:

342 310 131 294 209 319 111 353 201 295 182 233 323
234 197 377 439 151 286 147 377 190 182 151 260 301
22. The times, in minutes, that a sample of 70 workers spent waiting at various
points in the production line were as follows:


1 3 7 23 1 2 5 1 0 6 2 14 5 3 5 6
5 0 1 2 4 5 18 0 1 3 3 1 11 21 1 3
0 6 7 1 19 3 5 1 17 3 5 16 10 2 5 6
1 3 8 5 4 14 15 12 0 2 4 4 2 5 9 6
11 15 13 17 2 20

23. From a sample of 36 full-time students, the following information was
obtained on the time, in hours, each one spent studying last week:

22 11 33 10 28 7 12 25 32
22 46 21 10 18 17 29 14 2
37 35 3 5 18 4 29 21 20
44 23 31 31 24 13 23 10 36

24. Each of the following figures represents the weight of a package passing
through a sorting office. Construct a frequency distribution with cumulative
and relative columns:

7.9 7.8 5.0 8.6 8.1 7.9 8.2 8.1 7.3 8.0 8.2 4.9 8.0 7.5 7.4
8.0 8.0 7.7 7.8 7.5 7.8 5.3 7.9 6.8 7.5 6.9 5.2 8.5 7.9 7.5
5.2 8.2 4.9 8.7 7.7 7.8 6.0 8.1 8.5 8.0 6.1 7.8 8.1 7.6 7.8
7.9 7.9 5.3 7.9 8.1 7.6 7.9 8.3 7.4 8.4 7.6 8.0 8.0 8.2 8.2
6.9 8.1 5.7 7.9 7.7 7.9 6.8 7.8 7.7 7.5 8.1 8.1 8.0 5.1 5.7
6.0 8.0 5.6 8.2 7.6 7.9 6.2 5.4 5.9 7.8 8.7 6.6 8.1 7.7 6.1
7.8 7.4 8.1 7.3 7.1

25. The following data were recorded on a study of flexural strength of
high-performance concrete obtained by using certain binders and
superplasticisers:

9.0 7.9 6.3 7.0 6.5 6.8 7.6 7.0 6.8


8.1 6.3 7.3 7.2 5.9 8.2 8.7 7.8 9.7
7.4 7.7 9.7 7.8 7.7 11.6 11.3 11.8 10.7



UNIT 4 Summarising data using numerical descriptors
In this unit we look at numerical measures that can be used to describe the
characteristics of data collected in its raw form (ungrouped data) as well as for
data summarised into frequency distributions (grouped data).

After completion of this unit you will be able to:


• compute the mean, median and mode for both grouped and ungrouped data
• describe the characteristics of the mean, median and mode
• compute the range, mean average deviation and standard deviation
• compute and interpret the coefficient of variation and the coefficient of skewness
• locate and plot the mean, median and mode for symmetrical and skewed
distributions
• compute measures of relative standing.

Numbers used to describe data sets are called descriptive measures. Statistics
are summary measures used to describe a sample, whereas populations are described
by parameters. In this text, sample statistics are calculated
and used in later units to estimate the population parameters. Data has three
major characteristics: location, dispersion and shape.

Describe data using measures of:

Location
  Central tendency
  • Arithmetic mean
  • Median
  • Mode
  Relative standing
  • Quartile
  • Percentiles

Dispersion
  • Range
  • Standard deviation
  • Variance
  • Coefficient of variation
  • Interfractile ranges
  • Quartile deviation

Shape
  • Pearson’s 2nd coefficient of skewness
  • Pearson’s 1st coefficient of skewness
  • Box-and-whisker plot


4.1  Measures of central tendency


Measures of central tendency numerically describe the average or typical value
of a data set. This is a single value that represents the whole data set. There are
many averages, each having its own characteristics. For the same set of data all
the averages might have different values. Three commonly used averages are:
• the arithmetic mean
• the median
• the mode.

4.1.1 Arithmetic mean


This is the most commonly used measure of central tendency and is often referred
to as the average or the mean. The sample statistic, the mean, is represented by
the symbol $\bar{x}$ (x-bar), and the population parameter, the mean, is represented by
the Greek letter $\mu$ (‘mu’).
The mean can be seen as the centre of gravity. It is the middle of the actual
numerical values of all the observations, not necessarily in the middle of the
number of observations.
The arithmetic mean is the sum of all the values in a data set, divided by the number
of values in the data set.

Ungrouped data
Ungrouped (or raw) data will usually be presented as a list of numbers in any
order or quantity.
$$\bar{x} = \frac{\Sigma x}{n}$$

Where:
$\bar{x}$ = arithmetic mean
x = each observation value
n = number of observations

Steps
1. Add the values of the individual observations ($\Sigma x$).
2. Count the number of observations (n).
3. Substitute the totals into the formula for $\bar{x}$.
4. Divide the sum of the values by the number of observations (n).
5. Interpret your answer.


Example 4.1

Calculate the arithmetic mean for the number of cars entering a parking area
during a sample of 10 10-minute intervals.

10 22 31 9 24 27 29 9 23 12

1. Add the numbers:
   $\Sigma x = 10 + 22 + 31 + 9 + 24 + 27 + 29 + 9 + 23 + 12 = 196$
2. Count the numbers: n = 10
3. Use the formula to calculate the mean:
   $\bar{x} = \frac{\Sigma x}{n} = \frac{196}{10} = 19.6 \approx 20$ cars

We can conclude that on average 20 cars enter the parking area during a
10-minute interval.
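The same calculation can be sketched in a few lines of Python, using only built-in functions; the variable names are illustrative.

```python
# Number of cars per 10-minute interval, from Example 4.1.
cars = [10, 22, 31, 9, 24, 27, 29, 9, 23, 12]

total = sum(cars)      # step 1: add the values (sum of x)
n = len(cars)          # step 2: count the observations
mean = total / n       # steps 3 and 4: substitute into the formula and divide

print(total, n, mean)
```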

Activity 4.1

A city planner working on bikeways needs information about local bicycle
commuters. One of the questions in her questionnaire asks how many minutes
it takes the rider to pedal from home to his or her destination. A sample of 12
local bicycle commuters yielded the following times:

22 29 27 30 12 22 31 15 26 16 48 23

Determine the mean travelling time.

Grouped data

We cannot calculate exact values of the mean without raw data. If the
source of data is from a grouped frequency distribution, the mean can be
approximated using the technique in this section.
You have to assume that each observation of a class falls on the midpoint
(x) of that class. That means that the observations in a particular interval all
take the same value.


$$\bar{x} = \frac{\Sigma xf}{n}$$

Where:
x = class midpoint of each class
f = frequency of each class
n = number of observations in the sample ($\Sigma f$)

Steps
1. Determine the class midpoint (x) of each class.
2. Multiply the midpoint (x) by the frequency (f) to obtain (xf) of each class.
Write the products in a column with the heading xf.
3. Sum the xf column to obtain $\Sigma xf$.
4. Sum the frequency column to obtain n: $n = \Sigma f$
5. Substitute the column totals into the formula.
6. Calculate the mean ($\bar{x}$) for grouped data.
7. Interpret your answer.

Example 4.2

The following frequency table shows the time (in minutes) taken to travel to
work for a sample of 25 people from Gauteng. Calculate the mean time to travel
to work.

Class boundaries    f     x       xf
15.5 – <21.5        2     18.5    37
21.5 – <27.5        6     24.5    147
27.5 – <33.5        8     30.5    244
33.5 – <39.5        4     36.5    146
39.5 – <45.5        4     42.5    170
45.5 – <51.5        1     48.5    48.5
Total               25            792.5


Steps
1. Calculate the midpoints (x) column by adding the lower boundary to the
upper boundary of each class and dividing the sum by 2.
2. Multiply the midpoint by the frequency of each class to obtain the xf column.
3. Sum the xf column.
4. Sum the f column to obtain n.
5. Substitute $\Sigma xf$ and n into the formula and calculate $\bar{x}$:
   $\bar{x} = \frac{\Sigma xf}{n} = \frac{792.5}{25} = 31.7$ minutes

The approximate mean time to travel to work for the sample of people from
Gauteng is 31.7 minutes.
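The grouped-data calculation can be checked with a short sketch. It follows the same steps as the example and rests on the same assumption, namely that every observation in a class lies at the class midpoint; the variable names are illustrative.

```python
# Class boundaries and frequencies from Example 4.2.
classes = [(15.5, 21.5), (21.5, 27.5), (27.5, 33.5),
           (33.5, 39.5), (39.5, 45.5), (45.5, 51.5)]
freqs = [2, 6, 8, 4, 4, 1]

midpoints = [(lo + hi) / 2 for lo, hi in classes]   # step 1: class midpoints x
xf = [x * f for x, f in zip(midpoints, freqs)]      # step 2: products xf
sum_xf = sum(xf)                                    # step 3: sum of the xf column
n = sum(freqs)                                      # step 4: n equals the sum of f
mean = sum_xf / n                                   # step 5: approximate mean

print(sum_xf, n, mean)
```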

Activity 4.2

Calculate the mean number of hours of personal computer usage per week for a
sample of 16 people and interpret your answer.

Hours spent           f      x       xf
1.95 – < 3.95         2
3.95 – < 5.95         5
5.95 – < 7.95         5
7.95 – < 9.95         3
9.95 – < 11.95        1
Total                16

Characteristics of the arithmetic mean


1. It is the arithmetic average of all the values in the data set.
2. Every numerical data set has only one mean.
3. It is reliable because it reflects all the values in the data set.
4. It is sensitive to every value in the data set and can be greatly affected by the
presence of even a single extreme value (or outlier).

Note: An outlier is an unusually large or small observation in comparison with
the rest of the values in the data set. If the outlier is a high value, it will cause the
mean to increase, while a small value outlier will cause the mean to decrease.


5. It is useful for further inferential statistical procedures.


6. It can be calculated using a pocket calculator with preprogrammed formulae.

4.1.2 Median
The median is the value that occupies the middle position in a data set when
arranged in numerical order. This means that there are an equal number of data
values in the ordered distribution that are above it and below it.

Ungrouped data

Steps
1. Arrange the data in numerical order.
2. Count the number of observations (n).
3. Determine the position of the median.
$$\text{Median position} = \frac{n + 1}{2}$$
4. Read the value of the median from the number list.
• If the number of observations is odd, the median is the value that is
exactly in the middle of the data set.
• If the number of observations is even, the median is the average of the
two middle values in the data set.

Example 4.3

Find the median of each data set.


1. Over a seven-day period, the number of customers (per day) purchasing at
Hides Leather Shop was as follows:
4 80 50 10 60 12 5

• Arrange the data in numerical order:


 4 5 10 12 50 60 80

• Determine the position of the median: $\frac{n + 1}{2} = \frac{7 + 1}{2}$ = value number 4
• Count up to value number 4 on the numerical list: median = 12
• On about half of the days fewer than 12 customers purchased at the shop,
and on about half of the days more than 12 did.


2. A city planner working on special tracks for bicycles recorded how many
minutes it takes bicycle commuters to pedal from home to their destination.
A sample of 12 bicycle commuters yields the following times:

22 29 27 30 12 22 31 15 26 16 48 23

Determine the median travelling time.

• Numerical order:
 12 15 16 22 22 23 26 27 29 30 31 48

• Position of the median: $\frac{n + 1}{2} = \frac{12 + 1}{2}$ = value number 6.5
• Value number 6.5 falls between 23 and 26.
• Median = $\frac{23 + 26}{2}$ = 24.5
• 50% of the commuters took less than 24.5 minutes to travel to their
destination and 50% took more than 24.5 minutes to travel to their
destination.
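
As a rough sketch, the steps for ungrouped data can be written in Python as follows. The function name `find_median` is chosen here for illustration; the two data sets are those of example 4.3.

```python
def find_median(values):
    """Return the median of an ungrouped data set."""
    ordered = sorted(values)          # Step 1: numerical order
    n = len(ordered)                  # Step 2: number of observations
    position = (n + 1) / 2            # Step 3: 1-based median position
    if n % 2 == 1:
        # Odd n: the position is a whole number, so read the value directly.
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, so average the two middle values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


customers = [4, 80, 50, 10, 60, 12, 5]
cycling_times = [22, 29, 27, 30, 12, 22, 31, 15, 26, 16, 48, 23]

print(find_median(customers))      # 12
print(find_median(cycling_times))  # 24.5
```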

Activity 4.3

1. The following numbers represent the typing speeds in words per minute of
five secretaries.
30 90 45 25 55

Determine the median typing speed.

2. How many calories are in a serving of cheese pizza? A variety of pizzas from
different outlets were sampled and the calories per serving were determined.
The calories were as follows:
332 275 393 347 350 353 357 296 358 322 337 323 333 299

Determine the median calorie content and interpret your answer.

Grouped data
You can calculate an estimated median for a frequency distribution either
graphically or by calculation.


With grouped data we are unable to determine where the true middle value
falls, but we assume that the median value will be value number $\frac{n}{2}$ and that the
frequencies in the median class are evenly spread. Use the following formula to
calculate an estimate for the median.

$$\text{Median} = L + \frac{\left(\frac{n}{2} - F\right)c}{f_m}$$

Where:
n = $\sum f$
L = lower boundary of the median class
$f_m$ = frequency of the median class
c = width of interval
F = sum of all the frequencies up to but not including the median class.

Steps
1. Determine the location of the median: $\frac{n}{2}$
2. Construct the cumulative frequency column (cum < f).
3. Compare the position of the median with the cum < f column to determine
which one of the intervals contains the median. The median class is the
interval where the cumulative frequency is equal to or exceeds $\frac{n}{2}$ for the first
time.
4. Estimate the value of the median using the formula for grouped data.

Example 4.4

Calculate the median time (in minutes) taken to travel to work for a sample of
25 people in Gauteng.

Class boundaries      f    cum < f
15.5 – < 21.5         2       2
21.5 – < 27.5         6       8
27.5 – < 33.5         8      16
33.5 – < 39.5         4      20
39.5 – < 45.5         4      24
45.5 – < 51.5         1      25
Total                25


1. Calculate the cumulative frequency column (cum < f).
2. Determine the position of the median: $\frac{25}{2}$ = value number 12.5
3. Compare 12.5 with the values in the cum < f column. Up to the end of the
2nd class we have eight values. Therefore value number 12.5 must fall in
class number 3, which contains values number 9 to 16.
4. The median will fall somewhere between 27.5 and 33.5.
5. Substitute the required values from that class into the formula and calculate
the median.

The median time to travel to work for the 25 people in the sample is:

$$\text{Median} = L + \frac{\left(\frac{n}{2} - F\right)c}{f_m} = 27.5 + \frac{\left(\frac{25}{2} - 8\right)6}{8} = 30.88$$

This means that half the sampled people travelled less than 30.88 minutes to
work and the other half more than 30.88 minutes.
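
The calculation can be sketched in Python as shown below. The function name `grouped_median` is illustrative only, and the code assumes, as the formula does, that the observations in the median class are evenly spread.

```python
# Frequency distribution from example 4.4 (travel time in minutes).
boundaries = [(15.5, 21.5), (21.5, 27.5), (27.5, 33.5),
              (33.5, 39.5), (39.5, 45.5), (45.5, 51.5)]
frequencies = [2, 6, 8, 4, 4, 1]


def grouped_median(boundaries, frequencies):
    """Estimate the median as L + (n/2 - F) * c / f_m."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the frequencies must sum to more than zero")
    position = n / 2                  # location of the median
    cumulative = 0                    # F: frequencies before the current class
    for (lower, upper), f in zip(boundaries, frequencies):
        # The median class is the first class whose cumulative
        # frequency equals or exceeds n/2.
        if cumulative + f >= position:
            width = upper - lower     # c: width of the interval
            return lower + (position - cumulative) * width / f
        cumulative += f
    raise ValueError("no median class found")


median_time = grouped_median(boundaries, frequencies)
print(median_time)  # 30.875, which the example rounds to 30.88
```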

Use the ogive to determine the median


You can determine the value of the median graphically by making use of the
ogive.

Steps
1. Draw the cumulative ‘less than’ ogive.
2. Find the median position $\frac{n}{2}$ on the vertical axis.
3. Draw a straight horizontal line up to the ogive. Drop a straight line down to
the x-axis.
4. The corresponding value on the horizontal x-axis is the median value.


Example 4.5

Use the table from example 4.4 and determine the median value graphically.

1. Find the position of the median on the vertical axis: $\frac{n}{2} = \frac{25}{2} = 12.5$

[Figure: cumulative ‘less than’ ogive. Vertical axis: Cum f (no. of people),
0 to 25. Horizontal axis: Class boundaries (minutes), 15.5 to 51.5.]

2. Draw a straight horizontal line from the position on the y-axis up to the
ogive. Drop a straight line down to the x-axis.
3. Read the median value from the x-axis: median ≈ 31 minutes.

Activity 4.4

Calculate the median number of hours of personal computer usage per week for
a sample of 16 people by making use of a formula as well as a graph.

Hours                 f
1.95 – < 3.95         2
3.95 – < 5.95         5
5.95 – < 7.95         5
7.95 – < 9.95         3
9.95 – < 11.95        1
Total                16

Characteristics of the median


1. It is the central value: 50% of the measurements lie above it and 50% lie
below it.


2. There is only one median for a data set.


3. Its computation does not involve every measurement in the data set.
– Outliers do not affect the median as strongly as they do the mean because
the median is dependent on the middle position values of the data set,
which excludes outliers at the beginning or end of the ordered data set.
4. It is a better measure of central tendency to use than the mean when the
data is very skewed.

4.1.3 Mode
The mode of a data set is the value that occurs most frequently. It can be a good
measure to represent a typical value such as the most popular shirt size.

Ungrouped data
Tally the number of observations that occur for each data value. If there is no
value that occurs more often than the others, then there is no mode. (Note: this
is not the same as a mode of 0.) A set of data can have no mode, one mode or
more than one mode (bi-modal or multi-modal).

Example 4.6

1. The commission earnings of five colleagues for the previous month were as
follows:
R5 000 R5 200 R5 200 R5 700 R8 600

The modal commission was R5 200 because more of the colleagues earned
R5 200 than any other amount.

2. The lengths of stay (in days) for a sample of nine patients in Ward A are:
17 19 19 4 19 26 4 21 4

The modal lengths of stay are 19 days and four days: more of the patients
stay either four days or 19 days than any other number of days.

3. The hourly income rates of five workers are:


R4 R9 R7 R16 R10

There is no mode: none of the workers earn the same income rate.
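
The tallying approach can be sketched in Python with `collections.Counter` from the standard library. The function name `find_modes` is chosen here for illustration. Following the rule stated above, it returns an empty list when no value occurs more often than the others.

```python
from collections import Counter


def find_modes(values):
    """Return a sorted list of the modes; an empty list means no mode."""
    counts = Counter(values)                      # tally each data value
    highest = max(counts.values())
    # If every distinct value occurs equally often, no value stands out.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


commissions = [5000, 5200, 5200, 5700, 8600]
lengths_of_stay = [17, 19, 19, 4, 19, 26, 4, 21, 4]
hourly_rates = [4, 9, 7, 16, 10]

print(find_modes(commissions))      # [5200]
print(find_modes(lengths_of_stay))  # [4, 19]
print(find_modes(hourly_rates))     # []
```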


Activity 4.5

A telephone company conducted a study on the length of long-distance calls. A
sample of 10 calls gave the following lengths in minutes:

1.4 15.5 2.1 8 15.5 1.4 17.7 7.2 9.1 15.5

Determine the modal length and comment on your answer.

Grouped data
The mode of grouped data can be estimated either graphically or by making
use of a formula. Grouped data does not show a single most frequently
occurring value, so we assume that the mode lies in the interval with the
highest frequency.

$$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)c$$

Where:
L = lower boundary of modal class
c = width of interval
$\Delta_1$ = frequency (f) of modal class minus f of previous class
$\Delta_2$ = frequency (f) of modal class minus f of following class

Steps
1. Select the class containing the highest frequency as the modal class.
2. Determine the $\Delta_1$ value by subtracting the frequency of the class preceding
the modal class from the frequency of the modal class.
3. Determine $\Delta_2$ by subtracting the frequency of the class following the modal
class from the frequency of the modal class.
4. Use the formula to estimate the modal value.
5. Interpret your answer.


Example 4.7

The modal time (in minutes) taken to travel to work for a sample of 25 people is:

Time (minutes)        f
15.5 – < 21.5         2
21.5 – < 27.5         6
27.5 – < 33.5         8
33.5 – < 39.5         4
39.5 – < 45.5         4
45.5 – < 51.5         1
Total                25

1. Choose the class with the highest frequency – that is class number 3.

2. $\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)c = 27.5 + \left(\frac{2}{2 + 4}\right)6 = 29.5$ minutes

3. More of the people take 29.5 minutes to travel to work than any other time.
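
A minimal Python sketch of the formula follows. The function name `grouped_mode` is illustrative. Two choices in the code are assumptions that the text does not address: a modal class at either end of the table is treated as having a neighbour with frequency 0, and when several classes share the highest frequency only the first is used.

```python
# Frequency distribution from example 4.7 (travel time in minutes).
boundaries = [(15.5, 21.5), (21.5, 27.5), (27.5, 33.5),
              (33.5, 39.5), (39.5, 45.5), (45.5, 51.5)]
frequencies = [2, 6, 8, 4, 4, 1]


def grouped_mode(boundaries, frequencies):
    """Estimate the mode as L + (d1 / (d1 + d2)) * c."""
    # Step 1: the modal class has the highest frequency (first one if tied).
    i = frequencies.index(max(frequencies))
    lower, upper = boundaries[i]
    width = upper - lower
    # Assumption: a missing neighbouring class counts as frequency 0.
    f_previous = frequencies[i - 1] if i > 0 else 0
    f_following = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = frequencies[i] - f_previous      # Step 2
    d2 = frequencies[i] - f_following     # Step 3
    if d1 + d2 == 0:
        raise ValueError("the formula is undefined when d1 + d2 = 0")
    return lower + d1 * width / (d1 + d2)  # Step 4


modal_time = grouped_mode(boundaries, frequencies)
print(modal_time)  # 29.5
```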

Use the histogram to approximate the mode

Steps
1. Draw the histogram of the frequency distribution.
2. Identify the tallest bar on the histogram as the modal bar.
3. Draw a line from the top right corner of the modal bar to the top right
corner of the bar to its immediate left.
4. Draw a second line from the top left corner of the modal bar to the top left
corner of the bar to its immediate right.
5. Draw a straight line parallel to the y-axis through the intersection point of
the previous two lines down to the x-axis.
6. The value on the x-axis approximates the modal value.


Example 4.8

This histogram is constructed from the frequency distribution in example 4.7
and is used to determine the modal time travelled to work graphically.

[Figure: histogram of travel times. Vertical axis: No. of people (f); bar
heights 2, 6, 8, 4, 4 and 1. Horizontal axis: Class boundaries (minutes),
15.5 to 51.5.]

The mode is approximately 29 minutes.

Activity 4.6

Calculate the modal number of hours of personal computer usage per week for a
sample of 20 people using a formula and read the modal number of hours from
an appropriate graph. Interpret your result in the context of the data.

Time (hours)          f
1.95 – < 3.95         2
3.95 – < 5.95         5
5.95 – < 7.95         9
7.95 – < 9.95         3
9.95 – < 11.95        1
Total                20


Characteristics of the mode


1. It is the most frequent value in the data set.
2. It is not based on all the data in the sample.
3. It is not affected by extreme values.
4. The mode can be determined for all levels of measurement, including
quantitative and qualitative (nominal and ordinal) data.
5. Sometimes the mode does not exist or it is possible that there is more than
one mode. For these reasons, many people consider the mode an unreliable
measure of central tendency.

4.1.4 Choose between the mean, median or mode


An average should convey an impression of a distribution in a single value. It is
therefore important to use the right type of average. The different averages have
different uses. The factors that play a role in choosing the right average are the
following:
1. Is the nature of the data numerical or non-numerical?
• The mode, which is the value that occurs most often, is the only measure
of central tendency useful for nominal scale data (qualitative data that
you cannot rank in any way). You can also use the mode for all other
qualitative or quantitative (numerical) data sets.
• If you can rank qualitative data sets (ordinal scale), you can use the
median. The median is also valid for all quantitative data sets.
• The arithmetic mean can be calculated only for quantitative data sets.
2. What does each average tell us?
Depending on the situation and the problem under investigation, one
measure may be superior to another, and in some other cases you can use
all three in conjunction.
• The mode identifies the most common or ‘typical’ value, or the value
that occurs more often than the others do. It may be a good choice if one
value occurs much more often than others do. At the same time, the mode
conveys the least amount of information about the data set as a whole. In
some samples the mode may be in the middle of the distribution, but in
others it may be a value at one end of the distribution. It is also possible
to have more than one mode, which will eliminate the mode as an option.
Outliers do not influence the mode at all and the mode stays at the peak
of the distribution.
• The median indicates the centre of the distribution. The same number
of observations lie above and below the median. Outliers occur at the


beginning or end of a distribution; this means that it is unlikely that
outliers will affect the median very much.
• The mean is the most frequently used average because it includes all the
values in the data set. This feature makes it the most sensitive to extreme
values.
3. What is the shape of the distribution?
• In a symmetrical distribution, the mean, median and mode will be the
same or very close together. Whichever one you choose will give you the
same answer.
• If there are extreme values present on one side of the data set, the
distribution is skewed. If the mean is very different from the median, the
median will be a better option to use. Skewness will be discussed later in
the unit.

Example 4.9

1. The test marks of five students are as follows:


55 59 66 66 94

The arithmetic mean of the marks is 68. This means that the sum of all the
marks evenly divided by all the learners will give you 68. The median value
is 66, which means that half of the learners scored less than 66 and the
other half scored more than 66. The mode is 66, which means that more
learners obtained 66 than any other mark.
  If the values are arranged in numerical order and you slot the arithmetic
mean value in position, you will see that there are four values smaller than
the mean and only one bigger than the mean. This means that the value
on the right is an outlier which pulled the mean to the right, causing the
distribution to be positively skewed. For this reason the median or the mode
will be a better measure to choose.

2. A student obtained the following 4 marks in a semester test:


88 75 95 100

The arithmetic mean of 89.5 is probably the best average to use since it
takes into account all the test marks of the student and therefore indicates
overall performance.
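
The comparison made in the first two parts of this example can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are chosen here. Counting how many values lie on each side of the mean is the check described in part 1.

```python
import statistics

test_marks = [55, 59, 66, 66, 94]        # part 1
semester_marks = [88, 75, 95, 100]       # part 2

mean_mark = statistics.mean(test_marks)
median_mark = statistics.median(test_marks)
modal_mark = statistics.mode(test_marks)

# Slot the mean into the ordered list: how many marks fall on each side?
below_mean = sum(1 for mark in test_marks if mark < mean_mark)
above_mean = sum(1 for mark in test_marks if mark > mean_mark)

print(f"mean = {mean_mark:.1f}, median = {median_mark:.1f}, mode = {modal_mark}")
print(f"{below_mean} marks below the mean, {above_mean} above")
print(f"semester mean = {statistics.mean(semester_marks):.1f}")
```

For the first data set, four marks lie below the mean and only one above it, which is the sign of positive skewness discussed in part 1.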


3. In calculating average house prices in a particular suburb in Gauteng, you
would probably make use of the median. This is because the relatively few
homes with extremely high or extremely low prices do not affect the median
strongly. The median provides a better indication of the average house price.

Activity 4.7

The National Housing Department conducted a survey to estimate the average
number of livable square metres for low-cost housing. The reported mean was
24.5 m² and the median, 22.2 m². Which measure of central tendency is more
appropriate? Explain your answer.

4.2  Measures of dispersion


An average summarises a set of data in just one number. Two sets of data can
have the same mean value and yet be very different if one is more spread out
than the other. To describe this difference quantitatively, we use a measure of
dispersion or variability. This is a descriptive measure that indicates the amount
of spread or variation in a data set. Some commonly used measures of dispersion
are the range, mean absolute deviation, standard deviation and variance.

4.2.1 The range


The range is the difference between the largest and smallest values in a data set.
Although it measures the distance across the entire set of data, its usefulness as
a measure of dispersion is limited. It does not tell us how much the other values
in the data set vary from one another or from the mean. The largest or smallest
value (or both) may also be an outlier, which can cause a distorted picture of the
data.

range = largest value − smallest value

The midrange is the value halfway between the two extremes. It is calculated by
adding the largest and smallest values and dividing the sum by 2.

For grouped data the range is the difference between the upper boundary of the
last interval and the lower boundary of the first interval.


Example 4.10

A bakery regularly orders punnets of blueberries for its famous blueberry
cheesecake. The average weight of the punnets is supposed to be 600 g. The baker
uses one punnet of blueberries in each cake. It is important that the punnets are
of consistent weight so that the cake turns out right. Random samples of punnets
from two suppliers were weighed. The weights in grams of the punnets were:

Supplier 1: 480 600 600 600 760

Supplier 2: 480 540 570 760 760

Calculate the range of punnet weights for each supplier and comment on your
results.

Supplier 1: range = 760 − 480 = 280 g

Supplier 2: range = 760 − 480 = 280 g

The ranges are the same, but it is obvious that the variations within the samples
are different. So the range will not solve the bakery’s problem if they want to
choose a supplier that will provide punnets with consistent weights.
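
The same comparison can be sketched in Python. The function name `value_range` is chosen here for illustration. The code also shows why the two ranges agree: only the largest and smallest weights enter the calculation.

```python
supplier_1 = [480, 600, 600, 600, 760]
supplier_2 = [480, 540, 570, 760, 760]


def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


print(value_range(supplier_1))  # 280
print(value_range(supplier_2))  # 280

# The samples differ only in their middle values, which the range ignores.
print(sorted(supplier_1)[1:-1])  # [600, 600, 600]
print(sorted(supplier_2)[1:-1])  # [540, 570, 760]
```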

4.2.2 Mean absolute deviation (MAD)


This is a better measure of dispersion than the range because it takes every
observation into account and measures variability around the mean; it measures
how much the data differs from the mean.

Note: The deviation of a value in a data set is the difference between that value
and the mean of the data set.

Some of the values are smaller than the mean, which will result in a negative
deviation, and others are larger than the mean, which will result in a positive
deviation.
To prevent negative deviations from the mean cancelling positive deviations,
the algebraic signs of the deviations are ignored and the absolute differences are
averaged.


Ungrouped data

Steps
1. Calculate the arithmetic mean ($\bar{x}$) of the distribution.
2. Determine the difference between each value and $\bar{x}$ without regard to the
algebraic sign: $|x - \bar{x}|$. The two vertical lines indicate that you are using
the absolute value.
3. Add the absolute values of the deviations: $\sum|x - \bar{x}|$
4. Divide the sum by the number of values (n).

$$\text{MAD} = \frac{\sum|x - \bar{x}|}{n}$$

Example 4.11

Calculate the mean absolute deviation for the number of cars entering a parking
area during a sample of 10-minute intervals.
x                 $|x - \bar{x}|$
10                  9.6
22                  2.4
31                 11.4
9                  10.6
24                  4.4
27                  7.4
29                  9.4
9                  10.6
23                  3.4
12                  7.6
$\sum x = 196$     76.8

1. $\bar{x} = \frac{196}{10} = 19.6$
2. $\sum|x - \bar{x}| = 76.8$
3. n = 10
4. $\text{MAD} = \frac{\sum|x - \bar{x}|}{n} = \frac{76.8}{10} = 7.68$ cars

5. The typical deviation from the mean is 7.68 cars. The smaller the answer,
the less variation we have in the distribution.
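
The calculation in this example can be sketched in Python as follows. The function name `mean_absolute_deviation` is chosen here for illustration.

```python
cars = [10, 22, 31, 9, 24, 27, 29, 9, 23, 12]


def mean_absolute_deviation(values):
    """MAD = (sum of |x - mean|) / n for ungrouped data."""
    n = len(values)
    mean = sum(values) / n                             # Step 1
    absolute_deviations = [abs(x - mean) for x in values]  # Step 2
    return sum(absolute_deviations) / n                # Steps 3 and 4


mad_cars = mean_absolute_deviation(cars)
print(round(mad_cars, 2))  # 7.68
```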


Activity 4.8

In a study of special bicycle tracks the number of minutes it takes a sample of
12 local bicycle commuters to pedal from home to their destination is recorded.

22 29 27 30 12 22 31 15 26 16 48 23

Determine the mean absolute deviation of the travelling time for the riders and
interpret the result in the context of the data.

MAD for grouped data

Steps
1. Calculate the arithmetic mean ($\bar{x}$) of the distribution.
2. Determine the deviation of each midpoint (x) from $\bar{x}$ without regard to the
algebraic sign: $|x - \bar{x}|$
3. Multiply the absolute deviation in each class by the frequency of that class:
$|x - \bar{x}|f$
4. Add the absolute values of the deviations: $\sum|x - \bar{x}|f$
5. Divide the sum by the number of values ($n = \sum f$).
6. Formula: $\text{MAD} = \frac{\sum|x - \bar{x}|f}{n}$

Example 4.12

The following frequency table shows the time (in minutes) taken to travel to work
for a sample of 25 people from Gauteng. Calculate the mean absolute deviation
time to travel to work.
Time (minutes)        f      x     $|x - \bar{x}|f$
15.5 – < 21.5         2    18.5     26.40
21.5 – < 27.5         6    24.5     43.20
27.5 – < 33.5         8    30.5      9.60
33.5 – < 39.5         4    36.5     19.20
39.5 – < 45.5         4    42.5     43.20
45.5 – < 51.5         1    48.5     16.80
Total                25            158.40

1. $\bar{x} = \frac{\sum xf}{n} = \frac{792.5}{25} = 31.7$ min
2. Construct the $|x - \bar{x}|f$ column and sum the results.
3. Substitute the above values into the formula to determine the MAD.

$$\text{MAD} = \frac{\sum|x - \bar{x}|f}{n} = \frac{158.4}{25} = 6.34 \text{ minutes}$$

The average absolute difference between each observation of the time taken to
travel to work and the mean is 6.34 minutes.
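
A short Python sketch of the grouped calculation is given below. The function name `grouped_mad` is illustrative, and each class is represented by its midpoint, as in the steps above.

```python
# Frequency distribution from example 4.12 (travel time in minutes).
boundaries = [(15.5, 21.5), (21.5, 27.5), (27.5, 33.5),
              (33.5, 39.5), (39.5, 45.5), (45.5, 51.5)]
frequencies = [2, 6, 8, 4, 4, 1]


def grouped_mad(boundaries, frequencies):
    """MAD for grouped data = (sum of |x - mean| * f) / n."""
    midpoints = [(lower + upper) / 2 for lower, upper in boundaries]
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    # Weight each absolute deviation by the frequency of its class.
    weighted = [abs(x - mean) * f for x, f in zip(midpoints, frequencies)]
    return sum(weighted) / n


mad_time = grouped_mad(boundaries, frequencies)
print(round(mad_time, 2))  # 6.34
```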

Activity 4.9

The following frequency distribution summarises the number of hours of
personal computer usage per week for a sample of 16 people. Calculate the mean
absolute deviation for the sample time.

Time (hours)          f
1.95 – < 3.95         2
3.95 – < 5.95         5
5.95 – < 7.95         5
7.95 – < 9.95         3
9.95 – < 11.95        1
Total                16

4.2.3 Standard deviation (s)


The standard deviation is the most widely used measure of dispersion. It
measures, on average, how far each data value is from the mean. To prevent
negative deviations from the mean cancelling positive deviations, the differences
are squared.
1. It uses all the entries in the data set and is therefore sensitive to outliers.
2. The larger the standard deviation, the larger the variation in the data.
A standard deviation of zero means there is no variation.
3. It is useful for further inferential statistical procedures because most
statistical theories are based on distributions described by their mean and
standard deviations.
4. The measuring unit is expressed in the original units of measurements
(rand, minutes, metres, etc).


5. It can be calculated using a pocket calculator with preprogrammed formulae.

Ungrouped data

Steps
1. Calculate the arithmetic mean ($\bar{x}$).
2. Find the difference between each observation and the mean by subtracting
$\bar{x}$ from each data value: $(x - \bar{x})$
3. Square each difference: $(x - \bar{x})^2$
4. Sum the squared differences: $\sum(x - \bar{x})^2$
5. Divide the sum by (n − 1) to get the average squared difference.

Note: Division by (n − 1), known as the degrees of freedom, corrects the bias
in estimating the population standard deviation using the sample standard
deviation.

6. The standard deviation is the square root of this result.

$$s = \sqrt{\frac{\sum(x - \bar{x})^2}{n - 1}}$$

7. A large amount of variability in the sample is indicated by a relatively
large value of the standard deviation, whereas a standard deviation close
to zero indicates a small amount of variability. The standard deviation can
be interpreted as a ‘typical’ deviation from the mean. If two samples are
compared, we can say that the sample with the smaller standard deviation
has less variability than the one with the higher standard deviation.

Example 4.13

Calculate the standard deviation for the number of cars entering a parking area
during a sample of 10-minute intervals.

x                 $(x - \bar{x})$     $(x - \bar{x})^2$
10                  −9.6        92.16
22                   2.4         5.76
31                  11.4       129.96
9                  −10.6       112.36
24                   4.4        19.36
27                   7.4        54.76
29                   9.4        88.36
9                  −10.6       112.36
23                   3.4        11.56
12                  −7.6        57.76
$\sum x = 196$       0.0       684.4
1. $\bar{x} = \frac{196}{10} = 19.6$ cars
2. Construct a $(x - \bar{x})$ column: $\sum(x - \bar{x}) = 0$.
3. Square each deviation in the $(x - \bar{x})$ column to obtain $(x - \bar{x})^2$.
4. Sum the squared differences column to obtain: $\sum(x - \bar{x})^2 = 684.4$
5. Divide the sum by (n − 1). There are 10 observations in the sample, therefore:
10 − 1 = 9
6. Substitute the calculated sums into the formula to determine the standard
deviation.
7. $s = \sqrt{\frac{\sum(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{684.4}{10 - 1}} = 8.72$

8. A typical deviation for the number of cars is 8.72.
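
The steps can be sketched in Python as shown below. The function name `sample_standard_deviation` is chosen here for illustration. The result is also compared with `statistics.stdev` from the standard library, which likewise divides by (n − 1).

```python
import math
import statistics

cars = [10, 22, 31, 9, 24, 27, 29, 9, 23, 12]


def sample_standard_deviation(values):
    """s = sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n                                   # Step 1
    squared_differences = [(x - mean) ** 2 for x in values]  # Steps 2 and 3
    total = sum(squared_differences)                         # Step 4
    return math.sqrt(total / (n - 1))                        # Steps 5 and 6


s_cars = sample_standard_deviation(cars)
print(round(s_cars, 2))                  # 8.72
print(round(statistics.stdev(cars), 2))  # 8.72
```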

Activity 4.10

In a study of special bicycle tracks the number of minutes it takes a sample of 12
bicycle commuters to pedal from home to their destination is recorded.

22 29 27 30 12 22 31 15 26 16 48 23

Determine the standard deviation of the travelling time for the riders and
interpret the result in the context of the data.

Grouped data
To estimate the standard deviation from data grouped into a frequency
distribution, we assume that each class is represented by its midpoint (x).


Steps
1. You need a frequency table with the following columns: classes, frequencies
and midpoints.
2. Compute the arithmetic mean ($\bar{x}$).
3. Subtract the mean from each class midpoint and square the difference:
$(x - \bar{x})^2$
4. Multiply the squared difference by the frequency within each class: $(x - \bar{x})^2 f$
5. Sum the result to obtain the total squared deviation from the mean:
$\sum(x - \bar{x})^2 f$
6. Calculate the average of this total by dividing by (n − 1).
7. The standard deviation is the square root of this average.
8. Formula: $s = \sqrt{\frac{\sum(x - \bar{x})^2 f}{n - 1}}$

Example 4.14

The following frequency table shows the time (in minutes) taken to travel to
work for a sample of 25 people from Gauteng. Calculate the standard deviation
of the time to travel to work.
Class boundaries      f      x     $(x - \bar{x})^2 f$
15.5 – < 21.5         2    18.5      348.48
21.5 – < 27.5         6    24.5      311.04
27.5 – < 33.5         8    30.5       11.52
33.5 – < 39.5         4    36.5       92.16
39.5 – < 45.5         4    42.5      466.56
45.5 – < 51.5         1    48.5      282.24
Total                25      -     1 512.00
1. $\bar{x} = \frac{\sum xf}{n} = \frac{792.5}{25} = 31.7$ minutes (from example 4.12)
2. Subtract 31.7 from each x-value and square the difference.
3. Multiply each squared difference by the frequency of that class and
record the answers in the $(x - \bar{x})^2 f$ column. The first value in this column is
calculated as (18.5 − 31.7)² × 2.
4. Sum the results.
5. Substitute the totals into the formula to determine the standard deviation.

$$s = \sqrt{\frac{\sum(x - \bar{x})^2 f}{n - 1}} = \sqrt{\frac{1\,512}{25 - 1}} = 7.94 \text{ minutes}$$

A typical deviation of an observed travelling time from the mean is
7.94 minutes.
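
The grouped calculation can be sketched in Python as follows. The function name `grouped_standard_deviation` is illustrative, and each class is again represented by its midpoint.

```python
import math

# Frequency distribution from example 4.14 (travel time in minutes).
boundaries = [(15.5, 21.5), (21.5, 27.5), (27.5, 33.5),
              (33.5, 39.5), (39.5, 45.5), (45.5, 51.5)]
frequencies = [2, 6, 8, 4, 4, 1]


def grouped_standard_deviation(boundaries, frequencies):
    """s = sqrt( sum((x - mean)^2 * f) / (n - 1) ) using class midpoints."""
    midpoints = [(lower + upper) / 2 for lower, upper in boundaries]
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    # Squared deviation of each midpoint, weighted by the class frequency.
    total = sum((x - mean) ** 2 * f for x, f in zip(midpoints, frequencies))
    return math.sqrt(total / (n - 1))


s_time = grouped_standard_deviation(boundaries, frequencies)
print(round(s_time, 2))  # 7.94
```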

Activity 4.11

The following frequency distribution summarises the number of hours of
personal computer usage per week for a sample of 16 people. Calculate the
standard deviation for the sample times.

Class intervals       f
1.95 – < 3.95         2
3.95 – < 5.95         5
5.95 – < 7.95         5
7.95 – < 9.95         3
9.95 – < 11.95        1
Total                16

4.2.4 Variance (s²)


The variance is the standard deviation squared (s²). Although this is a very
popular measure in describing data, the main drawback is that the unit of
measure is also squared. Statistics measured in squared units are problematic to
interpret. If the standard deviation is equal to 5.25 hours, the variance will be
27.56 hours squared.

4.2.5 Coefficient of variation (CV)


The coefficient of variation is a relative measure of dispersion, which is the
ratio, expressed as a percentage, of the standard deviation to the mean. This is
sometimes used as a measure of risk.

$$CV = \frac{s}{\bar{x}} \times 100$$

This is a unit-free number because the standard deviation and mean are
measured using the same units. The higher the result the more variability there
is in a set of data.

All the measures of dispersion described so far have dealt with a single set of
data. In practice it is often important to compare two or more sets of data with
different means, sample sizes or measurement units.

Example 4.15

A manufacturing company produces a product in two sizes: a 1 000 ml bottle
and a 500 ml bottle. Because of mechanical variability in the filling machinery,
the fill volumes have standard deviations of 50 ml and 40 ml respectively. The
machine with the lower CV will be the more consistent.

$$CV(1\,000 \text{ ml}) = \frac{50}{1\,000} \times 100 = 5\%$$

$$CV(500 \text{ ml}) = \frac{40}{500} \times 100 = 8\%$$

For the 1 000 ml bottle, the standard deviation of the filling process is 5% of
the filling mean. For the 500 ml bottle, the standard deviation is 8% of the
filling mean.
Although the machine filling the smaller bottle has a lower standard deviation,
the CVs indicate that it is the machine filling the larger bottle which is relatively
more consistent.
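
The comparison can be sketched in Python as follows. The function name `coefficient_of_variation` is chosen here for illustration. As in the example, the nominal bottle sizes are used as the filling means.

```python
def coefficient_of_variation(standard_deviation, mean):
    """CV = (s / mean) * 100, expressed as a percentage."""
    # Multiplying before dividing gives the same value as (s / mean) * 100.
    return standard_deviation * 100 / mean


cv_large = coefficient_of_variation(50, 1000)   # 1 000 ml bottle
cv_small = coefficient_of_variation(40, 500)    # 500 ml bottle

print(cv_large)  # 5.0
print(cv_small)  # 8.0

# The machine with the lower CV is relatively more consistent.
more_consistent = "1 000 ml" if cv_large < cv_small else "500 ml"
print(more_consistent)  # 1 000 ml
```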

Activity 4.12

Two growers of grapefruit have obtained the following statistics regarding the
mass of their current crops:
Grower A: $\bar{x}$ = 300 g with s = 20 g
Grower B: $\bar{x}$ = 280 g with s = 40 g

Which grower’s grapefruit are more uniform in mass?

4.3  Measures of shape


Measures of shape are tools that can be used to describe the shape of a
distribution. Two measures of shape are:
1. skewness, which measures its symmetry or lack of symmetry
2. kurtosis, which measures its peakedness.


4.3.1 Skewness (SK)


Skewness relates to the symmetry or lack thereof in the shape of the histogram,
polygon, stem-and-leaf or dot plot that you can draw from the data. The shape
influences the locations of the mean, median and mode in the data set, for
example, whether the mean is larger or smaller than the median.
In symmetrical or normal distributions the left half is a mirror image of
the right.
When a symmetrical distribution has a single mode, the mode will be in the
centre of the distribution. Furthermore, the mean and the median will be equal
to the mode. There are no outliers on the one side to pull the mean away from the
bulk of the data. The skewness coefficient will have a zero value.
To portray the shape of a distribution you can make use of the histogram or
a smooth polygon.

mean 5 median 5 mode

A distribution is skewed if the curve appears skewed either to the left or to the
right, meaning that the one tail extends more to one side than the other. The
mode stays at the peak of the distribution because outliers do not influence the
mode at all. The influence of outliers is highest on the arithmetic mean because
the mean is affected by all values in the data set, including the extreme ones, and
tends to be located toward the tail of the skewed distribution. The median, being
dependent on the number of values in the data set rather than on the size of those
values, is less sensitive than the mean, since only the middle measurements are
used for its calculation. It is located somewhere between the mode and the mean.
Positive skewness (or skewed to the right) occurs when the majority of
the data values are concentrated on the left. There are a few data values that
are substantially larger than others and these larger values cause the mean to
increase while having little, if any, effect on the median. The mean will exceed
the median, and both the mean and the median will be greater than the mode.
The tail to the right will be longer than to the left.


mode < median < mean

Negative skewness (or skewed to the left) occurs when the majority of the data values are concentrated
on the right of the distribution. There are a few data values that are substantially
lower than others and these smaller values cause the mean to decrease while
having little, if any, effect on the median. The mean will be less than the median,
and the mode will exceed both the mean and the median. The tail to the left will
be longer than to the right.

mean < median < mode

It is also possible to create arithmetic measures of skewness to determine if the data fits into one or other of the patterns. The measure covered in this text is Pearson’s second coefficient of skewness.

Pearson’s second coefficient of skewness


This coefficient compares the mean and median in the context of the magnitude of the standard deviation.

SK = 3(mean − median) / standard deviation

Simply by knowing the value of the skewness coefficient we can infer the general
shape of the distribution without resorting to a diagram:
• Skewness is measured on a scale: −3 ≤ SK ≤ +3


• If SK = 0, the distribution is normal or symmetrical with the mean = median = mode.
• If SK > 0 but less than 3, the distribution is positively skewed with the mean > median > mode.
• If SK < 0 but greater than −3, the distribution is negatively skewed with the mean < median < mode.

Example 4.16

If the time taken to complete a particular complex task resulted in a mean = 34.34 minutes, a median = 35.27 minutes and s = 6.86 minutes, calculate the coefficient of skewness.

SK = 3(mean − median) / standard deviation = 3(34.34 − 35.27) / 6.86 = −0.41

The distribution is negatively skewed but close to normal.
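
The calculation in Example 4.16 can be reproduced with a few lines of code. The Python sketch below is illustrative only; the function names are hypothetical and the interpretation simply follows the sign of the coefficient.

```python
def pearson_second_skewness(mean, median, std_dev):
    """Pearson's second coefficient: 3(mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


def describe_shape(sk):
    """Interpret the sign of the skewness coefficient."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"


# Values from Example 4.16 (time in minutes).
sk = pearson_second_skewness(mean=34.34, median=35.27, std_dev=6.86)
print(f"SK = {sk:.2f} ({describe_shape(sk)})")
```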

Activity 4.13

In a sample showing the sodium contents (mg/kg) of chocolate pudding made from instant mix, the mean x̄ = 2 965.2, the median = 2 946 and the standard deviation = 543.52.
Calculate the coefficient of skewness for this distribution and interpret your answer. Illustrate your answer with a rough sketch.

4.3.2 Measures of kurtosis


Kurtosis describes the amount of peakedness of a distribution. The flatter the
curve, the greater the spread of the data. This means that the standard deviation
is larger relative to the mean. Although a formula exists to measure kurtosis, it is
easier to determine the extent of the kurtosis by observing the frequency curve
or polygon.
• A distribution that is high and thin is referred to as leptokurtic.
• A distribution that is flat and spread out is referred to as platykurtic.
• A distribution that is more normal in shape, that is neither very peaked nor
flat, is referred to as mesokurtic.


[Figure: three frequency curves labelled leptokurtic, mesokurtic and platykurtic]

4.4 Interpreting centre and variability


1. Dispersion is the amount of spread or scatter that occurs in a data set. It
can be interpreted as the size of a ‘typical’ deviation from the mean. If the
values in the data set are clustered tightly about their mean, the standard
deviation is small, but if the values are widely dispersed about their mean,
the standard deviation is large.
2. In comparing two data sets with the same unit of measure, the one with the
larger standard deviation has the greater amount of variability and the one
with the smaller standard deviation is more consistent, with less variability
among the numbers in the data set.
3. If you have a single data set, the mean can be combined with the standard
deviation to obtain information about how values in a data set are
distributed along a number line. To do this we describe how far away a
particular observation is from the mean in terms of the standard deviation.
For example, we might say that an observation is two standard deviations
above the mean or one standard deviation below the mean. The number of
standard deviations is known as the z-score or the standardised value.

z = (x − x̄) / s

Consider a data set with a mean of 100 and a standard deviation of 15.
• The mean minus one standard deviation = 100 − 15 = 85. This means that 85 is one standard deviation below the mean. The z-score = (85 − 100) / 15 = −1. A z-score is negative if the data value is less than the mean.
• The z-score for a value of 115 = (115 − 100) / 15 = 1. This means that 115 is one standard deviation above the mean. A z-score is positive if a data value is greater than the mean.


• All observations that fall between 85 and 115 are within one standard
deviation from the mean.
• Two standard deviations = 2 × 15 = 30, 100 − 30 = 70 and 100 + 30 = 130. All observations that fall between 70 and 130 are within two standard deviations from the mean.
• 100 + 3(15) = 145. Observations above 145 exceed the mean by more than three standard deviations.
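
The z-scores for this data set can be checked with a small function. The Python sketch below is an illustration only; the function name is hypothetical, and the mean of 100 and standard deviation of 15 are the values used in the bullets above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


mean, std_dev = 100, 15

for x in (70, 85, 115, 130, 145):
    z = z_score(x, mean, std_dev)
    side = "below" if z < 0 else "above"
    print(f"x = {x}: z = {z:+.1f} ({abs(z):.0f} standard deviation(s) {side} the mean)")
```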
4. The following two rules can be applied, depending on the shape of the
distribution.
• If the distribution is symmetrical, you can make a statement about the
proportion of data values that fall within a specified number of standard
deviations of the mean by making use of the empirical rule.
• A more general interpretation of the proportion of data values that fall within a specified number of standard deviations of the mean is derived from Chebysheff’s theorem, which applies to distributions of all shapes.

Empirical rule:
• Approximately 68% of all observations fall within one standard deviation
from the mean.
• Approximately 95% of all observations fall within two standard deviations
from the mean.
• Approximately 99.7% of all observations fall within three standard deviations
from the mean.

Chebysheff’s theorem
• The percentage of observations in any sample that lies within z standard deviations of the mean must be at least (1 − 1/z²) × 100, where z is any value greater than 1.
• For z = 2, at least (1 − 1/2²) × 100 = 75% of all observations will fall within two standard deviations of the mean. That will be the values between x̄ − 2s and x̄ + 2s.
• For z = 3, at least (1 − 1/3²) × 100 = 88.89% of all observations will fall within three standard deviations of the mean. That will be the values between x̄ − 3s and x̄ + 3s.
• For z = 4, at least (1 − 1/4²) × 100 = 93.75% of all observations will fall within four standard deviations of the mean. That will be the values between x̄ − 4s and x̄ + 4s.


Example 4.17

A psychologist randomly selected 10 TV cartoons and counted the number of incidents of verbal and physical violence in each. The counts were as follows:

13 26 16 21 15 31 15 30 14 11

1. x̄ = 19.2 incidents
2. s = 7.33 incidents
3. Range: 31 − 11 = 20
4. The stem-and-leaf plot shows a positively skewed distribution. Pearson’s second coefficient of skewness can also be used to determine the shape.

Stem | Leaf
1    | 1 3 4 5 5 6
2    | 1 6
3    | 0 1

5. This distribution is positively skewed; therefore you can use Chebysheff’s theorem to come to the following conclusions:
• At least 75% of the data points (x) will fall within two standard deviations from the mean:
19.2 + 2(7.33) = 33.86
19.2 − 2(7.33) = 4.54
That will be the values between 4.54 and 33.86.
• Within three standard deviations from the mean you will find at least 89% of all the data values, falling between −2.79 and 41.19.
• As you can see from the range, all the values in the data set fall within two standard deviations from the mean.
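
The bounds used in this example can be reproduced in code. The Python sketch below is an illustration only; the function names are hypothetical, and the mean and standard deviation are recalculated from the ten counts rather than typed in.

```python
from statistics import mean, stdev


def chebyshev_min_percent(z):
    """Minimum percentage of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return (1 - 1 / z**2) * 100


def interval(centre, spread, z):
    """Lower and upper limits of the interval centre ± z × spread, to 2 decimals."""
    return round(centre - z * spread, 2), round(centre + z * spread, 2)


counts = [13, 26, 16, 21, 15, 31, 15, 30, 14, 11]
x_bar = round(mean(counts), 2)   # sample mean
s = round(stdev(counts), 2)      # sample standard deviation (n - 1 divisor)

for z in (2, 3):
    low, high = interval(x_bar, s, z)
    inside = sum(low <= x <= high for x in counts) / len(counts) * 100
    print(f"z = {z}: at least {chebyshev_min_percent(z):.2f}% between {low} and {high}; "
          f"observed {inside:.0f}%")
```

The observed percentage may be much higher than the guaranteed minimum, because the theorem gives only a lower bound.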

Example 4.18

The stem-and-leaf plot below displays the IQ scores of a sample of 112 children.
Stem: tens
Leaf: ones


Stem | Leaf
6    | 1
7    | 25679
8    | 0000124555668
9    | 0000112333446666778889
10   | 0001122222333566677778899999
11   | 00001122333344444477778899999
12   | 01111123445669
13   | 006
14   | 26
15   | 2

The summary statistics for this distribution are: x̄ = 104.5 and s = 16.3

This distribution can be reasonably well described by a symmetrical or normal shape; therefore the empirical rule can be applied.
• Within one standard deviation of the mean:
[104.5 − 1(16.3)] and [104.5 + 1(16.3)] = 88.2 and 120.8
Approximately 68% of the IQ scores are between 88.2 and 120.8.
• Within two standard deviations of the mean:
[104.5 − 2(16.3)] and [104.5 + 2(16.3)] = 71.9 and 137.1
Approximately 95% of the IQ scores are between 71.9 and 137.1.
• Within three standard deviations of the mean:
[104.5 − 3(16.3)] and [104.5 + 3(16.3)] = 55.6 and 153.4
Approximately 99.7% of the IQ scores are between 55.6 and 153.4.
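
The three intervals can also be generated in one step. The Python sketch below is an illustration only; the function name is hypothetical and the percentages are the approximate values stated in the empirical rule.

```python
def empirical_intervals(mean, std_dev):
    """Return {k: (lower, upper, approximate % covered)} for k = 1, 2, 3."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (round(mean - k * std_dev, 1), round(mean + k * std_dev, 1), pct)
        for k, pct in coverage.items()
    }


# Summary statistics from Example 4.18 (IQ scores).
intervals = empirical_intervals(mean=104.5, std_dev=16.3)

for k, (low, high, pct) in intervals.items():
    print(f"Within {k} standard deviation(s): about {pct}% between {low} and {high}")
```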

Activity 4.14

The following frequency table shows the time (in minutes) taken to travel to work
for a sample of 25 people from Gauteng. Interpret the centre and variability for
the sample.
Class boundaries    f
15.5 – < 21.5       2
21.5 – < 27.5       6
27.5 – < 33.5       8
33.5 – < 39.5       4
39.5 – < 45.5       4
45.5 – < 51.5       1
Total               25


Use the histogram or Pearson’s second coefficient of skewness to determine if the distribution is close to normal or skewed. From previous examples we know that:
x̄ = 31.7 min
s = 7.94 min
median = 30.88 min

Activity 4.15

Interpret the centre and variability of the number of cars entering a parking
area during a sample of 10-minute intervals. From previous examples we know
that:
x̄ = 19.6
s = 8.72
median = 22.5

10 22 31 9 24 27 29 9 23 12

4.5  Measures of relative standing


These measures are used to determine the position of an observation in a set of
data in relation to the other values in the set. The most familiar measures are
quartiles and percentiles, also known as fractiles. Each fractile has a position and a value.
A percentile is a value in a sample below which a certain percentage of the sample data falls. Percentiles divide the data into 100 equal parts and each percentile (Pj) is a value such that j% of the observations are smaller. (j can have a value between 1 and 100.) You can also say that (100 − j)% of the observations are more than this Pj.
Certain percentiles are used frequently. These are the 25th percentile and the
75th percentile, also known as the first and third quartiles. Quartiles divide
ordered data into four equal parts. The first quartile (Q1) is a value such that 25%
of the observations are smaller and the third quartile (Q3) is a value such that
75% of the values are smaller.
The median, which is a measure of central tendency, is also a measure of
relative standing. As you have learned previously, the median divides the data
into two equal parts, the bottom 50% and the top 50%. The median is the middle


quartile, Q2 or P50. The exact location must therefore be determined before the
value can be calculated.

Ungrouped data

Steps
1. Arrange the numbers in numerical order.
2. Change quartiles to percentiles (Q1 = P25 and Q3 = P75).
3. Determine the position of the percentile or quartile you want to obtain. (Where in the numerical list will you locate this value?) Use the following formula to determine the position of the quartile or percentile:
Position of Pj = jn/100
with j the jth percentile and n the number of observations.
4. If the position results in a fraction, choose the next larger integer. If the
position results in an integer, add 0.5.
5. Read the value from the numerical list.

Example 4.19

A psychologist randomly selected 10 TV cartoon shows and counted the number


of incidents of verbal and physical violence in each. The counts were as follows:

26 13 16 21 15 31 15 30 14 11

Determine Q3, Q1, P80 and P20


1. Numerical order:
11 13 14 15 15 16 21 26 30 31

2. Determine the position of each value and read the value from the array.
• Q1 position = P25 = jn/100 = 25(10)/100 = 2.5, rounded up to position 3.
Q1 value = 14. This means that 25% of the TV cartoons have fewer than 14 incidents of verbal and physical violence per cartoon and the other 75% have more than 14 incidents per cartoon.
• Q3 position = P75 = 75(10)/100 = 7.5, rounded up to position 8.
Q3 value = 26. This means that 75% of the cartoons have fewer than 26 incidents of verbal and physical violence per cartoon and 25% have more than 26 incidents.
• P80 position = 80(10)/100 = 8, which is an integer, so add 0.5 to obtain position 8.5.
P80 value = (26 + 30)/2 = 28
This means that 80% of the cartoons have fewer than 28 incidents and 20% have more than 28 incidents of verbal and physical violence.
• P20 position = 20(10)/100 = 2, which is an integer, so add 0.5 to obtain position 2.5.
P20 value = (13 + 14)/2 = 13.5
This means that 20% of the cartoons have fewer than 13.5 incidents.
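
The steps for ungrouped data can be written as a short routine. The Python sketch below is an illustration only; the function names are hypothetical. It applies the position rule given in the steps above (round a fractional position up, add 0.5 to a whole-number position), which differs from the interpolation rules used by many software packages, so other tools may return slightly different values.

```python
def fractile_position(j, n):
    """Position of the j-th percentile among n ordered values (1-based)."""
    whole, remainder = divmod(j * n, 100)
    if remainder == 0:
        return whole + 0.5        # whole number: halfway to the next value
    return float(whole + 1)       # fraction: next larger integer


def percentile(data, j):
    """Value of the j-th percentile (j a whole number from 1 to 99)."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < j < 100:
        raise ValueError("j must lie between 1 and 99")
    ordered = sorted(data)
    position = fractile_position(j, len(ordered))
    index = int(position)         # 1-based position of the lower value
    if position == index:
        return ordered[index - 1]
    # Position ends in .5: average the two neighbouring values.
    return (ordered[index - 1] + ordered[index]) / 2


counts = [26, 13, 16, 21, 15, 31, 15, 30, 14, 11]
for label, j in (("Q1", 25), ("Q3", 75), ("P80", 80), ("P20", 20)):
    print(f"{label}: position {fractile_position(j, len(counts))}, value {percentile(counts, j)}")
```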

Activity 4.16

The following data shows the number of cars entering a parking area during a
sample of 10-minute intervals.

10 22 9 24 27 29 9 23 12 31

Calculate Q1, Q3, P90 and P10. Interpret your answers in context of the data.

Grouped data
Steps
1. Construct a frequency distribution with classes and frequencies.
2. Construct the cumulative ‘less than’ frequency column.
3. Determine the position of the quartile or percentile value.

Position Qj = jn/4

Position Pj = jn/100


4. Compare the position of the fractile with the cumulative frequencies to determine which one of the intervals contains the fractile.
5. Assume that the frequencies in the fractile interval are evenly spread and estimate the value of the fractile using the fractile formula for grouped data.

Qj = L + [(jn/4 − F) / fQ] × c

Pj = L + [(jn/100 − F) / fP] × c

Where:
L = lower boundary of the quartile or percentile class
n = Σf
fQ = frequency of the quartile class
fP = frequency of the percentile class
c = width of the class
F = sum of the frequencies up to but not including the chosen class

These values can also be determined graphically by making use of the ogive.

Steps
1. Locate the position of the fractile on the y-axis.
2. Draw a horizontal line from the y-axis to the ogive.
3. Drop a straight line down to the x-axis.
4. Read the estimated value of the fractile from the x-axis.

Example 4.20

The administrator of a hospital conducted a survey of the number of days


patients stayed in hospital following an operation.


Days in hospital    f      cum < f
0.5 – < 3.5         32     32
3.5 – < 6.5         108    140
6.5 – < 9.5         67     207
9.5 – < 12.5        28     235
12.5 – < 15.5       14     249
15.5 – < 18.5       6      255
Total               255

1. Position of Q1 = 1n/4 = 1(255)/4 = 63.75

Position of Q3 = 3n/4 = 3(255)/4 = 191.25

Position of P85 = 85n/100 = 85(255)/100 = 216.75

2. Compare the positions with the cum < f column:
Q1 will fall in interval 2 because value numbers 33 to 140 are contained in this class.
Q3 will fall in interval 3 because value numbers 141 to 207 are contained in this class.
P85 will fall in interval 4 because value numbers 208 to 235 are contained in this class.
3. Qj = L + [(jn/4 − F) / fQ] × c

Q1 = 3.5 + [(1(255)/4 − 32) / 108] × 3 = 4.38

This means that 25% of the patients stay less than 4.38 days in hospital. The rest, 75% of the patients, stay longer than 4.38 days.
4. Q3 = 6.5 + [(3(255)/4 − 140) / 67] × 3 = 8.79

This means that 75% of the patients stay less than 8.79 days in hospital.


5. Pj = L + [(jn/100 − F) / fP] × c

P85 = 9.5 + [(85(255)/100 − 207) / 28] × 3 = 10.54

This means that 85% of the patients stay less than 10.54 days in the hospital. The other 15% of the patients stay longer than 10.54 days.
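
The grouped-data formula can be sketched in code as follows. This Python example is an illustration only; the function name and argument layout are hypothetical, and it assumes classes of equal width with at least one non-empty class.

```python
def grouped_fractile(lower_bounds, frequencies, width, position):
    """Estimate a fractile as L + ((position - F) / f) * c."""
    if position <= 0:
        raise ValueError("position must be positive")
    cumulative = 0  # F: sum of frequencies before the current class
    for lower, f in zip(lower_bounds, frequencies):
        if cumulative + f >= position:
            # The fractile lies in this class; interpolate within it.
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position exceeds the total frequency")


# Frequency distribution from Example 4.20 (days in hospital).
lower_bounds = [0.5, 3.5, 6.5, 9.5, 12.5, 15.5]
frequencies = [32, 108, 67, 28, 14, 6]
n = sum(frequencies)

q1 = grouped_fractile(lower_bounds, frequencies, 3, 1 * n / 4)
q3 = grouped_fractile(lower_bounds, frequencies, 3, 3 * n / 4)
p85 = grouped_fractile(lower_bounds, frequencies, 3, 85 * n / 100)

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, P85 = {p85:.2f}")
```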

Activity 4.17

The following frequency table shows the time (in minutes) taken to travel to work
for a sample of 25 people from Gauteng. Calculate Q1, Q3, P90 and P10. Interpret
your answers in context of the data.
Class boundaries    f
15.5 – < 21.5       2
21.5 – < 27.5       6
27.5 – < 33.5       8
33.5 – < 39.5       4
39.5 – < 45.5       4
45.5 – < 51.5       1
Total               25

4.6  Measuring dispersion using measures of relative standing

These measures of dispersion are resistant to the effect of extreme numbers.
They are frequently used in skewed distributions to give an estimate of spread.

4.6.1 Interfractile ranges


These ranges are measures of the spread between two fractiles in a distribution.
1. The interquartile range includes approximately the middle 50% of values and is
the difference between the Q3 and Q1 values. This means that the first 25% and
the last 25% of the data are cut off. Large values of this statistic indicate that
the first and third quartiles are far apart, indicating a high level of variability.
First 25%    First 50%    First 75%    100%
   Q1           Q2           Q3
   P25          P50          P75


The interquartile range = Q3 − Q1 = the middle two quarters or middle 50% range.

2. A middle range is the middle proportion of the data between two percentiles
where the cut-off portions, at the beginning of the data set and the end of
the data set, are equal.
• Middle 80% range = P90 − P10
[This means that the first and the last 10% of the data are cut off.]
• Middle 40% range = P70 − P30
[This means that the first and the last 30% of the data are cut off.]

Example 4.21

Refer to Example 4.19: a psychologist randomly selected 10 TV cartoon shows and counted the number of incidents of verbal and physical violence in each. The counts were as follows:

26 13 16 21 15 31 15 30 14 11
Q1 value = 14
Q3 value = 26
P80 value = 28
P20 value = 13.5

Interquartile range = Q3 − Q1 = 26 − 14 = 12
Middle 60% range = P80 − P20 = 28 − 13.5 = 14.5

Activity 4.18

The following data shows the number of cars entering a parking area during a
sample of 10-minute intervals. Calculate the middle 80% range, middle 70%
range and the middle 60% range.

10 22 9 24 27 29 9 23 12 31

4.6.2 Quartile deviation


This measure is associated with the median and is a better measure of dispersion
than the range, but it ignores 50% of the values: the first 25% and the last 25%.


Steps
1. Determine the Q3 and Q1 values.
2. Subtract Q1 from Q3 and divide the difference by 2.

Example 4.22

The administrator of a hospital conducted a survey of the number of days patients


stayed in hospital following an operation. Calculate the quartile deviation.
Days in hospital    f      cum < f
0.5 – < 3.5         32     32
3.5 – < 6.5         108    140
6.5 – < 9.5         67     207
9.5 – < 12.5        28     235
12.5 – < 15.5       14     249
15.5 – < 18.5       6      255

Q1 = 4.38
Q3 = 8.79 (from Example 4.20)

Quartile deviation = QD = (Q3 − Q1) / 2 = (8.79 − 4.38) / 2 = 2.2

Activity 4.19

Determine the quartile deviation for the number of cars entering a parking area
during a sample of 10-minute intervals.

10 22 9 24 27 29 9 23 12 31

4.6.3 Five-number summary table


To get a quick summary of both centre and spread, use the five-number summary.
The following descriptive statistical measures are used to summarise the data:
1. The smallest data value (S)
2. The lower quartile (Q1)
3. The median (Q2)
4. The upper quartile (Q3)
5. The largest data value (L).

The easiest way to develop a five-number summary is to arrange the data in numerical order and identify the smallest and largest values, the quartiles and the median.

Note: The five numerical values divide the data set into four subsets, with
approximately 25% of the observations in each quarter.

S Q1 Med Q3 L

Example 4.23

A psychologist randomly selected 10 TV cartoons and counted the number of


incidents of verbal and physical violence in each. The counts were as follows:

11 13 14 15 15 16 21 26 30 31

The five-number summary table:

1. The smallest data value    S = 11
2. The lower quartile         Q1 = 14
3. The median                 Q2 = 15.5
4. The upper quartile         Q3 = 26
5. The largest data value     L = 31

Activity 4.20

Do a five-number summary table for the number of cars entering a parking area
during a sample of 10-minute intervals.

10 22 9 24 27 29 9 23 12 31

4.6.4 Box-and-whisker plot


The five-number summary table can be displayed graphically using a box-and-
whisker plot, also referred to as a box plot.


It visually shows the range of the data values by indicating the smallest and
largest values, the first and third quartiles, the median (to show where the data
is centred) and how spread out the data is. The degree of symmetry can be
identified by inspection.
• It is compact, and provides information about centre, spread, symmetry and
the presence of outliers.
• These plots are particularly useful when you want to compare several sets of
related data.

Steps
1. Draw a horizontal x-axis which covers the range of the data values (S and L).
2. Do the five-number summary table.
3. Draw a rectangular box horizontal to the x-axis whose left edge is at the
lower quartile value (Q1) and whose right edge is at the upper quartile value
(Q3). The box width is the interquartile range and shows the spread of the
middle 50% of the data.
4. Draw a vertical line inside the box at the median.
5. Draw whiskers (horizontal lines) from the midpoint of each end of the box out to the smallest value and to the largest value.
6. Place markers at distances 1.5 times the interquartile range from either end of the box. These are known as the inner fences.
7. Outliers are data values that fall outside the inner fences, that is, below the left inner fence or above the right inner fence.
8. A box plot also shows the symmetry or skewness of a distribution. In a
symmetric distribution the Q1 and Q3 values are equally distant from the
median. If the distribution is skewed to the right, the Q3 value will be farther
away from the median than the Q1. If the distribution is skewed to the left,
Q3 will be closer to the median than Q1.

Example 4.24

A psychologist randomly selected 10 TV cartoon shows and counted the number


of incidents of verbal and physical violence in each. The counts were as follows:

11 13 14 15 15 16 21 26 30 31


The five-number summary table:

1. The smallest data value    S = 11
2. The lower quartile         Q1 = 14
3. The median                 Med = 15.5
4. The upper quartile         Q3 = 26
5. The largest data value     L = 31
6. Left inner fence: 14 − (1.5 × 12) = −4
7. Right inner fence: 26 + (1.5 × 12) = 44

[Figure: box plot drawn on an axis from −5 to 45, with Q1 and Q3 marked at the edges of the box]

Interpretation
The five-number summary values are all indicated on the plot:
• Outliers are any observations larger than 26 + 1.5(12) = 44 or smaller than 14 − 1.5(12) = −4. The whisker to the left extends to 11, which is the smallest value and not an outlier. The whisker to the right extends to 31, which is the largest value and not an outlier. That means there are no outliers.
• The Q1 is closer to the median than Q3; therefore the distribution is positively skewed.
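
The five-number summary and the inner fences for this example can be checked in code. The Python sketch below is an illustration only; the function name is hypothetical, and the quartiles Q1 = 14 and Q3 = 26 are taken from the worked example rather than recalculated, because software packages often use a different quartile rule.

```python
from statistics import median

counts = [11, 13, 14, 15, 15, 16, 21, 26, 30, 31]
q1, q3 = 14, 26  # quartiles as obtained in the worked example


def inner_fences(q1, q3):
    """Return (left, right) fences placed 1.5 interquartile ranges beyond the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


summary = (min(counts), q1, median(counts), q3, max(counts))
left_fence, right_fence = inner_fences(q1, q3)
outliers = [x for x in counts if x < left_fence or x > right_fence]

# Compare the distances from each quartile to the median to judge skewness.
med = summary[2]
shape = "positively skewed" if (q3 - med) > (med - q1) else "not positively skewed"

print("Five-number summary:", summary)
print("Inner fences:", left_fence, right_fence)
print("Outliers:", outliers)
print("Shape:", shape)
```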

Activity 4.21

The number of cars entering a parking area during a sample of 10-minute


intervals is given below. Portray the five-number summary table for the
distribution as a box plot.

10 22 9 24 27 29 9 23 12 31

TEST YOURSELF 4

For the distributions below, calculate and interpret in the context of the data (if
possible):
• arithmetic mean
• median
• mode


• range
• mean absolute deviation
• standard deviation
• variance
• coefficient of variation
• Pearson’s second coefficient of skewness
• draw graphs to determine median and mode
• interquartile range
• quartile deviation
• box plot.
1. The following data represents the pulse rates (beats per minute) of nine
students enrolled for the statistics course:
76 60 60 81 72 80 80 68 73

2. In a research study concerning the long-term effectiveness of nicotine patches on participants who had previously smoked 20 cigarettes per day, a sample of 15 participants reported that they now smoke the following number of cigarettes per day:
10 10 7 10 10 9 8 10 9 8 6 9 8 10 8

3. The amount of aluminium contamination (in parts per million) in plastic wraps used to cover food was recorded for a sample of 26 plastic specimens:
172 102 30 182 115 30 60 118 183
63 119 191 70 119 222 79 120 244
87 125 291 96 140 511 101 145

4. During a quality assurance check, the actual coffee content (in grams) of six
jars of instant coffee was recorded as:
82.9 76.9 88.0 82.5 82.4 82.8

5. A sample of apples, guavas and mangos were analysed for the pesticide residues
in the fruit. The amounts, in mg/kg, of a certain pesticide were as follows:
0.2 1.6 4.0 5.4 5.7 11.4 0.2 3.4 2.4 6.6 4.2 2.7

6. During one month, records show the following results for the number of
workers absent per day:


13 14 9 17 21 10 15 22 19 13 5
22 13 19 23 17 21 10 9 20 18

7. The daily sales of a small business (in R’000) are given below for an eight-
day period:
8.2 11.5 10.1 9.4 15.1 6.1 10.3 12.3

8. The following sample of lifetimes (in hours) of a certain type of battery used
in a remote control is recorded as follows:
5.5 5.1 6.2 6.5 5.8 5.6 5.8 6.0

9. Corrosion of reinforcing steel is a serious problem in concrete structures located in the coastal areas. Researchers have been investigating the use of reinforced bars made of composite material. In one study, glass-fibre-reinforced plastic bars were bonded to concrete. The following data was recorded on measured bond strength:
12.1 9.9 7.8 6.2 6.6 7.0 5.5 5.1 5.2 4.8 15.2 3.8
5.4 5.2 4.9 10.7 13.1 8.5 3.4 20.6 13.8 12.6 4.1 8.9

10. The following frequency distribution summarises a sample of daily sales (in R’000):
Sales (R’000)     Frequency
96.5 – < 96.9     1
96.9 – < 97.3     8
97.3 – < 97.7     14
97.7 – < 98.1     22
98.1 – < 98.5     19
98.5 – < 98.9     32
98.9 – < 99.3     6
99.3 – < 99.7     4

11. The number of minutes after their appointment times each of a random
sample of 64 patients had to wait to be served in a major local health facility
were observed as follows:
Waiting time    Number of patients
0 – < 4         10
4 – < 8         17
8 – < 12        16
12 – < 16       14
16 – < 20       7


12. The amount of caffeine in a sample of 250 ml servings of brewed coffee is summarised in the table below:
Caffeine (mg)       Number of cups
59.5 – < 81.5       1
81.5 – < 103.5      12
103.5 – < 125.5     25
125.5 – < 147.5     10
147.5 – < 169.5     2

13. The lecturer in computer science recorded the amount of computer time (in
minutes) needed by each student to complete an assignment:
Time           Number of students
0.1 – < 0.5    3
0.5 – < 0.9    10
0.9 – < 1.3    16
1.3 – < 1.7    9
1.7 – < 2.1    5

14. A study of the number of trips on a particular day for a sample of 40 taxi
drivers revealed the following data:
Number of trips    Frequency
0 – < 5            3
5 – < 10           6
10 – < 15          8
15 – < 20          13
20 – < 25          7
25 – < 30          3

15. A factory manager records the yearly sick leave (rounded to the nearest half
day) taken by his employees:

Number of days    Number of employees
0 – < 2.5         13
2.5 – < 5.0       7
5.0 – < 7.5       17
7.5 – < 10.0      10
10.0 – < 12.5     3
12.5 – < 15.0     2


16. The engineering division of Continental Motors planned a campaign to improve plant safety. In preparation, the following accident data was compiled for a sample of 50 weeks:

Number of accidents    Number of weeks
0 – < 5                6
5 – < 10               25
10 – < 15              11
15 – < 20              7
20 – < 25              1

17. The Strongbo Rubber Company has two factories. Both factories employ
students during the holiday seasons. In factory A, the students are paid on
average R982 per week with a standard deviation of R158. In factory B,
the students earn on average R1 208 per week with a standard deviation of
R214. Which factory has the greatest relative dispersion?

18. Data has been collected on the life (in hours) of two brands of light bulbs.
Compare the two brands using the coefficient of variation.

Brand A                        Brand B
Mean = 5 800                   Mean = 5 770
Standard deviation = 5 100     Standard deviation = 560

19. The operations manager of a package delivery service is deciding whether to purchase a new fleet of trucks. When packages are stored in the trucks in preparation for delivery, two major constraints need to be considered: the weight and the volume of each item. He samples 200 packages and finds that the mean weight is 13.0 kg with a standard deviation of 1.7 kg. The mean volume is 0.8 m³ with a standard deviation of 0.2 m³. Compare the variation of the weight and of the volume.


UNIT 5  Index numbers

Index numbers are used in business and economics as indicators to measure how
much an economic variable changes over time or differs between two locations.

After completion of this unit you will be able to:


• calculate simple index numbers
• calculate a composite index number
• know the difference between unweighted and weighted index numbers
• change the base period
• calculate link relatives and percentage point changes
• understand the consumer price index.

An index number measures the change in a variable over time, geographical area or some other characteristic, relative to the value of the variable during a preselected base period. The reason we are concerned with past changes is that we base business forecasts on what has happened in the past.

Uses of index numbers


1. Index numbers are commonly used in business and economics as indicators
of changing business or economic activity, such as price levels, money
markets and the economic cycle of prosperity, recession, depression and
recovery. Some examples of well-known index numbers in South Africa are:
• consumer price index (CPI)
• producer price index (PPI)
• new car sales index
• JSE indexes
• unemployment rate.
2. Many business and economic activities are guided by index numbers. For
example, the CPI plays a major role in decisions on salary increases.
3. Index numbers are used to study trends by measuring changes over
a period of time. These trends are used in forecasting future economic
activities.
4. Index numbers are used to measure the purchasing power of money. For
example, nominal wages can be converted to real wages by removing the
effect of changes in the cost of living: divide the nominal wage by the CPI
and multiply by 100. Comparing real wages over time then shows whether
there has been an increase, a decrease or no change in purchasing power.

The types of indexes that dominate business and economic applications are:
1. Price indexes ($I_P$), which are the most frequently used. They measure the
percentage change in the price of a product or group of products between
two time periods (or across some other characteristic). The current unit
price of a product is expressed as a percentage of the unit price in the base
period.
2. Quantity or volume indexes ($I_Q$), which measure how much the quantity of a
commodity or group of commodities changes over time.

Choosing a base period


The base period is the time in the past against which all comparisons are made.
In constructing an index number to describe the relative changes in commodity
prices or quantities, we must first choose a base period.
1. This period should be recent enough so that comparisons are not greatly
affected by changing technology and consumer behaviour.
2. This period should be one of relative economic stability without any
abnormal influences.
3. The base year index is always given the value of 100 and each subsequent
year will have a value above or below 100, depending on whether there has
been an increase or decrease in the data compared with the base year. For
example, an index number of 124 will mean an increase of 24% since the
base year and an index number of 93 will mean a decrease of 7% since the
base year.
4. Decide on whether to use a fixed base or a chain base. In the fixed-base
method the time period which the values are compared to remains constant.
In the chain-base method the current price or quantity is linked to the
previous period.

An index number that represents a comparison for a single item is a simple index
number. In contrast, when the index number has been constructed for a group of
items, known as a basket of goods, it is an aggregate or composite index number.


Index number structure

• Simple index
• Composite index
  • Unweighted index: simple unweighted composite index
  • Weighted index: Laspeyres index, Paasche index

5.1 Construction of a simple index number


A simple index number measures the changes (as a percentage) in price or quantity
of a single item over time. It is calculated by dividing the current value (numerator)
by a base value (denominator) and then multiplying the result by 100.

Steps
1. Obtain the prices or quantities for the product over the time period of interest.
2. Select the period to be used as base.
3. Divide the current price ($P_i$) of the commodity by the base price ($P_b$).
4. Multiply this ratio by 100.
5. Price index:

$I_P = \frac{\text{current price}}{\text{base price}} \times 100 = \frac{P_i}{P_b} \times 100$

$P_i$ represents the current period price and $P_b$ the base period price.
6. The formula for a quantity index is obtained by replacing P with Q in the
price index formula:

$I_Q = \frac{Q_i}{Q_b} \times 100$

$Q_i$ represents the current period quantity and $Q_b$ the base period quantity.


Example 5.1

If milk cost R3.50 per litre in 2012 and R3.85 in 2013, the simple price index
for 2013 will be

$I_P = \frac{P_i}{P_b} \times 100 = \frac{3.85}{3.50} \times 100 = 110$

This means that milk increased in price by 10% during the period concerned.
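
The calculation in Example 5.1 can be sketched in a few lines of Python. This is only an illustration: the function name `simple_index` is an assumption made for this sketch and does not come from the text. The same function gives a price index or a quantity index, depending on whether prices or quantities are passed in.

```python
def simple_index(current: float, base: float) -> float:
    """Return the current value expressed as a percentage of the base value."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return current / base * 100


# Milk prices from Example 5.1 (rands per litre): 2012 is the base, 2013 is current.
milk_index = simple_index(current=3.85, base=3.50)
print(round(milk_index, 1))  # 110.0, i.e. a 10% increase on the base period
```

An index above 100 indicates an increase relative to the base period; an index below 100 indicates a decrease.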

Activity 5.1

The following table provides the price per 500 g and the quantities purchased of
nuts during the years 2012 and 2013. Use 2012 as base and construct simple
price and quantity indexes per commodity for 2013.
           Prices per kg (rands)    Quantities purchased (kg)
           2012      2013           2012      2013
Peanuts    45        50             600       550
Pecans     55        60             300       350
Cashews    60        70             325       325

5.2 Construction of composite (or aggregate) index numbers

These indexes are used to measure the relative change for a basket of related
products.
• If each product item in the overall index is of equal importance, the index is said
to be unweighted and only prices or quantities are used in the construction of
the index number. This particular index ignores both consumption patterns
and the units to which prices refer.
• If each product item in the overall index is not of equal importance, the index
is said to be a weighted index. Weights are assigned to the different products
included in the index to reflect the relative importance of each product. The
price of an item is generally weighted by the quantity sold during that period.
Prices and quantities are used in the construction of a weighted index.


5.2.1 Unweighted index numbers


Simple unweighted composite index number
All the products in the basket of goods are included in the calculation. This index
considers each product in the basket as equally important.
Divide the total of the current year prices for the various products in the basket
by the total of the base year prices.
$I_P = \frac{\sum P_i}{\sum P_b} \times 100$

Steps
1. Obtain the prices for each commodity in the basket over the time period of
interest.
2. Select the base period.
3. Sum the prices of all the items in the current period ($\sum P_i$).
4. Sum the prices of all the items in the base period ($\sum P_b$).
5. Divide the numerator ($\sum P_i$) by the denominator ($\sum P_b$).
6. Multiply the result by 100.
7. Interpret the answer.
8. If you want to calculate a simple unweighted quantity index, substitute all
the prices with quantities.

$I_Q = \frac{\sum Q_i}{\sum Q_b} \times 100$

Example 5.2

The following table shows the unit prices of the course material that a
student needs for a course in statistics:

                 2012 ($P_b$)   2013 ($P_i$)
Textbook         203            229
Calculator       80             70
Answer manual    10             10
Total            293            309

$I_P = \frac{\sum P_i}{\sum P_b} \times 100 = \frac{309}{293} \times 100 = 105.46$


This means that the prices increased by 5.46% over the period under
consideration.
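
As a rough sketch, the unweighted composite index of Example 5.2 can be computed as follows. The function name `unweighted_composite_index` and the dictionary layout are assumptions made for illustration; the prices are those given in the example.

```python
def unweighted_composite_index(current_values, base_values) -> float:
    """Sum of current-period values as a percentage of the sum of base-period values."""
    total_base = sum(base_values)
    if total_base == 0:
        raise ValueError("base-period total must be non-zero")
    return sum(current_values) / total_base * 100


# Unit prices from Example 5.2 (2012 is the base period, 2013 the current period).
base_prices = {"Textbook": 203, "Calculator": 80, "Answer manual": 10}
current_prices = {"Textbook": 229, "Calculator": 70, "Answer manual": 10}

index = unweighted_composite_index(current_prices.values(), base_prices.values())
print(round(index, 2))  # 105.46
```

Because every item enters the totals with the same weight, an expensive item such as the textbook dominates the result, whatever quantity of it is actually bought.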

Activity 5.2

Construct a simple unweighted price and quantity index for 2013.


           Prices per kg (R)    Quantities purchased (kg)
           2012      2013       2012      2013
Peanuts    45        50         600       550
Pecans     55        60         300       350
Cashews    60        70         325       325

5.2.2 Weighted composite index numbers


These methods take into account the different consumption (or quantity) levels
of the products in the basket of goods. An important factor in applying this
method is the quantity used.
The Laspeyres and Paasche index numbers are the most popular weighted
composite index numbers.

Laspeyres index
This method uses quantities consumed during the base period as a weighting
factor and assumes that whatever the price changes, the quantities purchased
will remain the same. The changes in the index can therefore be attributed to
price changes. Using this method will be misleading when the buying quantities
change significantly from those in the base period. One solution to the problem
of buying quantities that change relative to those of the base period is to change
the base period regularly, so that the quantities are regularly updated.
Laspeyres index may lead to an overestimation of the inflation rate because
people tend to reduce consumption of items which get more expensive.
The best-known Laspeyres index is the consumer price index (CPI).
$I_{P(L)} = \frac{\sum P_i Q_b}{\sum P_b Q_b} \times 100$


Steps
1. Collect price and quantity information for each of the product items to be
used in the composite index.
2. Select the base period.
3. Denote the current prices and quantities as Pi and Qi respectively.
4. Denote the base prices and quantities as Pb and Qb respectively.
5. Multiply the current period price ($P_i$) by the base period quantity ($Q_b$) for
each product item and sum the resulting values ($\sum P_i Q_b$).
6. Multiply the base period price ($P_b$) by the base period quantity ($Q_b$) for each
product item and sum the resulting values ($\sum P_b Q_b$).
7. Divide the first sum by the second sum and multiply the result by 100.
8. Interpret the answer.

Paasche index
This method uses quantities of the products in the basket consumed during the
current period as a weighting factor. It measures the change in total cost of goods
that represent a consumption pattern typical of the current year, and therefore
avoids the problem of changing consumption patterns. This can lead to an
underestimation of the rise in prices (the inflation rate). It is also difficult to
make year-to-year comparisons because the weights change with every new
current period.
$I_{P(P)} = \frac{\sum P_i Q_i}{\sum P_b Q_i} \times 100$

Steps
1. Collect price and quantity information for each of the product items to be
used in the composite index.
2. Select the base period.
3. Denote the current prices and quantities as Pi and Qi respectively.
4. Denote the base prices and quantities as Pb and Qb respectively.
5. Multiply the current period price ($P_i$) by the current period quantity ($Q_i$) for
each product item and sum the resulting values ($\sum P_i Q_i$).
6. Multiply the base period price ($P_b$) by the current period quantity ($Q_i$) for
each item and sum the resulting values ($\sum P_b Q_i$).
7. Divide the first sum by the second sum and multiply the result by 100.
8. Interpret the answer.


Example 5.3

The following table shows a farmer’s orchard production. Construct Laspeyres
and Paasche price indexes for the year 2013 with 2012 as the base.

           Price per box    No. of boxes (’000)
           2012    2013     2012    2013
           P_b     P_i      Q_b     Q_i      P_iQ_b   P_bQ_b   P_bQ_i   P_iQ_i
Apples     20      25       2.0     1.5      50.0     40.0     30.0     37.5
Oranges    15      17       1.4     1.6      23.8     21.0     24.0     27.2
Mangos     40      35       5.0     2.0      175.0    200.0    80.0     70.0
Bananas    30      38       7.0     9.0      266.0    210.0    270.0    342.0
Total                                        514.8    471.0    404.0    476.7

$I_{P(L)} = \frac{\sum P_i Q_b}{\sum P_b Q_b} \times 100 = \frac{514.8}{471.0} \times 100 = 109.3$

Price increase of 9.3% since 2012.


$I_{P(P)} = \frac{\sum P_i Q_i}{\sum P_b Q_i} \times 100 = \frac{476.7}{404.0} \times 100 = 118$

Price increase of 18% since 2012.
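
The two weighted indexes in Example 5.3 differ only in which period's quantities are used as weights. The sketch below makes that explicit; the function names are assumptions chosen for illustration, and the data are the orchard figures from the example (quantities in thousands of boxes).

```python
def laspeyres_price_index(base_prices, current_prices, base_quantities) -> float:
    """Weighted price index using base-period quantities as weights."""
    numerator = sum(p_i * q_b for p_i, q_b in zip(current_prices, base_quantities))
    denominator = sum(p_b * q_b for p_b, q_b in zip(base_prices, base_quantities))
    return numerator / denominator * 100


def paasche_price_index(base_prices, current_prices, current_quantities) -> float:
    """Weighted price index using current-period quantities as weights."""
    numerator = sum(p_i * q_i for p_i, q_i in zip(current_prices, current_quantities))
    denominator = sum(p_b * q_i for p_b, q_i in zip(base_prices, current_quantities))
    return numerator / denominator * 100


# Example 5.3: apples, oranges, mangos, bananas (2012 = base, 2013 = current).
p_base = [20, 15, 40, 30]
p_current = [25, 17, 35, 38]
q_base = [2.0, 1.4, 5.0, 7.0]
q_current = [1.5, 1.6, 2.0, 9.0]

print(round(laspeyres_price_index(p_base, p_current, q_base), 1))   # 109.3
print(round(paasche_price_index(p_base, p_current, q_current), 1))  # 118.0
```

The gap between the two results reflects how much the quantities changed between the two years, most visibly for mangos, where the quantity fell from 5.0 to 2.0.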

Activity 5.3

Construct weighted indexes for 2013.


           Prices (R) per 500 g    Quantities (500 g)
           2012      2013          2012      2013
Peanuts    45        50            600       550
Pecans     55        60            300       350
Cashews    60        70            325       325

5.3  Additional topics on index numbers

5.3.1 The consumer price index (CPI)


This index is a very important economic indicator and is used to determine
inflation rate and cost of living. Its monthly publication by the Department of
Statistics is a matter of great public interest.


The formula used in determining the CPI in South Africa is that of Laspeyres.
$\text{CPI} = I_{P(L)} = \frac{\sum P_i Q_b}{\sum P_b Q_b} \times 100$

To determine the base year weight factor, at least 10 000 households were sampled
out of different income groups and metropolitan areas. Following international
practice, the base period used for the CPI must change at least every five years.
A monthly index for each consumer item for each area is determined by making
use of the above formula and then a combined CPI is calculated.

The formula for determining inflation rate is as follows:

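$\text{Inflation rate} = \frac{\text{CPI of current month}}{\text{CPI of corresponding month of previous year}}$

Multiplying this ratio by 100 and subtracting 100 expresses the inflation rate
as a percentage change over the year.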

5.3.2 Changing the base year


Changing the base period is necessary for fixed-base indexes under a number of
conditions:
• if the original base period is too long ago
• if the two indexes you want to compare have different base periods
• inclusion of new items in the index or disappearance of old ones
• new techniques
• abnormal influences on the base period.
The base period for index numbers can be shifted by dividing each original index
by the index of the newly designated base year and multiplying the result by
100.
$\text{New index number} = \frac{\text{original index number}}{\text{original index for new base}} \times 100$

Example 5.4

Value indexes for a car manufacturer.

              2008    2009    2010     2011     2012     2013
2010 = 100    74.2    81.4    100.0    114.0    117.0    118.9
2012 = 100    63.4    69.6    85.5     97.4     100.0    101.6

2008: $\frac{74.2}{117.0} \times 100 = 63.4$

2009: $\frac{81.4}{117.0} \times 100 = 69.6$

2010: $\frac{100.0}{117.0} \times 100 = 85.5$
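
Shifting the base is the same division applied to every period in the series. A minimal sketch, assuming the series is held in a dictionary keyed by year (the name `shift_base` is illustrative only):

```python
def shift_base(indexes: dict, new_base_period) -> dict:
    """Re-express every index number relative to the chosen new base period."""
    base_value = indexes[new_base_period]
    return {period: value / base_value * 100 for period, value in indexes.items()}


# Example 5.4: value indexes with 2010 = 100, shifted so that 2012 = 100.
car_value_index = {2008: 74.2, 2009: 81.4, 2010: 100.0,
                   2011: 114.0, 2012: 117.0, 2013: 118.9}

rebased = shift_base(car_value_index, 2012)
print({year: round(value, 1) for year, value in rebased.items()})
# {2008: 63.4, 2009: 69.6, 2010: 85.5, 2011: 97.4, 2012: 100.0, 2013: 101.6}
```

The rebased series matches the second row of the table in Example 5.4; the relative change between any two years is unaffected by the shift.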

Activity 5.4

For the following consumer price indexes, move the base to 2011.
2007 2008 2009 2010 2011 2012 2013
100 104.2 109.8 116.3 121.3 120.0 117.4

5.3.3 Link (or chain) indexes


Link or chain relatives are indexes for which the base is always the preceding
period. The index for each period is found by expressing the value for the
current period as a percentage of the value for the immediately preceding
period. Therefore, each index number represents a percentage comparison with
the preceding period. Converting the values to indexes makes it easier to
observe period-to-period comparisons or trends, even for exceptionally large
numbers.

Example 5.5

The rand value of sales for the Papillion Café between January and May is as
follows:
Month            January    February    March     April     May
Sales (R)        14 980     16 433      20 194    23 015    23 621
Link relative               109.7       122.9     114.0     102.6

• February index $= \frac{16\,433}{14\,980} \times 100 = 109.7$

• March index $= \frac{20\,194}{16\,433} \times 100 = 122.9$
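
The link relatives in Example 5.5 can be produced by pairing each value with the one before it. This is a sketch only; the function name `link_relatives` is an assumption for illustration.

```python
def link_relatives(values):
    """Each value as a percentage of the immediately preceding value.

    The first period has no predecessor, so the result is one item shorter
    than the input series.
    """
    return [current / previous * 100
            for previous, current in zip(values, values[1:])]


# Example 5.5: monthly sales (R) for January to May.
sales = [14980, 16433, 20194, 23015, 23621]
print([round(x, 1) for x in link_relatives(sales)])
# [109.7, 122.9, 114.0, 102.6]
```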

Activity 5.5

The following table lists the retail price of milk per litre between June and October
at a local grocery store. Determine the link relatives for milk.

Month              June    July    August    September    October
Price per litre    1.99    2.50    2.50      2.99         3.40

5.3.4 Percentage points change


Percentage points changes can be calculated from chain indexes or fixed-base
indexes. Changes over time are measured by subtraction and the differences
thereafter are referred to as percentage points. Percentage points can then be
expressed as a percentage change from the preceding index.

$\text{Percentage point change} = \text{current index} - \text{immediately preceding index}$

$\text{Percentage change} = \frac{\text{change in index}}{\text{original index}} \times 100$

Example 5.6

Year    Index    Percentage points change    Percentage change
2010    100
2011    120      20                          20%
2012    160      40                          33.33%
2013    188      28                          17.5%

Percentage points change:
$120 - 100 = 20$
$160 - 120 = 40$
$188 - 160 = 28$

Percentage change:
$\frac{20}{100} \times 100 = 20\%$
$\frac{40}{120} \times 100 = 33.33\%$
$\frac{28}{160} \times 100 = 17.5\%$

As the index gets larger in a fixed-base index, the same percentage change is
represented by a larger difference in percentage points. For example, a change
from 100 to 120 is the same percentage change (20%) as a change from 300 to
360, but the differences of 20 and 60 percentage points give very different
impressions. In practice the solution is to change the base year.
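
The distinction between percentage points and percentage change can be checked with a short sketch. The function name `point_and_percentage_changes` is hypothetical; the index series is the one from Example 5.6.

```python
def point_and_percentage_changes(indexes):
    """Return (percentage point change, percentage change) for each period
    after the first, both measured against the immediately preceding index."""
    changes = []
    for previous, current in zip(indexes, indexes[1:]):
        points = current - previous
        changes.append((points, points * 100 / previous))
    return changes


# Example 5.6: fixed-base index for 2010 to 2013.
for points, percent in point_and_percentage_changes([100, 120, 160, 188]):
    print(points, round(percent, 2))
# 20 20.0
# 40 33.33
# 28 17.5
```

Applying the same function to the pair 300 and 360 gives 60 percentage points but still a 20% change, which is the effect described above.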


5.3.5 Real value versus nominal value of money


Nominal value of money is the ‘face value’ of money, while the real value
is the purchasing power of money after removing the effect of price changes
caused by the inflation rate, cost of living, etc.

Real value of money is obtained by removing the inflation effect or price
changes from the nominal value. Real value is important in calculating economic
measures because it highlights the extent to which an increase in value is
caused by rising prices (cost of living) or by actual growth.

Example 5.7

If your income increased from R70 000 to R80 000 and the current inflation
rate is 6%, the nominal growth rate is 14% and the real growth rate is:

$10\,000 - \left(\frac{6}{100} \times 10\,000\right) = 9\,400$

$\frac{9\,400}{70\,000} \times 100 \approx 13\%$

TEST YOURSELF 5

Calculate all simple and composite index numbers for questions 1 to 5.


1. A lecturer’s essential commodities for teaching statistics:
                      Unit prices (R)      Quantities
            Unit      2010      2011       2010    2011
Chalk       box       5.00      5.50       4       5
Red pen     per 1     0.72      0.75       3       5
Text book   1         103.00    139.00     1       1
Aspirin     bottle    12.00     10.00      3       4
2. The Valdo Art School has compiled the following information showing
prices and quantities of the following supplies for 2012 and 2013:
Prices (R) Quantities
2012 2013 2012 2013
Brushes 4.74 4.92 50 43
Canvas 3.10 3.41 920 907
Oil 6.29 6.83 107 121


3. The table below shows the prices and annual consumption of the raw
materials used in Gauteng Breweries in 2011 and 2012:
Prices (R) Unit Quantities
2011 2012 2011 2012
Malt 49 46 10 874 15 116
Hops 512 724 732 696
Sugar 46 51 1 865 2 486
Wheat flour 31 27 873 1 093

4. Tixif Limited sells three types of chain saws. Company records showed the
prices (R’000) and quantities sold as follows:
Price (R’000) Quantities
2012 2013 2012 2013
X 30 40 22 30
Y 50 60 31 40
Z 120 99 8 12
5. Mr Hiram, a pensioner, has kept a record of the costs of certain items
purchased weekly:
Price per unit (R) Quantities Purchased
2012 2013 2012 2013
Coffee 12.00 15.00 30 32
Cookies 5.60 4.99 7 9
Sugar 7.50 8.20 20 24
6. Mr Rolling has been offered a job in Cape Town with a salary of R123 500 a
year. The cost of living index is 132. If he presently earns R100 000 a year
in Johannesburg with a cost of living index of 120, will he be financially
better off in the new job?
7. The CPI values for the first eight months of 2012, with 2010 = 100, were:
233 236 240 243 248 249 252 255
Shift the base of the index to June 2009 and determine the purchasing
power of the rand per month for both index series.
8. The producer price indexes with 2010 = 100, for the previous twelve
months were:
181 201 221 227 234 238 245 249 260 268 290 300


Calculate the percentage point increases and the percentage increases in the
index numbers.
9. The reported new cases of tuberculosis in a busy hospital were as follows:
January February March April May June
239 311 289 264 321 199

Calculate the chain indexes and then the percentage point and percentage
increases.
10. The following figures relate to the library expenditure (R) of a small town.
Also given is the retail price index per year:
Year 2010 2011 2012 2013
Expenditure (R) 4 800 5 230 5 800 6 700
Retail index 103 110 120 128

If the retail index is taken into consideration, was there a real increase in the
library expenditure?



UNIT 6 Summarising bivariate data: simple regression and correlation analysis

This unit deals with methods of summarising data consisting of observations on
two variables (simple or bivariate data) presented as ordered pairs. The purpose
is to understand if there is a linear relationship between two variables and what
you can do if a relationship exists.

After completion of this unit you will be able to:


• explain the purpose of regression and correlation analysis
• compute and explain the meaning of the coefficients of correlation and
determination
• compute the regression equation and use this measure to do estimates
• compute correlation by making use of ranking.

Regression and correlation analysis are statistical tools used to study the
relationship between two variables, one of which is dependent and the other
independent. They are used to determine:
• whether there is a relationship between the two variables
To describe the relationship between the two variables we first graphically
represent the data in a scatter diagram. This visual representation can give
an immediate impression of a set of data; it will illustrate whether there is a
relationship and also suggest whether the relationship is linear, non-linear,
positive or negative. The strength of the relationship may be concluded
tentatively. Note that this unit deals only with linear relationships. Linear
means that a straight line can be used to represent the data pairs.
• how good that relationship is
The correlation coefficient measures how good this relationship is by making
use of a single number.
• how the relationship can be used to make predictions
Once the scatter diagram and correlation coefficient indicate that a linear
relationship exists between the two variables we proceed to find a linear equation
that describes the relationship between the two variables. This equation can be
used to make predictions within the given range of the data (interpolation).

6.1 Response variable (y) and explanatory variable (x)


Each observation of bivariate data can be thought of as a data pair of the form
(x, y). The x is the explanatory (or independent) variable and y is the response
(or dependent) variable. The response variable y is the variable whose value
depends on, or can be explained by, the value of the explanatory or independent
x variable. When we analyse data on two variables, the first step is to distinguish
between the response variable and the explanatory variable.

Activity 6.1

Identify the independent x and dependent y variables in each of the following:


1. Size of advertisements of companies in the yellow pages and the number of
calls to the business that were generated by the advertisement.
2. Number of hours of training per employee and revenue of the company.
3. Hours of training to use Excel and number of errors made using the
program.
4. Daily temperature and ice-cream sales.
5. Body mass and calorie intake.
6. Circulation of a magazine and advertising charges.
7. The hardness and tensile strength of die-cast aluminium.
8. Wattage of a heater and the effective heating area.
9. Number of cavities and sugar intake of children.
10. Number of police on the streets and number of crimes.

6.2  Scatter diagram


A scatter diagram (or scatter plot) is a graph of data from two quantitative
variables (bivariate data). This graph can identify the type of relationship that
might exist between two variables.

Steps
1. Collect pairs of data (x, y). The data are paired in a way that matches each
value from one data set with a corresponding value from a second data set.


2. Select which variable is the dependent (y) variable and which is the
independent (x) variable. The label y goes to the variable which we want to
predict. The other variable is then labelled x.
3. Arrange the data in two columns, x and y.
4. Draw a set of axes.
5. The horizontal axis represents the x variable and is scaled so that any x
value can be easily located.
6. The vertical axis represents the y variable and is scaled so that any y value
can be easily located.
7. Each pair of observations (x, y) is plotted as a point. That is where a vertical
line from the value on the x axis meets a horizontal line from the value on
the y axis.
8. The points are not connected.
9. Scatter plots can take on the following patterns:
• The plot can show no relationship, because no pattern can be identified.
• The plot can show a positive relationship because the dots start at the bottom
left and move upwards to the top right. Although the data points do not fall
exactly on a line, they appear to cluster about a line. A positive relationship
means that if the x variable increases, the y variable will also increase.
• The plot can show a negative relationship because the dots start at the
top left and move downwards to the bottom right. A negative relationship
means that if the x variable increases, the y variable will decrease.
• If all the points fall exactly along a straight line, in a negative or positive
direction, we say the relationship is perfect.
• Non-linear relationships are beyond the scope of this unit.
[Figure: four example scatter diagrams of y against x, showing no correlation,
negative correlation, positive correlation and non-linear correlation.]

Example 6.1

You timed how long it takes 10 workers to assemble an item. It was possible for
you to match these times with the length of the workers’ experience (in months).
The results obtained are shown below:
Person Experience (months) x Time (min) y
A 2 27
B 5 26
C 3 30
D 8 20
E 5 22
F 9 20
G 12 16
H 16 15
I 1 30
J 6 19
Total 67 225

Scatter plot for the data


The time taken to assemble the item is the y variable because time depends on
the experience (x) of the worker.

[Figure: scatter plot "Time taken to assemble an item", with Experience (months)
on the x axis and Time (min) on the y axis.]

The scatter plot shows a negative relationship, which means the more experienced
workers take a shorter time to assemble the product.

Activity 6.2

During the baking of a certain type of bread roll, each bread roll goes through
a series of heat processes. The length of time spent under this heat treatment
is related to the lifespan of the bread rolls. A sample of eight bread rolls that
underwent different baking times was selected and the life span (in hours) of
each was recorded:
Length of time 18 13 18 15 10 12 8 4
Life span 23 20 18 16 14 11 10 7

Draw a scatter diagram and interpret the relationship in the context of the given
data.

6.3 Correlation analysis (r)


A correlation exists between two variables when one of them is related to, or
can be influenced by, the other in some way.
The linear correlation coefficient is a numerical measure to describe the
degree of strength and direction by which one variable is related to another. A
linear relationship means that, when graphed, the points approximate a straight
line pattern.
We use an equation, known as Pearson’s correlation coefficient, to measure
this strength. This correlation coefficient is represented by the small letter r.


Steps
1. List the values of the x and the y variables in two columns.
2. Sum the x values ($\sum x$) and sum the y values ($\sum y$).
3. Square each of the x values and add the column ($\sum x^2$).
4. Square each of the y values and add the column ($\sum y^2$).
5. Multiply each x with its corresponding y value and add the products ($\sum xy$).
6. Substitute these values into the formula for r and determine the value of r.

$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

6.3.1 Characteristics of the linear correlation coefficient


1. When the slope of the scattered points is negative, the r value is negative and
when it is positive, the r value is positive. Thus, the sign of r indicates the
direction of the relationship between the variables x and y.
2. A correlation coefficient can range in value from $-1$ to $+1$. The closer to
$-1$ or $+1$, the better the relationship. If r is close to 0, there is little or no
linear correlation between the two variables.
3. The strength of the correlation is not dependent on direction: $r = 0.95$ and
$r = -0.95$ are equal in strength. The absolute value of the coefficient
reflects the strength of the correlation; a correlation of $-0.7$ is stronger
than a correlation of $+0.3$.
4. A correlation coefficient is a measure of association without a unit.

6.3.2 Interpreting a correlation coefficient (r)


• If an inverse (negative) relationship exists, that is, if y decreases as x increases,
then r will fall between 0 and $-1$.
• If there is a direct (positive) relationship, that is, if y increases as x increases,
then r will fall between 0 and $+1$.
• If there is no relationship between x and y, then $r = 0$.
• This measure enables us to make statements such as: the correlation is strong,
weak, etc.

Size of r        General interpretation
±(0.9 to 1.0)    Very strong relationship
±(0.8 to 0.9)    Strong relationship
±(0.6 to 0.8)    Moderate relationship
±(0.2 to 0.6)    Weak relationship
±(0.0 to 0.2)    Very weak or no relationship


• If the association between two variables is strong, then knowing the one
variable helps a lot in predicting the other. But when there is a weak association,
information about one variable does not help much in estimating the other.

6.3.3 The coefficient of determination ($r^2$)


This is a more conservative measure of the relationship because it enables us to
calculate the proportion of the total variation in the dependent y variable that is
explained by the corresponding variation in the independent x variable.
To determine the value of this measure, simply square the correlation
coefficient and multiply the answer by 100.
This answer will always be positive, within the range 0 to 100%. Because
squaring removes the sign of r, the coefficient of determination cannot show
the direction of the relationship between the variables.

$\text{coefficient of determination} = r^2 \times 100$

Example 6.2

You timed how long it takes 10 workers to assemble an item. It was possible for
you to match these times with the length of the workers’ experience. The results
obtained are shown below:
Person   Experience (x)   Time (min) (y)   xy      x²     y²
A        2                27               54      4      729
B        5                26               130     25     676
C        3                30               90      9      900
D        8                20               160     64     400
E        5                22               110     25     484
F        9                20               180     81     400
G        12               16               192     144    256
H        16               15               240     256    225
I        1                30               30      1      900
J        6                19               114     36     361
Total    67               225              1 300   645    5 331
1. Correlation coefficient:

$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

$r = \frac{10(1\,300) - (67)(225)}{\sqrt{[10(645) - (67)^2][10(5\,331) - (225)^2]}} = -0.90$

This is a very strong negative correlation.


2. Coefficient of determination:
$r^2 \times 100 = (-0.90)^2 \times 100 = 81\%$
81% of the total variation in the time taken to produce an item can be
explained by the variation in experience. The remaining 19% is explained
by other factors.
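
The arithmetic of Example 6.2 can be verified with a short sketch that follows the computational formula for r step by step. The function name `pearson_r` is an assumption for illustration; only the standard library is used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient from the sums n, Σx, Σy, Σxy, Σx², Σy²."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of observations")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Example 6.2: experience in months (x) and assembly time in minutes (y).
experience = [2, 5, 3, 8, 5, 9, 12, 16, 1, 6]
time_taken = [27, 26, 30, 20, 22, 20, 16, 15, 30, 19]

r = pearson_r(experience, time_taken)
print(round(r, 2))  # -0.9 (reported as -0.90 in the example)
```

Squaring the unrounded r gives a coefficient of determination slightly above the 81% obtained in the example, which squares the rounded value −0.90.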

Activity 6.3

During the baking of a certain type of bread roll, each bread roll goes through a series
of heat processes. The length of time spent under this heat treatment is related to the
lifespan of the bread rolls. A sample of eight bread rolls that underwent different
baking times was selected and the life span (in hours) of each roll was recorded.
Length of time 18 13 18 15 10 12 8 4
Life span 23 20 18 16 14 11 10 7

Calculate the correlation coefficient and the coefficient of determination.


Interpret your answer in the context of the given data.

6.4 Regression analysis


Once the scatter diagram and correlation coefficient indicate that a linear relation
exists between the two variables, the next step is to determine the equation of
the straight line that best describes the pattern of the relationship between the
two variables. This equation, known as the regression equation, can be used to
predict a dependent variable (y) if the independent variable (x) is known.
‘Best’ refers to how close the predictions of y ($\hat{y}$) are to the actual values of y.
A linear relationship between the two variables means that the equation will
result in a straight line when plotted on a graph.

6.4.1 Formulating the regression equation


Any linear function involving two variables can be expressed in the form:
$\hat{y} = a + bx$

Where:
$\hat{y}$ = estimated y value for a given x value
a = intercept on the y axis
b = the slope (the average change in y for each change of 1 unit in x)


Steps
1. Obtain a random sample of n data pairs (x, y), with x the independent
variable and y the dependent variable.
2. Use the data pairs to calculate n, Σx, Σy, Σxy and Σx².
3. Calculate a and b values in the equation by making use of the method of least
squares: the least squares principle states that the method used to determine
the regression equation must be such that the sum of the squares of the
differences between each actual y value and the corresponding estimated y
value is a minimum.
4. The b value is the slope of the straight line equation. The slope is the amount
by which y increases or decreases when x increases by 1 unit.

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

5. The a value is the y intercept, that is where x = 0 and the straight line crosses
the y axis.

$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$

6. Substitute the a and the b values into the regression line equation:
ŷ = a + bx

6.4.2 Using the regression equation to make predictions

• The regression equation ŷ = a + bx allows you to use the independent variable
x to make predictions for the dependent variable y.
• Substitute the independent x value for which you require a prediction into the
regression equation and calculate the estimated y value (ŷ).

6.4.3 Plot the regression line on the scatter diagram

Steps
1. For predictions using a graph, a straight line is fitted to the data. The
best fitting straight line can be obtained by making use of the equation:
ŷ = a + bx
2. Since we are dealing with a linear function, we need to estimate only two
points; the rest of the ŷ values will all fall on that straight line.

3. Choose any two x values from the x scale on the scatter diagram. Substitute
the two chosen values for x into the regression equation and obtain the two
corresponding values for ŷ.
4. Plot the two coordinate points on the same axis as the scatter diagram.
5. Use a ruler and draw a straight line through the two points.
6. When the estimated ŷ is plotted against x and the points are connected, the
result is called the regression line or the line of best fit.
7. To do a prediction for a specific x value, find the required x on the x axis and
draw a vertical line to the regression line. From this point draw a horizontal
line to the y axis and read the estimate from the y axis.

Note: Estimated y values are meaningful only for x values in (or close to) the
range of the given data.

Example 6.3

You timed (in minutes) how long it takes 10 workers to assemble an item. It was
possible for you to match these times with the length of the workers’ experience
(months). The results obtained are shown below. Develop the regression equation
and estimate the assembly time for a worker with four months’ experience and
for a worker with 10 months’ experience.
Person   Experience x   Time y      xy      x²      y²
A              2          27        54       4     729
B              5          26       130      25     676
C              3          30        90       9     900
D              8          20       160      64     400
E              5          22       110      25     484
F              9          20       180      81     400
G             12          16       192     144     256
H             16          15       240     256     225
I              1          30        30       1     900
J              6          19       114      36     361
Total         67         225     1 300     645   5 331


1. Regression equation:
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(1\,300) - (67)(225)}{10(645) - (67)^2} = -1.06$$

$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{225}{10} - (-1.06)\,\frac{67}{10} = 29.60$$

ŷ = a + bx

ŷ = 29.60 + (−1.06)x

2. Interpret a and b:
The a value in the equation represents the y intercept, that is the point on
the y axis where x = 0. In this equation it means that a worker with no
experience will take 29.6 minutes to assemble the product.
The b value represents the slope, which means that for every additional month
of experience, a worker will take 1.06 minutes less to assemble the product.

3. Estimates:
For a worker with four months’ experience:
ŷ = 29.60 + (−1.06)x
ŷ(x = 4) = 29.60 + (−1.06)(4) = 25.36 min

For a worker with 10 months’ experience:
ŷ(x = 10) = 29.60 + (−1.06)(10) = 19 min

4. Place the regression line on the scatter plot.


Plot the coordinates (4; 25.36) and (10; 19) obtained from the estimates on
the scatter diagram and join the two points with a straight line.
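
The whole example can be reproduced from the raw data pairs. The sketch below is illustrative only: `fit_line` and `predict` are names chosen here, and the 30-month case is an extra input added to show what happens outside the sampled range of 1 to 16 months.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Experience (months) and assembly time (minutes) for workers A to J.
experience = [2, 5, 3, 8, 5, 9, 12, 16, 1, 6]
minutes = [27, 26, 30, 20, 22, 20, 16, 15, 30, 19]

a, b = fit_line(experience, minutes)

def predict(x):
    """Estimated time, plus a flag saying whether x lies inside the sampled range."""
    inside = min(experience) <= x <= max(experience)
    return a + b * x, inside

for months in (4, 10, 30):
    estimate, inside = predict(months)
    note = "" if inside else " (outside the data range: treat with caution)"
    print(f"{months} months: {estimate:.2f} min{note}")
```

Because the slope is carried unrounded here, the intercept comes out at about 29.59 rather than 29.60, and the estimates agree with the worked answers to within about 0.01. The 30-month input yields a negative time, which illustrates why estimates are meaningful only in or near the range of the given data.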

[Figure: Time taken to assemble an item. Horizontal axis: Experience (months), 0 to 20; vertical axis: Time (min), 0 to 35.]

Activity 6.4

During the baking of a certain type of bread roll, each bread roll goes through
a series of heat processes. The length of time spent under this heat treatment
is related to the lifespan of the bread rolls. A sample of eight bread rolls that
underwent different baking times was selected and the life span (in hours) of
each roll was recorded.
Length of time 18 13 18 15 10 12 8 4
Life span 23 20 18 16 14 11 10 7

If a bread roll spends 16 hours under heat treatment, how long do you expect
it will remain fresh? Do your estimate using the regression equation and the
regression line. Give the meaning of the slope.

6.5  Spearman rank correlation coefficient (rs)


This coefficient measures the strength of the relationship between two variables
on the basis of their ranks instead of values (ordinal data) and can be used in
problems where one or both variables can be ranked even though they cannot be
measured on a numerical scale. It can be interpreted in a manner similar to the
correlation coefficient r, but because a great deal of the data gets lost, it provides
a less reliable result.

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$


Steps
1. Rank the x and the y variables: assign numbers from 1 onwards to the data
values, starting with the smallest (or largest) value up to the largest (or smallest)
value. Keep each x together with its y. Remember to use the same type of ranking
for x and y: from high to low or from low to high. If two values are the same,
they are first assigned ranks (say 2 and 3) and then the average of the ranks is
determined (2.5). That average is then assigned to each appropriate value.
2. Calculate the difference between ranks of the two variables (d).
3. Square these differences (d²) and add the column.
4. Substitute the required values into the formula and calculate the rank-order
correlation coefficient.
5. Interpret the coefficient.

Example 6.4

The safety officer of a company wants to know if experience influences the quality
of an employee’s work. She selects 10 employees at random and records their
years of work experience and their quality rating as assessed by their supervisors.

Quality rating: 7 = excellent and 1 = poor

Assume that the employees with the most years of experience will be the most highly
rated. To keep the type of ranking the same, the least years of experience will be
ranked lowest. Note that the y ranking (quality of work) was done by the safety officer.
Employee   Experience x   Rating y (y code)   x code      d       d²
1                1                2              1         1       1
2               17                5              8        −3       9
3               20                5              9        −4      16
4                9                6              4.5       1.5     2.25
5                2                3              2         1       1
6               13                5              7        −2       4
7                9                4              4.5      −0.5     0.25
8               23                6             10        −4      16
9                7                3              3         0       0
10              10                6              6         0       0
Total                                                             49.5

$$r_s = 1 - \frac{6(49.5)}{10(10^2 - 1)} = 0.70$$

The correlation is moderate and positive. This means that the more
experience the employees have, the better their rating.
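
The ranking and the formula can be traced in a few lines of Python. This is a sketch, not a general-purpose routine: `average_ranks` is a helper named here for illustration, and, following the d values printed in the table, the rating is used directly as the y code while only the years of experience are ranked.

```python
def average_ranks(values):
    """Rank from smallest (1) upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # lowest rank occupied by v
        count = ordered.count(v)                # how many values tie with v
        ranks.append(first + (count - 1) / 2)   # mean of the tied ranks
    return ranks

experience = [1, 17, 20, 9, 2, 13, 9, 23, 7, 10]   # x, years
rating = [2, 5, 5, 6, 3, 5, 4, 6, 3, 6]            # y, taken as the y code as in the table

x_code = average_ranks(experience)
d_squared = [(y - x) ** 2 for x, y in zip(x_code, rating)]

n = len(experience)
r_s = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))

print(x_code)
print(sum(d_squared), round(r_s, 2))
```

Ranking the ratings as well, with ties averaged by the same helper, is the other common convention; on this data it gives a slightly higher coefficient of about 0.73.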

Activity 6.5

During the baking of a certain type of bread roll, advertised as having a long
shelf-life, each bread roll goes through a series of heat processes. The length of
time spent under this baking process is related to the shelf-life of the bread rolls.
A sample of 8 bread rolls that underwent different baking times was selected,
and the shelf-life (in hours) of each roll was recorded.
Length of time 18 13 18 15 10 12 8 4
Shelf-life 23 20 18 16 14 11 10 7

Determine Spearman’s rank correlation coefficient and interpret your answer.

TEST YOURSELF 6

For the data sets in questions 1 to 9:


• identify the dependent and independent variables
• determine if there is a relationship between the two variables using a scatter
plot
• measure the strength of the relationship using Pearson’s correlation coefficient
• determine the coefficient of determination and interpret its meaning in the
problem
• assume a linear relationship and do the required estimate using the regression
equation
• plot the line-of-best-fit on the scatter plot
• interpret the meaning of the y intercept and the slope
• calculate and interpret Spearman’s rank correlation coefficient.
1. Potato chip lovers do not like soggy chips, so it is important to find
characteristics of the production process that produce chips with an
appealing texture. The following sample data on frying time (in seconds)
and moisture content (%) was selected:


Frying time 65 50 35 30 20 15 10 5
Moisture content 1.4 1.9 3.0 3.4 4.2 8.1 9.7 16.3
Predict the moisture content of the chips after 40 seconds’ frying time.
2. From the following data determine the resting pulse rate that you would
expect from someone exercising for a daily average of (a) 45 minutes (b) 15
minutes and (c) 2.5 hours:
Daily exercise (min) 20 30 60 10 100 0 120 160 160 180
Pulse/min. 75 70 70 85 50 90 60 52 48 64
3. The following sample data measures levels of anxiety before a test and test
marks obtained for the test.
Anxiety 23 14 14 0 17 20 20 15 21
Test % 43 59 48 77 50 52 46 51 51
What test marks do you expect if the anxiety level was 12?
4. A chemistry lab testing food has 7 divisions that do different chemical tests
on food products. The number of hours devoted to safety training and the
number of hours lost due to industry-related accidents were recorded for
each division:
Safety training 10 19 30 45 50 65 80
Hours lost due to accidents 80 65 68 55 35 10 12
After 60 hours of training, how many hours do you expect to lose due to
accidents?
5. The Rip-Off Vending Machine Company operates coffee vending machines
in several office buildings. The company wants to study the relationship that
exists between the number of cups of coffee sold per day and the number
of persons working in each office. Data for this study was collected by the
company and is presented below:
Number of cups sold 10 20 30 40 30 20 40 40 50 10 40 20
Number of persons 5 6 14 19 15 11 18 22 26 4 23 10
Predict the number of cups of coffee that you would expect to sell if there
are 45 people working in an office.
6. The following data reflects the family income and food expenditure (in
R’000) of a sample of 10 low-income families:


Income 24 15 18 12.6 8 9.5 21 11.4 6.4 13.2
Expenditure 3.6 2.9 2.9 2.6 2 2.4 3 2.5 1.8 2.4
Compute all relevant statistics by making use of the step-by-step procedure
and estimate food expenditure of a family with an income of R20 000.
7. When buying almost any item, it is often advantageous to buy it in as large
a quantity as possible. The unit price is usually less for the larger quantities.
The data shown in the table was obtained to test this theory:
Number of units 1 3 5 10 15
Cost per unit 55 52 48 32 25
Estimate the price if you buy 14 items.
8. During the making of certain electrical components each item goes through
a series of heat processes. The length of time (minutes) spent in this heat
treatment is related to the useful life (hours) of the components. To find the
nature of this relationship a sample of 10 components was selected from the
process and tested to destruction.
Time in process 25 27 25 26 31 30 32 29 30 44
Length of life 2 005 2 157 2 347 2 239 2 889 2 942 3 048 3 002 2 943 3 844
Predict the useful life of a component that spends 33 minutes in process.
9. The following data shows the present maintenance costs and age of a sample
of eight similar machines used in a clothing factory:
Age 1 3 4 4 6 7 7 8
Maintenance (R) 200 550 650 800 1 150 1 100 1 300 1 500

Compute all relevant statistics by making use of the step-by-step procedure
and estimate the maintenance cost of a five-year-old machine.


10. Below are the rankings of the top 10 products produced by Peter’s Party
Products for last year and this year:
Product Last year This year
Crackers 1 3
Hats 3 1
Masks 4 2
Balloons 6 10
Whistles 7 9
Streamers 8 7
Flags 9 8
Face paint 2 4
Joke food 5 6
Joke cards 10 5
Compute the Spearman correlation coefficient and interpret your answer.
11. Ten sales agents of a company had the following number of years of service:
Agent A B C D E F G H I J
Years 8 6 4 12 5 3 1 14 9 10

The manager of the company arranged the agents in the following order,
from most excellent (H) to least excellent (F).
H I D J A E C B G F

Determine the Spearman rank correlation coefficient between years of
service and excellence.

UNIT 7 Time series

This unit discusses the general use of forecasting in business and several methods
that are available for making forecasts.

After completion of this unit you will be able to:


• explain the purpose of time series analysis
• explain the components of the multiplicative model
• use linear models to analyse and project the trend of a time series
• measure the seasonal effect in a time series
• use time series in forecasting.

Forecasting is the science of predicting the future. It is used in the
decision-making process to help business people reach conclusions about buying,
selling, producing and many other actions.
Time series analysis is a forecasting tool. The objective is to analyse
how observed data changes over time in order to detect patterns that will enable
us to predict future values.
Time series analysis helps us cope with uncertainty about the future.
Time series data is numerical data gathered on a given characteristic over a
period of time at regular intervals.

7.1 Components of a time series


It is generally believed that the factors that have influenced data in the past and
the present will continue to do so in more or less the same way in the future. The
primary goal of time series analysis is to isolate and measure these influencing
factors or components for forecasting purposes.
The observed time series data consists of four separate components – trend,
cyclical, seasonal and irregular.
• Secular trend (T) is the underlying long-term movement (increase or decrease)
over time in the recorded data values and is usually the result of long-term
factors such as changes in the population size, demographic characteristics of
the population, technology and consumer preferences.
• Cyclical variations (C) are medium-term changes caused by circumstances
which repeat in cycles and cause upward and downward swings, not of
equal length, throughout the series. In business, cyclical variations are often
correlated with the general business cycle of prosperity, recession, depression
and recovery.
• Random or irregular variations (I) occur over short intervals and are
unpredictable with no pattern to their behaviour. These are disturbances due
to ‘everyday’ unpredictable influences, such as weather conditions, illness,
political unrest, crime, war and transport breakdowns.
• Seasonal variations (S) are short-term fluctuations that tend to repeat
themselves over days, weeks, months or quarters.
Examples of seasonal variation:
– Sales of ice cream will be higher in summer than in winter.
– A doctor can expect a substantial increase in the number of flu cases every
winter.
– Shops might expect higher sales shortly before Christmas.
– The telephone network may be heavily used at certain times of the day (such
as mid-morning and mid-afternoon) and much less used at other times (such
as in the middle of the night).

The classical model used by economists, also known as the multiplicative
model, provides the clearest explanation of the four components that make up
the time series and their relation to each other. This model assumes that each
data observation (y) is made up of a combination of the four components and is
represented by the formula:

y = T × C × S × I

Note: The trend component (T) is stated in the same units as y, while the
remaining three components are expressed as percent adjustments. A value
above 100 indicates an above average effect for the component and a value
below 100 indicates a below average effect.
For a time series composed only of annual data there is no seasonal component.
In that case the time series model becomes:

y = T × C × I


Periodically reported time data, such as monthly, quarterly, weekly or daily,
includes the influence of all four components of the time series.
The process of division can remove or isolate any component of this model
and is called decomposition.

Note: Analysis of cyclical and irregular influences on data is useful for describing
past variations but, because of their unpredictability, their value in forecasting
is very limited. Instead, a number of business indicators are used to forecast
cyclical turning points. Predicting cyclical and irregular variations requires
techniques beyond the scope of this unit.
If we ignore the C and I components, since by definition they can’t be predicted,
the forecasting model will become:

ŷ = T × S
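
A small numerical sketch may help to show how the percentage components act on the trend. All figures below are hypothetical and chosen only for illustration; `compose` is a name introduced here, not a term from the text.

```python
def compose(trend, *percent_components):
    """Multiplicative model: the trend times each percentage adjustment (100 = average)."""
    value = trend
    for pct in percent_components:
        value *= pct / 100
    return value

# Hypothetical quarter: trend 1 200, seasonal 110, cyclical 95, irregular 102.
y = compose(1200, 110, 95, 102)

# Decomposition by division: remove T, C and I to isolate the seasonal component.
seasonal = y / 1200 / (95 / 100) / (102 / 100) * 100

# Forecast that ignores C and I: y-hat = T x S.
forecast = compose(1200, 110)

print(round(y, 2), round(seasonal, 1), round(forecast, 1))
```

Dividing the observation by the other three components recovers the seasonal percentage of 110, which is the decomposition idea described above.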

Activity 7.1

The sales (R’000) for Turtle Toys have been analysed and the values of the four
components have been determined for the preceding four quarters. Find the
missing values in the table below, assuming a multiplicative model.
Sales (y) Trend (T) Seasonal (S) Cyclical (C) Irregular (I)
Winter 1 000 50 107 101
Spring 820 1 100 70 105
Summer 988 105 98
Autumn 2 623 1 300 200 97

7.2 Historigram
The standard graph to portray the behaviour of data over time is a line graph,
known as a historigram.
1. Time is the independent variable (x) and is measured along the horizontal
axis.
2. The variable of interest is the dependent variable (y) and is measured on the
vertical axis.
3. Plot the time series values (on the vertical y  axis) against time (on the
horizontal x  axis) as single points, and join the points by straight-line
segments.


7.3 Time-series decomposition

7.3.1 Trend analysis (T)


The trend can be thought of as the core component of the time series model
about which the other components, cyclical (C), seasonal (S) and irregular (I)
variations fluctuate.
The objective of isolating trend is to enable the underlying movement of the
data to be highlighted by making use of a straight line passing through the data
with a positive or negative slope. Thus, a business sales trend will normally show
whether sales are moving up or down (or remaining static) in the long term.
This component is found by identifying separate trend (T) values, each
corresponding to a time point. Methods that can be used to extract a trend from
a set of time series values are:
• method of least squares
• method of semi-averages
• method of moving averages.

Method of least squares

Steps
1. The observed time series data is the dependent y variable since this is the
variable that we want to predict.
2. The time variable is the independent x  variable. The time periods are
translated into x values by using a simple coding process: 1 represents the
first time period, 2 the second time period, and so on, until the final time
period. Assume that the x code falls in the middle of the time period it
represents.
3. Having established the values for the x and y variables, the following
formulae can be used to identify the trend line through the data:
ŷ = a + bx
where ŷ is the trend value for a given time period.

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

where b = slope of the trend line.

$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$

where a = y-intercept.
4. To calculate the trend values (ŷ) with the trend line equation, substitute the
appropriate x code into the equation and compute the value of ŷ.
5. Plot the trend values (ŷ) together with the corresponding time periods on
the time-series graph and draw a line through the points. This is the trend
line. By extending this line, future values can be read from it.
6. To forecast future values using the equation, substitute the x with an
appropriate code for the year of forecast (extrapolation) and calculate ŷ.

Example 7.1

The following table shows the income (R’000) of Super 10 Taxis by year.
Year x-code Income y xy x² ŷ
2009 1 28 28 1 29.0
2010 2 31 62 4 30.3
2011 3 34 102 9 31.6
2012 4 30 120 16 32.9
2013 5 35 175 25 34.2
Total 15 158 487 55 158

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{5(487) - (15 \times 158)}{(5 \times 55) - 15^2} = 1.3$$

The b value is the slope, which means that for every additional year the income
will increase by R1 300.
$$a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right) = \frac{158}{5} - 1.3\left(\frac{15}{5}\right) = 27.7$$

The a value is the y intercept, which means that the trend line will cross the
y axis at point 27.7.

ŷ = a + bx = 27.7 + 1.3x

To forecast the income for the year 2014 we need a code for 2014. The x-unit
is one year, which means that for every year you move forward you move one in
the x-code. Therefore the code for 2014 will be 6 and the forecast for 2014 will be:

ŷ(2014) = 27.7 + 1.3x
= 27.7 + 1.3(6)
= 35.5, that is R35 500
[Figure: Income for Super 10 Taxis. Horizontal axis: Years, 2009 to 2013; vertical axis: Income (R’000), 20 to 34.]

To obtain the coordinates to plot the trend line, substitute the x in the trend
formula with the appropriate x code for that year to calculate the trend (ŷ) values
for each of the given years:
ŷ(2009) = 27.7 + 1.3(1) = 29, giving the point (1; 29)
ŷ(2014) = 27.7 + 1.3(6) = 35.5, giving the point (6; 35.5)
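
The trend calculation for Super 10 Taxis can be retraced as follows. This is a minimal sketch that codes the years 1 to 5 as in the table; the helper name `trend` is our own.

```python
years = [2009, 2010, 2011, 2012, 2013]
income = [28, 31, 34, 30, 35]              # R'000

# Code the time periods 1, 2, 3, ... as described in the steps.
x = list(range(1, len(years) + 1))
n = len(x)

sum_x, sum_y = sum(x), sum(income)
sum_xy = sum(xi * yi for xi, yi in zip(x, income))
sum_x2 = sum(xi ** 2 for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

def trend(year):
    """Trend value for a calendar year, using the same coding (2009 -> 1)."""
    code = year - years[0] + 1
    return a + b * code

for year in years + [2014]:
    print(year, round(trend(year), 1))
```

The printed values should match the ŷ column of the table for 2009 to 2013, followed by the 2014 forecast.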

Activity 7.2

The following table shows the number of traffic tickets issued by the Alberton
Traffic Department for the first six months of the year.
Month No of tickets
January 120
February 120
March 100
April 90
May 130
June 150


1. Use the method of least squares to calculate the trend values for the period
under question.
2. Forecast the number of tickets that the Traffic Department can expect to
issue for the next three months.
3. Graph the time series as well as the trend line.

Method of semi-averages
This technique involves the calculation of two averages which, when plotted
on a graph as two separate points and joined up, form a straight line (or trend
line).

Steps
1. Split the data into two equal groups. If there is an odd number of years,
simply omit the middle year.
2. Calculate the arithmetic mean for each group.
3. Plot the two means at the midpoints of the time intervals covered by the
respective groups.
4. Join these two points with a straight line. This is the trend line.

To forecast:
5. Extend the straight line up to the required forecast period and use the graph
by reading the value from the y axis; or
6. Calculate the average increase per year by determining the difference
between the two averages and divide this difference by the number of years
between the two averages. Add this increment an appropriate number of
times to the mean of the latter group.

Note: It is important that the two groups in question have an equal number of
data values.


Example 7.2

The table below shows the sales for Yolanda’s Coffee Shop.

Year   Sales (R)   Semi-average
2008       80
2009       95      275 ÷ 3 ≈ 92
2010      100
2011      110      (middle year, omitted)
2012      130
2013      145      425 ÷ 3 ≈ 142
2014      150

1. The trend value for 1 July 2009 = 92
(1 July 2009 is the middle of the time period starting from 1 January 2008
up to 31 December 2010.)
2. The trend value for 1 July 2013 = 142
3. Average increase per year = (142 − 92) ÷ 4 = 12.5
(4 is the number of years between the middle of 2009 and middle of 2013)
4. To forecast for 2015: the time period from the middle of 2013 up to the
middle of 2015 is two years: 142 + 2(12.5) = R167
5. Trend line: Plot 92 together with 2009 and 142 together with 2013. Join
the two points with a straight line.
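
The same steps can be written out as a short script. This is a sketch under the assumptions of the example: yearly data, and an odd number of years so that the middle year is dropped. The name `forecast` is introduced here for illustration.

```python
years = [2008, 2009, 2010, 2011, 2012, 2013, 2014]
sales = [80, 95, 100, 110, 130, 145, 150]

half = len(sales) // 2
first, second = sales[:half], sales[-half:]          # an odd series drops its middle year
first_years, second_years = years[:half], years[-half:]

mean_1 = sum(first) / half
mean_2 = sum(second) / half
mid_1 = sum(first_years) / half                      # 2009: centre of the first group
mid_2 = sum(second_years) / half                     # 2013: centre of the second group

increase_per_year = (mean_2 - mean_1) / (mid_2 - mid_1)

def forecast(year):
    """Extend the trend line from the second semi-average."""
    return mean_2 + (year - mid_2) * increase_per_year

print(round(mean_1), round(mean_2), round(increase_per_year, 1), round(forecast(2015)))
```

The script keeps the unrounded means (about 91.67 and 141.67); the difference between them is still 50, so the yearly increase and the rounded 2015 forecast agree with the worked answer.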
[Figure: Sales for Yolanda’s Coffee Shop. Horizontal axis: Years, 2008 to 2014; vertical axis: Sales (R), 0 to 180.]


The trend line shows an increase over the time period.

Activity 7.3

The following table shows the number of traffic tickets issued by the Alberton
Traffic Department for the first six months of the year. Forecast the number of
traffic tickets for July by making use of the method of semi-averages and portray
the trend line on a graph.
Month No of tickets
January 120
February 120
March 100
April 90
May 130
June 150

Method of moving averages


This method shows the trend in a time series by eliminating any obscuring
variations caused by seasonal, cyclical or irregular fluctuations by averaging a
fixed number of periods. One of the drawbacks is that values for some periods
are lost at the beginning and end of the series.
A moving average is an artificially constructed time series in which the value
for a given period is replaced by the mean of that value and the values of some
preceding and succeeding time periods.

Steps
1. If you calculate an odd-numbered moving average (e.g. 3, 5 or 7), there will
be a middle time point opposite which to record the answers. For example, in
calculating a three-year moving average, you will start by adding the y-values
for the first three years and dividing the answer by three. This answer will
correspond with the middle of the second year. Move down one year and
calculate the average for years two to four. This answer will correspond with
the middle of year three. Complete the process for all the years.


2. If you calculate an even-numbered moving average (e.g. 2, 4, 6 or 8), the
resulting averages will correspond between two time points. However, a
trend value is required to coincide with a particular point in time; therefore
an extra step is required to centre the averages. This is done by calculating a
moving average of 2 on the first moving averages column.
3. Plot the moving averages on a graph with the original data to show how the
time series is smoothed.
4. The longer the time period covered in computing the average, the smoother
the resulting curve. If the period is too long, a straight line will result and
the general direction of the curve will be lost.
5. The moving average forecast for the next year is the moving average of the
preceding period.

Example 7.3

The quarterly sales of petrol at Jack’s Garage are represented in the table below.

Time period    Sales   3-quarterly       4-quarterly moving average
Year  Quarter          moving average    4-quarterly average   Centred average
I     1         40
      2         37         46
                                              49.0
      3         61         52                                      46.0
                                              43.0
      4         58         45                                      45.2
                                              47.5
II    1         16         43                                      50.2
                                              52.8
      2         55         51                                      49.0
                                              45.2
      3         82         55                                      48.2
                                              51.2
      4         28         50                                      53.1
                                              55.0
III   1         40         46                                      51.0
                                              47.0
      2         70         53                                      51.8
                                              56.5
      3         50         62                                      58.4
                                              60.2
      4         66         57                                      56.6
                                              53.0
IV    1         55         54                                      58.0
                                              63.0
      2         41         62                                      62.8
                                              62.5
      3         90         65
      4         64


1. The first value in the 3-quarterly moving average column is obtained by (40
+ 37 + 61) ÷ 3 = 46. Write the answer in the middle of the second quarter,
which is the middle of the time period used to calculate this average. Move
down one quarter from the top and calculate the second value: (37 + 61 +
58) ÷ 3 = 52. This answer corresponds with the middle of the third quarter.
2. The 4-quarterly moving average requires two columns. The first moving
average value in the column is: (40 + 37 + 61 + 58) ÷ 4 = 49. Write this
answer between the second and third quarters, because that is the middle of
this time period used to calculate the average. Move down one quarter from
the top and calculate the second value in the column: (37 + 61 + 58 + 16)
÷ 4 = 43. This answer corresponds with the position between the third and
the fourth quarters.

These values do not correspond with the middle of a specific time period;
therefore a centred column is required. The first value in this column is obtained
by: (49 + 43) ÷ 2 = 46. This answer corresponds with the middle of the third
quarter. Move down one quarter and calculate the second value: (43 + 47.5) ÷
2 = 45.2. This answer corresponds with the middle of the fourth quarter.
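
The three columns of the table can be generated with two small helpers. This is an illustrative sketch; the function names are our own, and the values are left unrounded, so 45.25 appears where the table shows 45.2.

```python
def moving_average(values, k):
    """Means of every run of k consecutive values."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]

def centred_moving_average(values, k):
    """For even k: average adjacent k-period means so each lines up with a period."""
    plain = moving_average(values, k)
    return moving_average(plain, 2)

# Quarterly petrol sales for Jack's Garage, years I to IV.
sales = [40, 37, 61, 58, 16, 55, 82, 28, 40, 70, 50, 66, 55, 41, 90, 64]

three = moving_average(sales, 3)             # first value belongs to quarter 2 of year I
four = moving_average(sales, 4)              # first value falls between quarters 2 and 3
centred = centred_moving_average(sales, 4)   # first value belongs to quarter 3 of year I

print(three[:2], four[:2], centred[:2])
```

Each list is shorter than the original series (14, 13 and 12 values against 16), which is the loss of periods at the beginning and end of the series mentioned earlier.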
[Figure: Quarterly petrol sales for Jack’s Garage. Horizontal axis: Time period, quarters 1 to 4 for each of four years; vertical axis: Sales (R’000), 0 to 100.]

Activity 7.4

Construct a four-year and a five-year moving average to smooth the following
time series and graph the results.


Year y
2005 14
2006 20
2007 40
2008 30
2009 28
2010 42
2011 51
2012 25
2013 32

7.3.2 Seasonal variation (S)


Seasonal variations occur within a period of one year or less. Therefore, period
data (weekly, monthly, quarterly and daily) is required. Seasonal variation is
generally expressed as an index number and can be identified using the
ratio-to-moving-average method. A requirement for this method is that we have a
time series long enough to allow us to observe the variable over several
seasons.

Ratio-to-moving-average method

Steps
1. List data in date order.
2. Determine the time period to be used for the moving average. If the data is
monthly, use a 12-monthly moving average. Quarterly data needs a four-
quarterly moving average and six days a week needs a six-daily moving
average.
3. Calculate the required moving average. Remember, if the moving average
period is an even number, centre the averages by averaging adjacent moving
averages.
4. Express the original time series values as percentages of the corresponding
centred moving averages by dividing the moving average into the original
data and multiplying the result by 100. These are the individual seasonal
percentages.


5. Summarise the seasonal percentages in a new table by grouping together
the seasonal time periods. For example, all the first quarters together, all the
second quarters together, etc. Use the modified mean approach to compute
an unadjusted seasonal index per period.

Note: A modified mean is the arithmetic mean of the values that remain after
elimination of the smallest and largest values in the column.
6. Add the unadjusted seasonal indexes.
7. Determine the factor needed to adjust the index numbers to typical index
numbers.
Typical quarterly index = 100 × 4
Typical monthly index = 100 × 12

$$\text{Factor} = \frac{\text{total typical index}}{\text{total of unadjusted indexes}}$$

8. Calculate the adjusted typical seasonal index numbers by multiplying the
unadjusted index numbers by this factor.
9. Seasonal indexes can be included in short-term forecasts.
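
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function names are our own, the series is assumed to start in the first season, the period is assumed to be even (for example four quarters), and each season needs at least three percentages so that something remains after the smallest and largest are dropped. The quarterly petrol sales from the moving-average example are reused purely as sample input.

```python
def centred_ma(values, period):
    """Centred moving average for an even period (for example 4 quarters)."""
    plain = [sum(values[i:i + period]) / period
             for i in range(len(values) - period + 1)]
    return [(p + q) / 2 for p, q in zip(plain, plain[1:])]

def seasonal_indexes(values, period=4):
    """Adjusted seasonal indexes by the ratio-to-moving-average steps."""
    centred = centred_ma(values, period)
    offset = period // 2                        # first centred value lines up here
    by_season = [[] for _ in range(period)]
    for i, cma in enumerate(centred):
        position = i + offset
        ratio = values[position] / cma * 100    # step 4: seasonal percentage
        by_season[position % period].append(ratio)
    # Step 5: modified mean, dropping the smallest and largest percentage.
    unadjusted = [sum(sorted(r)[1:-1]) / (len(r) - 2) for r in by_season]
    # Steps 6 to 8: scale so that the indexes total 100 x period.
    factor = 100 * period / sum(unadjusted)
    return [u * factor for u in unadjusted]

# Illustration only: the quarterly petrol sales from the moving-average example.
sales = [40, 37, 61, 58, 16, 55, 82, 28, 40, 70, 50, 66, 55, 41, 90, 64]
indexes = seasonal_indexes(sales)
print([round(s, 1) for s in indexes], round(sum(indexes), 1))
```

With only four years of quarterly data each season has three percentages, so the modified mean reduces to the middle value; a longer series gives the trimming more to work with.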

Seasonalised forecasting for periodically reported data

Steps
1. Determine a linear trend over the given data using: $\hat{y} = a + bx$
2. Code the different time units (quarters, months, weeks, etc) within the
forecasting period.
3. Calculate the trend value for each time unit within the forecasting period.
4. Multiply the trend value for each time unit by the seasonal index of that
time unit and divide the result by 100.

Deseasonalising data
The influence of seasonality can be removed from a time series by dividing
each original value in the series by the appropriate typical seasonal index for
that period and then multiplying the result by 100. The result is known as
deseasonalised data. Deseasonalised data is used if we wish to compare data
across seasons to determine if an increase or decrease, irrespective of seasonal
trends, has taken place.

Example 7.4

The quarterly income (R’000) of a soft drink company has been recorded for
four years. The time period in date order is shown in column 1 and the actual
sales (y) in column 2.
1. Quarterly data is given; therefore a four-quarterly moving average is
determined and listed in column 3. The first value is: (52 + 67 + 85 + 54)
÷ 4 = 64.5. This answer corresponds with a position between the second
and third quarters of 2009. The second value in the column is calculated by
moving down one quarter: (67 + 85 + 54 + 57) ÷ 4 = 65.8. This answer
corresponds with a position between the third and fourth quarters of 2009.
By moving down one quarter at a time, calculate the rest of the moving
averages.

2. The moving average period (four-quarterly) is an even number; therefore
the values in column 3 must be centred. The first centred value is (64.5 +
65.8) ÷ 2 = 65.2. This answer corresponds with a position in the middle of
the third quarter of 2009. The second value is calculated by moving down
one quarter: (65.8 + 67.8) ÷ 2 = 66.8. This answer corresponds with a
position in the middle of the fourth quarter of 2009. By moving down one
quarter at a time, calculate the rest of the centred averages and enter them
in column 4.

3. Obtain the percentage in column 5 by dividing the value in column 2
(actual income) by the corresponding value in column 4 (centred averages)
and multiplying the result by 100. The first value is: (85 ÷ 65.2) × 100 =
130.4. This corresponds with the third quarter of 2009.

(1)             (2)      (3)         (4)        (5)
Year  Quarter   y        4-q m.a.    Centred    %
2009  1         52
      2         67
                         64.5
      3         85                   65.2       130.4
                         65.8
      4         54                   66.8       80.8
                         67.8
2010  1         57                   68.4       83.3
                         69
      2         75                   69.9       107.3
                         70.8
      3         90                   71.2       126.4
                         71.5
      4         61                   71.8       85.0
                         72
2011  1         60                   72.5       82.8
                         73
      2         77                   73.2       105.2
                         73.5
      3         94                   74.2       126.6
                         75
      4         63                   75.9       83.0
                         76.8
2012  1         66                   77.3       85.4
                         77.8
      2         84                   78.3       107.3
                         78.8
      3         98
      4         67
Total           1 150

4. Construct a summary table with the years in the first column and the
quarters in the top row. Enter the values from the percentage column into
the summary table. The purpose of this table is to group together all first
quarters, second quarters, third quarters and fourth quarters.

5. Calculate the unadjusted index for the first quarter by cancelling the smallest
value (82.8) and the highest value (85.4) in the column and averaging the
remaining values (83.3 ÷ 1 = 83.3). This average is known as the modified
mean. Do the same for the other quarters.

6. The factor needed to change the unadjusted index numbers to typical
seasonal index numbers is:

$$\frac{400}{400.2} = 0.9995$$

7. Multiply the unadjusted index number of each quarter by this factor to
obtain the adjusted or typical index numbers.

Summary table:
Year            Quarter 1   Quarter 2   Quarter 3   Quarter 4   Total
2009                                    130.4       80.8
2010            83.3        107.3       126.4       85.0
2011            82.8        105.2       126.6       83.0
2012            85.4        107.3
Unadjusted      83.3        107.3       126.6       83.0        400.2
                × 0.9995    × 0.9995    × 0.9995    × 0.9995
Typical index   83.3        107.2       126.5       83.0        400.0

8. Interpret the index numbers: the influence of the season caused the sales
during the first quarter to be 16.7% lower than expected. The second quarter
sales are 7.2% higher than expected, the third quarter sales are 26.5% higher
than expected and the fourth quarter sales are 17% lower than expected due to
the influence of the season.
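
The ratio-to-moving-average steps can also be checked with a short program. The sketch below is only an illustration, not part of the method itself: the function names are made up for this example, and it assumes the series starts in the first season of a year. It uses the income figures of Example 7.4 and keeps full precision at every step, so intermediate values can differ slightly from the rounded values in the tables above.

```python
# Quarterly income (R'000) from Example 7.4, first quarter 2009 to fourth quarter 2012.
income = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 84, 98, 67]


def centred_moving_averages(values, period):
    """Centred moving averages for an even period, keyed by position in the series."""
    moving = [sum(values[i:i + period]) / period
              for i in range(len(values) - period + 1)]
    half = period // 2
    # Averaging two adjacent moving averages centres the result on an actual period.
    return {i + half: (moving[i] + moving[i + 1]) / 2
            for i in range(len(moving) - 1)}


def modified_mean(values):
    """Mean of the values left after dropping the smallest and the largest value."""
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)


def typical_seasonal_indexes(values, period):
    """Typical seasonal indexes (assumes position 0 is the first season of a year)."""
    centred = centred_moving_averages(values, period)
    percentages = {season: [] for season in range(period)}
    for position, average in centred.items():
        percentages[position % period].append(values[position] / average * 100)
    unadjusted = [modified_mean(percentages[season]) for season in range(period)]
    factor = (100 * period) / sum(unadjusted)  # typical indexes must total 100 x period
    return [index * factor for index in unadjusted]


indexes = typical_seasonal_indexes(income, 4)
print([round(index, 1) for index in indexes])
```

Rounded to one decimal, the four indexes agree with the typical indexes in the summary table, and they add up to 400 by construction.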

Forecasting:
(1)             (2)      Seasonal   (6)                x-code   xy       x²
Year  Quarter   y        index      Deseasonalised y
2009  1         52       83.3       62.4               1        52       1
      2         67       107.2      62.5               2        134      4
      3         85       126.5      67.2               3        255      9
      4         54       83.0       65.1               4        216      16
2010  1         57       83.3       68.4               5        285      25
      2         75       107.2      70.0               6        450      36
      3         90       126.5      71.1               7        630      49
      4         61       83.0       73.5               8        488      64
2011  1         60       83.3       72.0               9        540      81
      2         77       107.2      71.8               10       770      100
      3         94       126.5      74.3               11       1 034    121
      4         63       83.0       75.9               12       756      144
2012  1         66       83.3       79.2               13       858      169
      2         84       107.2      78.4               14       1 176    196
      3         98       126.5      77.5               15       1 470    225
      4         67       83.0       80.7               16       1 072    256
Total           1 150               1 150              136      10 186   1 496

1. The deseasonalised data in column 6 is obtained by dividing the original data
in column 2 by the appropriate typical index for that period and multiplying
the result by 100. The first value in column 6 is: (52 ÷ 83.3) × 100 = 62.4.
This means that the income for the first quarter of 2009 would be R62 400 if
there had been no seasonal variation. The second value is:
(67 ÷ 107.2) × 100 = 62.5.
2. To do a seasonalised forecast per quarter for 2013, code the original time
period and use the method of least squares together with the sales data to
obtain the quarterly trend equation: $\hat{y} = 61.59 + 1.21x$
Note: You can also use the deseasonalised data to calculate your trend
values.
3. Code the period you want to forecast and substitute the x in the trend
equation with the appropriate code for that time period to obtain the trend
value. Multiply each trend value by the seasonal index for that period and
divide the result by 100.

First quarter: $\hat{y} = 61.59 + 1.21(17) = 82.16$

$$82.16 \times \frac{83.3}{100} = 68.44$$

Year  Quarter   x-code   Trend forecast   Seasonal index   Seasonalised forecast
2013  1         17       82.16            83.3             68.44
      2         18       83.37            107.2            89.37
      3         19       84.58            126.5            106.99
      4         20       85.79            83.0             71.21
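
The deseasonalising and forecasting steps can be sketched in the same way. The code below is an illustration under stated assumptions: the function names are hypothetical, the typical indexes are taken as given from the summary table, and the x-codes run 1, 2, 3, ... from the first quarter of 2009. Because it does not round the trend coefficients, it gives a = 61.60 and b ≈ 1.209 rather than the rounded 61.59 and 1.21, so the forecasts differ from the table above by a cent or two.

```python
# Quarterly income (R'000) and typical seasonal indexes from Example 7.4.
income = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 84, 98, 67]
typical_index = [83.3, 107.2, 126.5, 83.0]  # quarters 1 to 4


def deseasonalise(values, indexes):
    """Divide each value by the index of its season and multiply by 100."""
    period = len(indexes)
    return [y / indexes[i % period] * 100 for i, y in enumerate(values)]


def least_squares_trend(values):
    """Return (a, b) of the trend line y-hat = a + b*x with x coded 1, 2, ..., n."""
    n = len(values)
    xs = range(1, n + 1)
    sum_x = sum(xs)
    sum_y = sum(values)
    sum_xy = sum(x * y for x, y in zip(xs, values))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


def seasonalised_forecast(a, b, x_code, indexes):
    """Trend value for x_code, scaled by the seasonal index of that period."""
    trend = a + b * x_code
    season = (x_code - 1) % len(indexes)  # x-code 1 is the first quarter
    return trend * indexes[season] / 100


a, b = least_squares_trend(income)
deseasonalised = deseasonalise(income, typical_index)
forecasts_2013 = [seasonalised_forecast(a, b, x, typical_index)
                  for x in range(17, 21)]

print(round(a, 2), round(b, 2))
print([round(value, 1) for value in deseasonalised[:4]])
print([round(value, 2) for value in forecasts_2013])
```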

Activity 7.5

The owner of a pizzeria recorded the number of pizzas sold during the past four
weeks in order to determine the influence of the day of the week on the sales. Do
a seasonalised forecast per day for week 5 and graph the original time series, the
deseasonalised time series and the trend line.
Week Monday Tuesday Wednesday Thursday Friday
1 12 18 16 25 31
2 11 17 19 24 27
3 14 16 16 28 25
4 17 21 20 24 32

TEST YOURSELF 7
1. Complete the following table assuming the classical multiplicative model:
Trend (T) C S Forecast
Winter 130 80 120
Spring 132 90 100
Summer 134 100 70
Autumn 136 110 110
2. Using the classical model, find the missing values:
Trend C S I Sales
Winter 100 100 90 99
Spring 200 80 100 168
Summer 300 110 110
Autumn 400 120 120 604.8
Total
3. The total units of new government housing under construction for the past
six years in the Gauteng province are given below:
Year Total units
2008 1 488
2009 1 014
2010 1 354
2011 1 474
2012 1 617
2013 1 666
Forecast the number of units that will be built during 2014 and 2015.
4. The data given below (in hundreds) was prepared by a marketing research
agency for Radio UJ:
Year Audience size
2005 31
2006 32
2007 33
2008 30
2009 29
2010 30
2011 28
2012 26

a) Estimate the audience for the year 2013 using the method of least
squares and graph the series.
b) Estimate the audience for the year 2013 using the method of semi-
averages.
5. a) Use the data given in the following table to smooth the time series by
making use of a:
two-year moving average
three-year moving average
four-year moving average.
b) Graph the original data and the moving averages.
c) Forecast the value for the year 2014.
2005 2006 2007 2008 2009 2010 2011 2012 2013
642 819 845 755 767 720 749 794 686

6. The numbers of operational nuclear power reactors in the world for the
given years are listed in the following table:
Year 2008 2009 2010 2011 2012 2013
Number of reactors 83 99 108 110 105 104

Forecast how many nuclear plants will be operational during 2014.


7. The expenses (R’000) of Mr Hyde’s Chemistry Research Lab are listed for the
previous seven years:
2008 2009 2010 2011 2012 2013 2014
19 200 13 800 11 270 13 800 9 400 8 900 9 000

Forecast the expenses for 2015.


8. The table data represent the ultraviolet index for Durban during a holiday
season:
1 Dec 2 Dec 3 Dec 4 Dec 5 Dec 6 Dec 7 Dec 8 Dec 9 Dec 10 Dec
9 4 10 10 9 8 8 10 10 9

Forecast the index for the next two days.


9. Metabolic rate is the rate at which the body consumes energy and is
important in studies of mass gain, dieting and exercise. The metabolic rates,
in calories per 24 hours, of a man who took part in a study of dieting, are
recorded for seven weeks:

Week 1 Week 2 Week 3 Week 4 Week 5 Week 6 Week 7

1 867 1 792 1 666 1 614 1 460 1 439 1 362

Forecast the metabolic rate for week 8.


10. The quarterly sales data (R’000) for Ajax washing powder are provided
below:
Summer Autumn Winter Spring
2009 10.4 11.8 8.5 7.5
2010 12.2 13.6 9.5 8.8
2011 13.5 13.1 10.4 9.7
2012 11.7 12.9 9.5 8.4
2013 13.7 15.0 10.9 10.1
a) Do a seasonalised quarterly forecast for 2014.
b) Deseasonalise the time series.
c) Graph the time series, trend values and deseasonalised sales.
11. The sale of municipality houses to existing tenants is tabulated below:
No of houses sold
Year Jan–Apr May–Aug Sep–Dec
2010 43 80 60
2011 60 100 70
2012 80 120 90
2013 90 140 100
a) Deseasonalise the time series.
b) Forecast seasonalised house sales per term for 2014.
12. The electricity consumption (kilowatt-hours) for an engineering
workshop is given in the following table:
Jan–Feb Mar–Apr May–Jun Jul–Aug Sep–Oct Nov–Dec
2011 245 220 190 185 200 225
2012 248 215 186 187 201 230
2013 250 225 189 188 198 226

a) Deseasonalise the time series.


b) Forecast seasonalised consumption per period for 2014.

13. The daily income (R’00) of a dry cleaner is tabulated below:


Mon Tue Wed Thu Fri Sat
Week 1 27 23 29 28 37 55
Week 2 30 25 35 33 36 57
Week 3 32 28 34 34 40 54

a) Deseasonalise the time series.


b) Forecast seasonalised income for week 4.
14. The total expenditure (R’000) on part-time teaching by the statistics
department is tabulated below:
2010 2011 2012 2013
Jan–Jun 155 150 140 130
Jul–Dec 100 95 90 80

a) Forecast the seasonalised expenditure per half-year for 2014.


b) Deseasonalise the time series.

UNIT 8
Probability: basic concepts

The theory of probability grew out of the study of various games of chance using
coins, dice, cards, lottery and gambling machines. Since then, probability theory
has been developed to determine uncertainties in our everyday lives as well.

After completion of this unit you will be able to:


• define probability
• describe the classical, empirical and subjective approaches to probability
• understand the meaning of basic terms used in probability theory
• apply the properties of probability
• calculate probabilities using the rules of addition and multiplication
• use the counting rules.

There is an implied condition of uncertainty in each of the following questions:


• Is there a link between second-hand smoking and asthma in young children?
• What are my chances of getting that new job?
• What is the estimated influence of the Aids epidemic on population growth?
• What is the chance that it will rain today?
We make decisions in the face of uncertainty. That means facing situations
where it is possible that things could turn out in different ways, but we simply
do not know how probable each event or outcome is. Our need to cope with
this risk, or the ‘chance that it will happen’, leads us to the study and use of
probability theory.
Inferential statistics involves using statistics obtained from a sample to make
estimates and decisions concerning the entire population. We can never be
certain that our decisions are correct, but to assess how good they will be we need
to know how to measure ‘chance’ and ‘probability’. The science of measuring
‘uncertainty’ is called probability.

Note: Probability describes the relative possibility (chance or likelihood) that an
event will occur.

8.1 Language of probability
• An experiment or investigation is an action that generates the uncertain
outcomes to which we will assign probabilities.
• A particular result of an experiment is an outcome.
• A sample space of a random experiment is a list of all the possible outcomes
of the random experiment.
• The individual outcomes in a sample space are called simple events. An event
is a collection of one or more outcomes of an experiment.

Example 8.1

A family has three children. Their blood type can either be O or A. The sample
space for the blood types of the three children contains eight possible outcomes:
[OOO, AOO, OAO, OOA, AAO, AOA, OAA, AAA]

Each of these outcomes determines a simple event.


1. The event that exactly one of the children is of blood type A =
[AOO, OAO, OOA]
2. The event that at most one child has blood type A = [OOO, AOO, OAO, OOA]
3. The event that all children have the same blood type = [AAA, OOO]
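
The sample space and events of Example 8.1 can be listed mechanically. The sketch below is illustrative only: the variable names are invented for this example, and each outcome is written as a three-letter string in birth order.

```python
from itertools import product

# Every child is either blood type O or A, so there are 2 x 2 x 2 outcomes.
sample_space = ["".join(outcome) for outcome in product("OA", repeat=3)]

# An event is the collection of outcomes that satisfy a condition.
exactly_one_a = [o for o in sample_space if o.count("A") == 1]
at_most_one_a = [o for o in sample_space if o.count("A") <= 1]
same_type = [o for o in sample_space if len(set(o)) == 1]

print(len(sample_space), sample_space)
print(exactly_one_a)
print(at_most_one_a)
print(same_type)
```

The outcomes appear in a different order from the list in the example, but the same outcomes are in each event.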

Activity 8.1

The type of transmission – automatic (A) or manual (M) – is recorded for each of
the next two cars purchased from a certain dealer.
1. What is the random experiment?
2. What is the sample space?
3. List the outcomes in each of the following events:
a) that at least one car has an automatic transmission
b) that exactly one car has an automatic transmission
c) that neither car has an automatic transmission.

8.2  Approaches to assigning probabilities


Each outcome in a sample space has a probability. Each event has a probability.
The method you use to calculate a probability depends on the approach you use.

8.2.1 Classical approach


Classical probability is used in situations where the outcomes of the experiment
are equally likely. The probability of an outcome or event happening is computed
by dividing the number of favourable outcomes (or successes) by the number of
possible outcomes.

$$P(E) = \frac{\text{number of successes}}{\text{total number of outcomes}}$$

P(E) is read as the probability that event E will occur.

Steps
1. Identify the event (or success).
2. Find the number of successes.
3. Find the total number of outcomes in the experiment.
4. Divide the number of successes by the total number of possible outcomes.
5. Interpret the probability.
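
As a minimal sketch of these steps, assuming equally likely outcomes, the function below counts the successes and divides by the size of the sample space. The function name and the choice of one roll of a fair die as the experiment are illustrative assumptions.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """Number of successes divided by the total number of equally likely outcomes."""
    successes = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(successes), len(sample_space))


die = [1, 2, 3, 4, 5, 6]   # sample space for one roll of a fair die
even = {2, 4, 6}           # the event "an even number"

p_even = classical_probability(even, die)
print(p_even, float(p_even))
```

Using `Fraction` keeps the answer exact; `float()` converts it to a decimal when an interpretation such as "an even chance" is wanted.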

Example 8.2

In rolling a fair die once, each of the possible outcomes in the sample space (1,
2, 3, 4, 5 or 6) has an equal chance of occurring. In calculating the probability
of the event ‘obtaining an even number in one roll of the die’, the successes are
2, 4 and 6, so the number of successes is 3.

$$P(\text{even number}) = \frac{3}{6} = 0.5$$

Activity 8.2

In drawing a card from a deck of 52 cards, what is the probability that it will be
an ace?

8.2.2 Empirical probability


This approach is based on relative frequencies. The probability of an event
happening is determined by observing what proportion of the time similar
events happened in the past or if the experiment is repeated many times under
identical conditions.

$$P(E) = \frac{\text{number of times the event occurred}}{\text{total number of observations}}$$

As you increase the number of times an experiment is repeated, the empirical
probability of an event approaches the classical probability of the event.

Note: Chance behaviour is unpredictable over the short run, but has a regular
and predictable pattern in the long run.

Steps
1. Identify the event (or success).
2. Find the frequency of the event, that is the number of times the event
occurred in the experiment or in the past.
3. Find the total number of observations in the experiment or in the past
records.
4. Divide the number of times the event occurred by the total number of
observations.
5. Interpret the probability.

Example 8.3

1. If you toss a coin 10 times and get three heads, you obtain an empirical
probability of $\frac{3}{10}$. The frequency of the event ‘heads’ is 3 and the
total of the possible frequencies is 10. Because you tossed the coin only a few
times, your empirical probability is not representative of the classical
probability, which is $\frac{1}{2}$. If, however, you toss the coin several
thousand times, the empirical probability will move very close to the classical
probability.

2. A restaurant wants to determine the probability that its manager is going to
reject the next delivery of fresh oysters from a supplier. Records show that
the supplier sent the restaurant 90 batches of oysters in the past, and the
manager rejected 10 of them. The frequency of the event ‘reject’ is 10 and
the total of all possible frequencies is 90.

$$P(\text{rejecting next batch}) = \frac{10}{90} = 0.11$$

With a probability of 0.11, the event is very unlikely to occur.

$$P(\text{accepting next batch}) = \frac{80}{90} = 0.89$$

With a probability of 0.89, this event is very likely to occur.
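
Both parts of Example 8.3 can be illustrated with a few lines of code. This is a sketch only: the function names are invented, and the coin is simulated with Python's pseudo-random number generator using a fixed seed, which is an assumption made so that the run is repeatable.

```python
import random


def empirical_probability(event_count, total_observations):
    """Relative frequency: times the event occurred / total observations."""
    return event_count / total_observations


def simulate_heads(tosses, seed=1):
    """Empirical probability of heads in a number of simulated fair-coin tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return empirical_probability(heads, tosses)


# Oyster deliveries: 10 of the 90 past batches were rejected.
p_reject = empirical_probability(10, 90)
print(round(p_reject, 2))

# A few tosses can be far from 0.5; many tosses settle close to it.
p_few = simulate_heads(10)
p_many = simulate_heads(100_000)
print(p_few, p_many)
```

The exact simulated values depend on the seed, but with 100 000 tosses the relative frequency should lie very close to the classical probability of 0.5.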

Activity 8.3

In a survey, a sample of 100 students were asked if they think that cloning of
humans should be allowed. Ninety-two said it should not be allowed, five said it
should be allowed and three had no opinion. Calculate the probabilities for each
event in the survey.

Activity 8.4

We all know that fruit is good for us and that we don’t eat enough. In a recent
study among a random sample of 75 teenage boys the following information
was collected:
Fruit servings per day Number of boys % of boys
0 20 27
1 15 20
2 15 20
3 12 16
4 8 11
5 5 6
Total 75 100
1. What is the probability of a teenage boy eating three fruit servings per day?
2. What is the probability of a teenage boy eating no fruit per day?

8.2.3 Subjective probability approach


Subjective probabilities are based on intuition and give a numerical estimate
of the likelihood that a particular outcome will occur using past experience,
judgement, educated guesses or opinion.

Example 8.4

Given a patient’s health and extent of injuries, a doctor may feel that the patient
has a 90% chance of full recovery.

8.3  Properties of probabilities


1. The probability for any event is between 0 and 1 inclusive. That means a
probability cannot be negative, nor can it exceed 1.
0 ≤ P(E) ≤ 1
• If P(E) = 0, then the event has no chance of occurring (impossible).
• If P(E) = 1, then the event is certain to occur.
• The closer the probability is to 1, the better the chance that the event will
happen.
• The closer the probability is to 0, the weaker the chance that the event will
happen.
• If P(E) = 0.5 we say the chance is 50–50 or there is an even chance.
• An event with a P(E) of more than 0.5 and less than 1 is said to be likely to
occur.
• An event with a P(E) of less than 0.5 but more than 0 is said to be unlikely
to occur.
2. The sum of the probabilities for all the possible outcomes (or events) of an
experiment must be equal to one: ΣP(E) = 1

Example 8.5

a) If you flip a coin, the possible outcomes are heads and tails.
The probability of obtaining heads is $P(H) = \frac{1}{2}$
The probability of obtaining tails is $P(T) = \frac{1}{2}$
If you add the probabilities of the two possible outcomes, the total is 1.
b) A restaurant wants to determine the probability that its manager is
going to reject the next delivery of fresh oysters from a supplier. Records
show that the supplier sent the restaurant 90 batches of oysters in the
past and the manager rejected 10 of them.

$$P(\text{rejecting next batch}) = \frac{10}{90} = 0.11$$

$$P(\text{accepting next batch}) = \frac{80}{90} = 0.89$$

If you add the probabilities of the two possible outcomes, the total is 1.

3. The complement rule: The complement of an event E is the event that E does
not occur. It consists of all the outcomes in the sample space that do not
belong to event E. The sum of the probabilities assigned to all the possible
outcomes in a sample space is equal to 1. If the probability of occurrence
of event E is $P(E)$ and the probability that event E will not occur is
$P(\overline{E})$, then

$$P(E) + P(\overline{E}) = 1 \quad \text{and} \quad P(E) = 1 - P(\overline{E})$$

Despite its simplicity, the complement rule can be very useful. The task of finding
the probability that an event of interest will not occur is sometimes easier or less
time-consuming than finding the probability that it will occur.

Example 8.6

If the probability of completing a job is 0.8, the probability of not completing
the job is:
P(not completing the job) = (1 − 0.8) = 0.2
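
The complement rule is simple enough to express as a one-line function. The sketch below is illustrative: the function name is an assumption, and the check that the input lies between 0 and 1 reflects the first property of probabilities listed above.

```python
from fractions import Fraction


def complement(p):
    """Probability that an event does not occur, given the probability p that it does."""
    if not 0 <= p <= 1:
        raise ValueError("a probability must lie between 0 and 1 inclusive")
    return 1 - p


# Example 8.6: the job is completed with probability 0.8.
p_not_completing = complement(0.8)
print(round(p_not_completing, 2))

# Oyster deliveries: rejecting and accepting are complements, so they sum to 1.
p_reject = Fraction(10, 90)
p_accept = complement(p_reject)
print(p_accept, p_reject + p_accept)
```

The decimal answer is rounded before printing because binary floating-point arithmetic does not store 0.8 exactly; the `Fraction` version avoids this.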

Activity 8.5

The probability that a typist will make at most five mistakes is 0.64. What is the
probability that she will make more than five mistakes?

Activity 8.6

We all know that fruit is good for us and that we don’t eat enough. In a
recent study done among a random sample of 75 teenage boys, the following
information was collected:
Fruit servings per day Number of boys % of boys
0 20 27
1 15 20
2 15 20
3 12 16
4 8 11
5 5 6
Total 75 100

1. What is the probability of teenage boys eating three fruit servings per day?
2. What is the probability of teenage boys not eating three fruits per day?
3. What is the probability that teenage boys will eat at least one fruit per day?

(Note: ‘At least one’ means one or more.)

8.4 Forming new events


Once some events have been specified, there are several useful ways of
manipulating them to create new events, known as compound events.

8.4.1 Union of two events (A or B)


The union of events A and B is represented by the symbol A∪B and merges all
the outcomes in event A with those in event B. An outcome is listed only once
in the combined event if it appears in both A and B.

8.4.2 Intersection of two events (A and B)


The intersection of two events is represented by the symbol A∩B and contains
all the outcomes that events A and B have in common. The probability of the
intersection is known as a joint probability.

Example 8.7

In an experiment consisting of selecting one card from a deck of playing cards,
you may be interested in some simple possible events. For example, the card
selected may be:
• A = the queen of hearts
• B = a red card
• C = a king
• D = a face card
• E = an ace
Some compound events that you can form from the simple events listed above:
• B or E = you are interested in drawing either one of the 26 red cards or any one of
the possible four aces, of which two are red – that is a total of 28 possible outcomes
(26 + 4 − 2) that will satisfy your event. Remember you can list an outcome only once.
• A or C = you are interested in the possibility that the card you select will either
be the queen of hearts or a king – that can be any one of five possible outcomes.

• B and C = you are interested in the possible outcomes that will result in a red
king card – it can be one of two possible outcomes. Only two of the king cards
are red.
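
Unions and intersections map directly onto set operations. The sketch below rebuilds the compound events of Example 8.7 with Python sets; the way each card is represented, as a (rank, suit) pair, is an assumption made for this illustration.

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 (rank, suit) pairs

A = {("Q", "hearts")}                                         # the queen of hearts
B = {card for card in deck if card[1] in ("hearts", "diamonds")}  # a red card
C = {card for card in deck if card[0] == "K"}                 # a king
E = {card for card in deck if card[0] == "A"}                 # an ace

# Union (|): outcomes in either event, each listed once.
print(len(B | E))   # 28
print(len(A | C))   # 5
# Intersection (&): outcomes the two events have in common.
print(len(B & C))   # 2
```

Because a set holds each element only once, the two red aces are not double counted in B or E.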

8.4.3 Display events graphically


Diagrams used to display events are known as Venn diagrams. They are a useful
technique for visualising relationships between events and for understanding
probability.
The collection of all possible outcomes, that is the sample space, is shown as
the interior of a rectangle with the various events drawn as circles inside the
rectangle. Each event is a subset of the sample space.

Example 8.8

Out of 40 students, 14 are enrolled for Accounting (A) and 29 for Statistics (S).
Five students are enrolled for both subjects.
• Only one event is displayed
If one student is selected from this sample, what is the probability that the
student is enrolled for Statistics? What is the probability that the student is not
enrolled for Statistics?
1. Draw a Venn diagram for the event: students enrolled for Statistics.
2. Draw a rectangle to represent the 40 students.
3. Draw one circle inside the rectangle to represent the 29 Statistics students.
4. The area outside the circle is the complement S^c: not enrolled for
Statistics (complement rule).

[Venn diagram: sample space of 40 students; circle Statistics (29); outside the
circle S^c = 40 − 29 = 11]

5. $P(\text{student enrolled for Statistics}) = \frac{29}{40}$

$P(\text{student not enrolled for Statistics}) = 1 - \frac{29}{40} = \frac{11}{40}$

• The union of two events (S or A)


1. The events are not mutually exclusive; therefore draw two overlapping
circles in the rectangle and label each one.
2. The total area in the two circles displays the union of the two events (S∪A). It
consists of all the outcomes in either event Statistics (S) or event Accounting
(A) or both.
3. The area where S and A overlap is where both S and A occur.

[Venn diagram: 40 students; two overlapping circles, Statistics and Accounting;
Statistics only 29 − 5 = 24; both subjects 5; Accounting only 14 − 5 = 9]

4. If one student is selected from this sample, what is the probability that
the student is enrolled in at least one of the subjects? This consists of all
the students enrolled in Statistics or in Accounting or in both subjects
(24 + 5 + 9 = 38).

$$P(S \text{ or } A) = \frac{38}{40}$$

If one student is selected from this sample, what is the probability that the
student is enrolled in neither subject? Such a student is not enrolled in either
of the two subjects, but is still one of the 40 students (40 − 38 = 2).

$$P(\overline{S \text{ or } A}) = \frac{2}{40}$$

What is the probability that a randomly chosen student from this sample is
enrolled in Statistics only? That excludes the students who are enrolled
in both Statistics and Accounting (29 − 5 = 24).

$$P(\text{Statistics only}) = \frac{24}{40}$$

What is the probability that a randomly chosen student from this sample is
enrolled in Accounting but not Statistics?

$$P(A \text{ and } \overline{S}) = \frac{9}{40}$$

• Intersection of two events – S and A: S∩A


1. The shaded area in this Venn diagram displays the intersection of two events
S and A.
2. This compound event is defined by the condition that both event S and event
A occur and consists of all outcomes common to both event S and event A.

[Venn diagram: two overlapping circles S and A; the shaded overlap is
S and A = 5]

3. What is the probability that a randomly chosen student from this sample is
enrolled in Statistics and Accounting?

$$P(S \text{ and } A) = \frac{5}{40}$$

• Union of two mutually exclusive events: A or C


1. Two events are said to be mutually exclusive if, when one occurs, the other
cannot occur at the same time. The two events are disjoint and have no
outcomes in common.
2. If you select a card from a deck of playing cards, what is the probability that
you will select either a queen or a king?
$$P(\text{queen or king}) = \frac{4}{52} + \frac{4}{52} = \frac{8}{52}$$

[Venn diagram: two non-overlapping circles, Queen and King]
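
The regions of the Venn diagrams in Example 8.8 can be recomputed from the four numbers given in the example. The sketch below is illustrative; the variable and function names are invented for this purpose.

```python
from fractions import Fraction

total = 40        # students in the sample
statistics = 29   # enrolled for Statistics
accounting = 14   # enrolled for Accounting
both = 5          # enrolled for both subjects

statistics_only = statistics - both                   # 24
accounting_only = accounting - both                   # 9
either = statistics_only + both + accounting_only     # 38, the union S or A
neither = total - either                              # 2


def probability(count):
    """Probability of selecting one of `count` students out of the whole sample."""
    return Fraction(count, total)


print(probability(statistics), probability(total - statistics))
print(probability(either), probability(neither))
print(probability(statistics_only), probability(accounting_only))
print(probability(both))
```

`Fraction` prints each result in lowest terms, so 38/40 is shown as 19/20 and 5/40 as 1/8; the values are the same as in the example.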

8.5  Probability rules for compound events


We can use various rules of probability to compute the probabilities of the more
complex, related events.
1. If an event (E) is made up of two (or more) simple events (A and B), the
event can be formed either as:
• the union of two or more events – all the outcomes that make up the two events
• the intersection of two events – the outcomes that fulfil the conditions for
both events.
2. Two or more events are mutually exclusive if the occurrence of one event
means that none of the other events can occur at the same time. An outcome
can belong to event A or to event B but not to both.
If you flip a coin, you can have either heads or tails. Both can’t happen at
the same time!
3. Two events are independent if the occurrence of one is in no way affected by
the occurrence of the other; that is, they are unrelated.
If you flip two coins and you obtained heads on the one, it will have no
influence on the outcome of the second flip.
4. If there is a particular relationship between events such that the occurrence
of one event affects the occurrence of the second event, the events are
dependent. The probability attached to the occurrence of such events is
known as conditional probability.

8.5.1 Addition rules


The addition rules are used if we want to find the probability of one event or
another occurring.

Special rule of addition


To apply the special rule of addition, the events must be mutually exclusive – the
two events are disjoint and therefore cannot occur simultaneously.
This rule states that the probability of event A or event B occurring in a given
single outcome equals the sum of their probabilities. This is the probability of
the union of A and B.
P(A or B) = P(A) + P(B)

Example 8.9

We all know that fruit is good for us and that we don’t eat enough. In a recent study
done among a random sample of 75 people, the following information was collected:
Fruit servings per day Number of people
0 20
1 15
2 15
3 12
4 8
5 5
Total 75

• The probability that a selected person eats two fruits per day is
$\frac{15}{75} = 0.20$.
The probability that a selected person will eat three fruits per day is
$\frac{12}{75} = 0.16$.
The two events are mutually exclusive because if the person eats exactly two
fruits per day he cannot eat exactly three fruits as well.
• The probability that a selected person will eat two or three fruits per day is
P(2 or 3) = P(2) + P(3) = 0.20 + 0.16 = 0.36.
• Calculate the probability that a randomly selected person will eat at least four
fruits per day.
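
For mutually exclusive categories such as these, the special rule of addition is just a sum of relative frequencies. The sketch below uses the frequency table of Example 8.9; the dictionary layout and the function name are assumptions made for illustration.

```python
from fractions import Fraction

# Fruit servings per day -> number of people (Example 8.9).
servings = {0: 20, 1: 15, 2: 15, 3: 12, 4: 8, 5: 5}
total = sum(servings.values())


def p_any_of(categories):
    """P(A or B or ...) for mutually exclusive categories: add their probabilities."""
    return sum(Fraction(servings[category], total) for category in categories)


p_two_or_three = p_any_of([2, 3])
print(p_two_or_three, float(p_two_or_three))
```

The same function can be used for any other combination of categories, because a person belongs to exactly one row of the table.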

Activity 8.7

The probabilities that a wine taster will rate a new shiraz as very poor, poor,
fair, good, very good or excellent are 0.05, 0.14, 0.17, 0.33, 0.20 and 0.11.
What are the probabilities that the shiraz will be rated as:
1. very poor or poor?
2. good, very good or excellent?

General addition rule


If the events are not mutually exclusive, it means that event A may occur or
event B may occur, or both A and B may occur in a single outcome.
This rule states that the probability that either event A or event B occurs
equals the probability that event A occurs plus the probability that event B
occurs minus the probability that both occur.

P(A or B) = P(A) + P(B) − P(A and B)

To avoid double counting the probability of the outcomes that fulfil the
conditions for both events, P(A and B) is subtracted from the sum of the
probability of A and B.

Note: In this rule P(A and B) denotes the probability that A and B both occur
in the same observation. In the multiplication rule P(A and B) denotes the
probability that event A occurs on one trial followed by event B on another trial.

Example 8.10

The probability that a person stopping at a petrol garage will ask to have his tyres
checked is 0.12, the probability that he will ask to have his oil checked is 0.29
and the probability that he will ask to have both checked is 0.07. What is the
probability that a person stopping at this garage will ask to have:
• either his tyres or his oil checked?
P(T or O) = P(T) + P(O) − P(T and O)
= 0.12 + 0.29 − 0.07 = 0.34
• neither his tyres nor his oil checked?
1 − P(T or O) = 1 − 0.34 = 0.66
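
The general addition rule can be written as a small function. In this sketch the function name and its default argument are illustrative assumptions: leaving the joint probability at zero gives the special rule for mutually exclusive events.

```python
def p_union(p_a, p_b, p_a_and_b=0.0):
    """General addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


# Example 8.10: tyres 0.12, oil 0.29, both 0.07.
p_either = p_union(0.12, 0.29, 0.07)
p_neither = 1 - p_either   # complement rule

print(round(p_either, 2), round(p_neither, 2))
```

The results are rounded before printing because decimal values such as 0.12 are not stored exactly in binary floating point.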

Activity 8.8

There are two secretaries in the office. The probability that the one secretary will
be absent on any given day is 0.08 and the probability that the other one will be
absent on any given day is 0.07. The probability that both will be absent is 0.02.
What is the probability that on a given day:
1. either or both secretaries will be absent?
2. at least one secretary comes to work?

8.5.2 Multiplication rules


These rules are used to determine the probability of event A and event B both
happening in two successive outcomes.

Special rule of multiplication


This rule states that when events A and B are independent, the probability of
both occurring together is the product of the individual probabilities.

P(A and B) = P(A) × P(B)

Two events are independent if the occurrence of one does not change the
probability of the next one occurring.


Example 8.11

A survey among students found that 46% of them suffer stress at least once a
week. If three students are selected at random, find the following probabilities:
• That all three will say that they suffer stress at least once a week:
The three students are independent.
P(all three suffer stress) = 0.46 × 0.46 × 0.46 = 0.097
• That none of them will say that they suffer stress:
If 46% suffer stress, then 54% will not suffer any stress, so
P(none of them suffer stress) = 0.54 × 0.54 × 0.54 = 0.157
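
As a quick check of Example 8.11, the special rule of multiplication can be sketched in Python. The helper name `p_all` is an assumption made for illustration, and the sketch applies only when the events are independent.

```python
import math

def p_all(probabilities):
    """Special rule of multiplication: multiply the probabilities of independent events."""
    return math.prod(probabilities)

p_stress = 0.46                      # P(a student suffers stress at least once a week)
p_all_three = p_all([p_stress] * 3)  # all three students suffer stress
p_none = p_all([1 - p_stress] * 3)   # none of the three suffers stress

print(round(p_all_three, 3))  # 0.097
print(round(p_none, 3))       # 0.157
```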

Activity 8.9

The quality control manager of a company questions the reliability of the two
quality control checks in the food processor manufacturing process. A worker
who manually checks the processors performs one check and a computer
monitor performs a second check. The manager knows that 5% of the time the
worker is apt to miss a defective processor and that 2% of the time the computer
will malfunction and fail to detect defective processors. What is the probability
that a worker will miss a defective processor and the computer will malfunction,
allowing a defective processor to leave the manufacturing process?

General rule of multiplication


This rule is used to combine events that are dependent. It makes use of
conditional probability.
Conditional probability is the probability of an event B occurring given that
event A has already occurred. The probability of the second event B is affected by
the outcome of the first event A.

P(A and B) = P(A) · P(B|A)

Conditional probability is denoted by P(B|A) and is read as the ‘probability of B,
given A happened’.


Example 8.12

You are not aware of it, but in a case of wine that you bought, five of the 12 bottles are
bad. The table below lists all the possible outcomes when two bottles are selected, together
with the probability of each possible outcome.
G = good
B = bad
G1G2 = first bottle good and second bottle good

1. To calculate P(G1G2):

There are seven good bottles in the case of 12, so P(G1) = 7/12.
The probability of G2 if the first bottle is good (G1): P(G2|G1) = 6/11. (Remember
that if the first bottle selected is good, there are only 11 bottles left in the box and
only six good ones. The condition is: if the first bottle is good.)
All the possible outcomes    Probability for each outcome
G1 G2                        7/12 × 6/11 = 42/132
G1 B2                        7/12 × 5/11 = 35/132
B1 G2                        5/12 × 7/11 = 35/132
B1 B2                        5/12 × 4/11 = 20/132
Total                        132/132 = 1

If you were to select two bottles from the case, what is the probability that:

2. both bottles are bad?

P(B1B2) = 5/12 × 4/11 = 20/132

3. one of the two bottles is bad?


There are two possibilities that will give you this answer.

Note: The probability of an event is the sum of all the possibilities that will give
you the answer.


P(G1B2) + P(B1G2) = (7/12 × 5/11) + (5/12 × 7/11) = 70/132
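
The four outcomes of Example 8.12 can be reproduced with exact fractions. This is a minimal sketch using Python's `fractions` module; the variable names are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction

GOOD, BAD = 7, 5    # bottles in the case (Example 8.12)
TOTAL = GOOD + BAD

# General rule of multiplication: P(A and B) = P(A) x P(B given A).
# After the first bottle has been taken out, only 11 bottles remain.
p_g1g2 = Fraction(GOOD, TOTAL) * Fraction(GOOD - 1, TOTAL - 1)
p_g1b2 = Fraction(GOOD, TOTAL) * Fraction(BAD, TOTAL - 1)
p_b1g2 = Fraction(BAD, TOTAL) * Fraction(GOOD, TOTAL - 1)
p_b1b2 = Fraction(BAD, TOTAL) * Fraction(BAD - 1, TOTAL - 1)

# One bad bottle can happen in two ways, so the two probabilities are added.
p_exactly_one_bad = p_g1b2 + p_b1g2

print(p_b1b2)             # 5/33, the reduced form of 20/132
print(p_exactly_one_bad)  # 35/66, the reduced form of 70/132
```

`Fraction` reduces every result to lowest terms, which is why the printed values look different from the unreduced fractions in the table while being equal to them.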
  ​

Example 8.13

A certain testing apparatus has two batteries. The probability that the first
battery will run down is 0.3 (B1) and the probability that both batteries (B1
and B2) will run down is 0.06. If the first battery is found to be flat, what is the
probability that the second battery will be flat?

P(B1 and B2) = P(B1) × P(B2|B1)

0.06 = 0.30 × P(B2|B1)

P(B2|B1) = 0.06/0.30 = 0.20

This means that 20% of the time, if the first battery is flat, the second battery
will also be flat.
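
Example 8.13 rearranges the general rule of multiplication to find a conditional probability from a joint probability. A small Python sketch of that rearrangement follows; the function name `p_conditional` is hypothetical.

```python
def p_conditional(p_a_and_b, p_a):
    """P(B given A) = P(A and B) / P(A); only defined when P(A) > 0."""
    if p_a <= 0:
        raise ValueError("P(A) must be greater than 0")
    return p_a_and_b / p_a

# Example 8.13: P(B1) = 0.30 and P(B1 and B2) = 0.06
p_b2_given_b1 = p_conditional(0.06, 0.30)
print(round(p_b2_given_b1, 2))  # 0.2
```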

Activity 8.10

A medical researcher has discovered a new test for tuberculosis. Experimentation
has shown that the probability of a positive test is 0.82, given that a person has
tuberculosis. The probability that the test registers positive and that the person
does not have tuberculosis is 0.04. Assuming that in the general population the
probability that a person has tuberculosis is 0.20, what is the probability that a
person chosen at random will:
1. have tuberculosis and a positive test?
2. not have tuberculosis?
3. have a positive test but not have tuberculosis?

8.5.3 Calculating probabilities using a contingency table


A two-way contingency table or cross-tabulation table lists the frequency of
each combination of the values of two variables.


Example 8.14

Equity among the judges


The female judges in a certain province recently lodged a complaint about the
most recent round of promotions. An analysis of the relationship between
gender and promotion was undertaken with the joint probabilities given in the
following table:
            Promoted   Not promoted   Total
Female        0.03         0.12       0.15
Male          0.17         0.68       0.85
Total         0.20         0.80       1.00

P(female) = 0.15
P(male) = 0.85
P(promoted) = 0.20
P(not promoted) = 0.80
P(female and promoted) = 0.03
P(male and promoted) = 0.17
P(female and not promoted) = 0.12
P(male and not promoted) = 0.68

1. What is the rate of promotion among female judges?


From this table you can see that, of all the judges in that province, only 0.15
or 15% are female. The rest, 85%, are male. To calculate the probability of
being promoted if you are a female, the following formula will be appropriate:
P(F and P) = P(F) · P(P|F)
0.03 = 0.15 × P(P|F)
P(P|F) = 0.20

2. What is the rate of promotion among male judges?


P(M and P) = P(M) · P(P|M)
0.17 = 0.85 × P(P|M)
P(P|M) = 0.20

3. Is an accusation of gender bias reasonable?


No, because the same proportion (20%) of the female judges is promoted as of the male judges.
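
The calculation in Example 8.14 can be sketched in Python by storing the joint probabilities in a dictionary and dividing each joint probability by the matching row total. The names `joint`, `marginal_gender` and `promotion_rate` are illustrative assumptions.

```python
# Joint probabilities from the table in Example 8.14
joint = {
    ("female", "promoted"): 0.03,
    ("female", "not promoted"): 0.12,
    ("male", "promoted"): 0.17,
    ("male", "not promoted"): 0.68,
}

def marginal_gender(gender):
    """Row total: P(gender), summed over both promotion outcomes."""
    return sum(p for (g, _), p in joint.items() if g == gender)

def promotion_rate(gender):
    """P(promoted given gender) = P(gender and promoted) / P(gender)."""
    return joint[(gender, "promoted")] / marginal_gender(gender)

rate_female = promotion_rate("female")
rate_male = promotion_rate("male")

print(round(rate_female, 2))  # 0.2
print(round(rate_male, 2))    # 0.2
```

Equal conditional probabilities for the two groups mean that promotion is independent of gender in this table.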


Activity 8.11

The table shows the results of a study of 102 children in which each child’s IQ was
examined and the child was tested for the presence of a specific gene.
             Gene present   Gene not present   Total
High IQ           33               19             52
Normal IQ         39               11             50
Total             72               30            102

Determine the probability that a child has:


1. a high IQ and the given gene
2. a normal IQ without the gene
3. the gene
4. a normal IQ
5. a high IQ, given that the child has the gene
6. the gene, if his IQ is normal.

8.5.4 Tree diagrams


Another useful method of calculating probabilities if there are several stages or
trials in the experiment is to use a probability tree. All the possible outcomes of
the experiment are represented by the branches of the tree.

Steps
1. Plot a dot on the left to represent the root of the tree.
2. Construct a column for each trial.
3. Start on the left at the dot and determine the possibilities for the first trial,
which forms the branches of the tree in the first column.
4. Branches grow from each of the original branches, representing the
possibilities for the second trial. The second stage is based on the choice made
in the first stage. Determine if the outcomes are dependent or independent.
5. The branches of the tree are weighted by probabilities; therefore show the
probabilities for each event on the branches.
6. List all the outcomes together with the joint probability for each combined
outcome.


7. Add the probabilities. Because the tree represents the sample space of the
experiment, the sum of the probabilities should equal 1.

Example 8.15

A bag contains five red balls and three black balls. Two balls are drawn from the
bag. Construct a probability tree to list all the possible outcomes together with
each outcome’s probability.
1. The first stage of the tree consists of the possibilities in the first draw. There
are only red and black balls in the bag, therefore if you draw one ball from
the bag it can be either red or black. These possible outcomes are represented
by the branches of a tree.
2. If the ball in the first draw was red, there are still four red balls and three
black balls in the bag; therefore the second ball you draw can be either red
or black. If the first ball was black, there are still five red and two black balls
in the bag; therefore the second ball you draw can be either red or black.
3. The probabilities alongside each possibility must be calculated. In the first
stage if you draw a ball, the chance that it is red is 5/8. The chance that the
first ball is black is 3/8. Please note that the first ball can either be red or black.
Only one of the two possibilities can happen.
4. If the first ball is red, there are only four red ones left and seven balls in the
bag, therefore the chance that the second one is red is only 4/7. But if the
second one is black the probability is 3/7.
5. If the first ball was black, the chance that the second one is red is 5/7 or the
second one can also be black with an associated probability of 2/7.
Stage 1      Stage 2      Outcome   Probability
r = 5/8      r = 4/7      rr        5/8 × 4/7 = 20/56
             b = 3/7      rb        5/8 × 3/7 = 15/56
b = 3/8      r = 5/7      br        3/8 × 5/7 = 15/56
             b = 2/7      bb        3/8 × 2/7 = 6/56
                          Total     56/56 = 1

6. If you add the probabilities of all the possible events, the total must be 1.
7. From the tree diagram, the probability of drawing two red balls is 20/56.
The probability of a red and a black ball is 15/56 + 15/56 = 30/56.


There are two possible outcomes that result in a red and a black ball. The
probability of an event is the sum of the probabilities of all the possibilities that
can result in the required event.
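
The probability tree of Example 8.15 can also be built programmatically. The sketch below is one possible way to do it, assuming two draws without replacement; the function name `two_draw_tree` and the outcome labels are illustrative.

```python
from fractions import Fraction

def two_draw_tree(red, black):
    """Return {outcome: joint probability} for two draws without replacement."""
    start = {"r": red, "b": black}
    total = red + black
    tree = {}
    for first in "rb":                     # stage 1 branches
        p_first = Fraction(start[first], total)
        remaining = dict(start)
        remaining[first] -= 1              # the first ball is not put back
        for second in "rb":                # stage 2 branches
            p_second = Fraction(remaining[second], total - 1)
            tree[first + second] = p_first * p_second
    return tree

tree = two_draw_tree(red=5, black=3)
p_two_red = tree["rr"]
p_red_and_black = tree["rb"] + tree["br"]  # two branches give one red and one black

print(p_two_red)           # 5/14, the reduced form of 20/56
print(p_red_and_black)     # 15/28, the reduced form of 30/56
print(sum(tree.values()))  # 1
```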

Activity 8.12

Approximately 10% of people are left-handed. If two people are selected at
random, use a probability tree to determine the probability that:
1. both are right-handed
2. both are left-handed
3. one is right-handed and the other, left-handed.

8.6 Counting the possibilities


A probability is calculated from the number of successes (the numerator) and the
number of possible outcomes (the denominator). The counting rules in this section
can be used to find the number of outcomes that can occur for a particular
experiment.

8.6.1 Multiplication rule of counting


The multiplication counting rule (mn rule) states that if there are m ways in
which event A can happen, and n ways in which event B can happen, then there
are m × n ways in which both can happen. This rule can be extended if there are
more events.

m × n × ...

Example 8.16

A computer password is to be made up consisting of four alphabetical characters.
How many different computer passwords can be designed if repetition of letters
is allowed?
There are 4 slots to be filled, one for each letter in the password.
Thus there are 26 × 26 × 26 × 26 = 456 976 computer passwords.
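
The multiplication counting rule can be checked in Python. In this sketch the helper `count_ways` is a hypothetical name, and the brute-force listing is done on a smaller two-letter case only to show that the rule agrees with an actual count.

```python
import math
import string
from itertools import product

def count_ways(*choices):
    """Multiplication counting rule: m x n x ... ways in total."""
    return math.prod(choices)

# Example 8.16: four slots, 26 letters each, repetition allowed
n_passwords = count_ways(26, 26, 26, 26)
print(n_passwords)  # 456976

# Check the rule by listing every two-letter password
n_two_letter = sum(1 for _ in product(string.ascii_lowercase, repeat=2))
print(n_two_letter == count_ways(26, 26))  # True
```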


Activity 8.13

If a restaurant menu has a choice of three salads, six main dishes and six
desserts, how many different dinners can be ordered?

8.6.2 Permutation rule


The permutation rule is used to determine the number of ways to arrange n
distinct objects taking them x at a time in a specific order.

nPx = n! / (n – x)!

Note: n! (pronounced n factorial) is the product of the whole numbers from n
downwards to 1. For example, 4! = 4 × 3 × 2 × 1 and 0! = 1. The factorial key
is available on most calculators.

Example 8.17

Suppose we need to select a group of 3 people from a larger group of 10. They
are to fill the roles of chairperson, secretary and treasurer in a committee. The
number of possible ways of filling these roles is:

10P3 = 10! / (10 – 3)! = 720
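
The permutation rule can be verified with Python's standard library. The function `n_permutations` below is an illustrative name; `math.perm` (available from Python 3.8) computes the same quantity, and `itertools.permutations` lists the ordered selections so that they can be counted directly.

```python
import math
from itertools import permutations

def n_permutations(n, x):
    """Permutation rule: n! / (n - x)! ordered selections of x objects from n."""
    return math.factorial(n) // math.factorial(n - x)

# Example 8.17: chairperson, secretary and treasurer chosen from 10 people
ways = n_permutations(10, 3)
print(ways)                      # 720
print(ways == math.perm(10, 3))  # True

# Counting the ordered selections one by one gives the same answer
listed = sum(1 for _ in permutations(range(10), 3))
print(listed)                    # 720
```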

Activity 8.14

Assume there are five carriages that need to be unloaded at a dock, but there is
only enough time left in the day to unload three of them. Since the goods in each
of the carriages are needed by customers, the order of unloading is important.
In how many ways can three of the five carriages be unloaded in first, second
and third order?

8.6.3 Combination rule


The combination rule is used to determine the number of ways to select x objects
from a larger set of n objects without regard to the order in which the objects are
selected. For example, ABC is considered the same selection as BCA or CBA. The
number of combinations of n objects taken x at a time is:

nCx = n! / (x!(n – x)!)

Example 8.18

A group of seven mountain climbers wishes to form a mountain climbing team
of five. How many different teams could be formed?

7C5 = 7! / (5!(7 – 5)!) = 21
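
A similar check works for the combination rule. In this sketch `n_combinations` is a hypothetical helper; `math.comb` (Python 3.8 and later) gives the same result, and the letters A to G simply stand in for the seven climbers.

```python
import math
from itertools import combinations

def n_combinations(n, x):
    """Combination rule: n! / (x! (n - x)!) selections where order is ignored."""
    return math.factorial(n) // (math.factorial(x) * math.factorial(n - x))

# Example 8.18: teams of five chosen from seven climbers
teams = n_combinations(7, 5)
print(teams)                     # 21
print(teams == math.comb(7, 5))  # True

# Listing the teams confirms the count
listed = len(list(combinations("ABCDEFG", 5)))
print(listed)                    # 21
```

Dividing by x! removes the orderings that the permutation rule would count separately, so the result is never larger than the corresponding number of permutations.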

Activity 8.15

You are given a list of 10 books and you are to read four of them. How many
possible combinations of four books are available from the list of 10?

TEST YOURSELF 8

1. There are six balls of the same size in a box: two are red, three are blue and
one is yellow. If you draw a ball from the box, what is the probability that
a) a red ball will be selected?
b) the ball will not be yellow?
c) the ball will be red or yellow?
2. This table shows the probability of each blood type for a randomly chosen
person in South Africa:
Blood type     A      O      B      AB
Probability    0.40   0.45   0.11   ?

a) What is the probability that a randomly chosen person has type AB
blood?
b) If you have type B blood and can receive blood transfusions from people
with blood types O and B, what is the probability that a randomly chosen
donor can donate blood to you?
3. The table below shows the results of a survey in which 500 adults were
asked why they don’t always eat healthy foods:


Reason                         Number responding
No time to cook                      175
Not available as take-aways           95
High cost                             85
Poor taste                            60
Hard to find                          55
Confusion about nutrition             30

a) Find the probability of a randomly selected adult who doesn’t always eat
healthy foods because he or she has no time to cook or is confused about
nutrition.
b) Find the probability of a randomly selected adult who feels that healthy
foods have poor taste or are hard to find.
4. What is the probability that an even number will result from one roll of a
die?
5. If you draw a Smartie at random from a box of Smarties, you can draw one
of six possible colours.
Colour        Blue   Green   Brown   Orange   Red    Yellow
Probability   ?      0.16    0.13    0.20     0.13   0.14

a) What is the probability of drawing a blue Smartie?


b) What is the probability that you will not draw a brown Smartie?
c) What is the probability that the Smartie you draw will either be yellow,
orange or red?
6. The probability that a car owner in a certain income bracket will drive a
Ford is 0.34 and the probability that he will drive a Toyota is 0.08. Find the
probabilities that such a person will:
a) not drive a Ford
b) drive a Ford or a Toyota
c) drive neither a Ford nor a Toyota.
7. The probability that a university student has hearing problems is 0.09; the
probability that a student has eyesight problems is 0.15. The probability
that a student will have a hearing and an eyesight problem is 0.01. What is
the probability that a randomly selected student will have a hearing or an
eyesight problem, but not both?
8. A welfare worker is studying the residents of a certain retirement community.
She finds that 20% of the residents receive disability payments, 85% receive
retirement incomes and 15% receive both disability and retirement incomes.


If a resident is randomly chosen, what is the probability that the person
receives a disability payment or retirement income?
9. In a certain lottery the probability of drawing a number divisible by 2 is
1/2, divisible by 3 is 1/3 and divisible by 6 is 1/6. What is the probability of
drawing a number that is divisible by either 2 or 3?
10. What is the probability of drawing either a heart or an ace from a deck of 52
playing cards?
11. An analysis of students’ records at a university revealed that 45% of the
students have an average C symbol and 25% have jobs; 10% of the students
have jobs and an average C symbol. What is the probability that a student
selected at random will have an average C symbol or have a job?
12. Of 100 individuals who applied for a lab technician position with a large
firm during the past year, 40 had some prior work experience and 30 had
a professional certificate. However, 20 of the applicants had both work
experience and a professional certificate. Determine the probability that a
randomly selected candidate had:
a) either work experience or a certificate
b) either work experience or a certificate but not both
c) a certificate, given that she had experience.
13. John is interviewed for a job at Karco. The probability that he will want the
job after the interview is 0.88. The probability that Karco will want him is
0.45. The probability that he will want the job if Karco wants him is 0.92.
a) What is the probability that John will want the job and that Karco will
want him?
b) What is the probability that Karco will want John if John wants the job?
14. The probability that a customer selects a pizza with mushrooms or pepperoni
is 0.43, and the probability that the customer selects mushrooms only
is 0.32. If the probability that he selects pepperoni only is 0.17, find the
probability of the customer selecting both items.
15. In a sample of 1 000 people, 120 are left-handed. Two unrelated people are
selected at random from the sample. Find the probability that:
a) both people are left-handed
b) neither person is left-handed
c) at least one of the two people is left-handed.
16. Out of 100 cars that start in the Grand Prix race, only 60 finish. The Total
team in the race enters two cars. What is the probability that:
a) both cars will finish?
b) neither of the two will finish?


17. If 18% of all South Africans are underweight, find the probability that if two
citizens are selected at random, both will be underweight.
18. Approximately 9% of men have a type of colour blindness that prevents
them from distinguishing between red and green. If three men are selected
at random, find the probability that all three will have this type of red–green
colour blindness.
19. The probability that a person has type O+ blood is 38%. Three unrelated
people are selected in a random sample. Find the probability that:
a) all three have type O+ blood
b) none of the three has type O+ blood
c) at least one has type O+ blood.
20. Students take two independent tests: 30% of them pass test A and 60% pass
test B.
Find the probability that a student selected at random passes:
a) both tests
b) only test A
c) only one test.
21. Ten students are being interviewed for appointment to the students’ council.
Six of them are female and four are male. If two are selected at random for a
newspaper interview, what is the probability that:
a) at least one is female?
b) one female and one male are selected?
22. A person owns a collection of 30 CDs, of which five are classical music. If
two CDs are selected at random, find the probability that both are classical
music.
23. A doctor gives a patient a 60% chance of surviving bypass surgery after a
heart attack. If the patient survives the surgery, he has a 50% chance that
the heart damage will heal. Find the probability that the patient survives the
surgery and that the heart damage will heal.
24. The probability that Jack parks in a disabled parking zone and gets a parking
fine is 0.06. The probability that Jack cannot find a legal parking space and
has to park in the disabled parking is 0.20. On Monday, Jack arrives at the
shopping centre and has to park in the disabled parking zone. Find the
probability that he will get a parking fine.
25. A batch of 10 calculators has three defective calculators. What is the
probability that a sample of three calculators will have:
a) no defective calculators?
b) all defective calculators?


c) at least one non-defective calculator?


26. Records were kept by the maintenance department of a large factory
concerning the lifetime, in days, of a particular drill bit:
Lifetime of bit               Number of drill bits
Less than five days                    6
Five and less than 10 days            12
10 and less than 20 days              34
More than 20 days                      4

What is the probability that a drill bit issued at random will last
a) less than 10 days?
b) five or more days?
27. There are 100 students in a class, of whom 40 are male. When questioned,
60 students agreed that they were happy in the school and of these, 30 were
female. Construct a contingency table and find the following probabilities
for a randomly selected student:
a) male and unhappy
b) female and happy
c) given a female, she is happy
d) an unhappy student
e) if it is a male, he is happy.
28. A study done on a sample of 1 000 people to determine answers on gender
and dominant hand produced the following information:
               Men   Women   Total
Left-handed     63      50     113
Right-handed   462     425     887
Total          525     475   1 000

If one person is selected at random, calculate the probability that the person:
a) is left-handed
b) is a woman
c) is a man or left-handed
d) if it is a man, he is left-handed
e) is a woman and left-handed.
29. The table below indicates the results of steroid tests given to 1 000 male and
500 female athletes who were randomly selected for testing.


Test result    Male   Female   Total
Positive        150       50     200
Negative        850      450   1 300
Total         1 000      500   1 500

What is the probability that a randomly selected athlete:


a) is male?
b) is female?
c) tested positive?
d) is female and tested negative?
e) is either male or tested positive?
f) is male, given that the athlete tested positive?
g) tested positive, given that the athlete is female?
30. A boutique owner buys from three companies, A, B and C. Last month’s
purchases are shown in the table below:
Product     A    B    C
Dresses    24   18   12
Trousers   13   36   15

If an item is selected at random, what is the probability that it:


a) was purchased from company A or is a dress?
b) was purchased from company B or company C?
c) is a pair of trousers or was purchased from company A?
31. Consumers were surveyed on the number of visits to a new shopping centre
and whether the centre was conveniently situated. The recorded data are
summarised in the following table:
Visits       Convenient   Not convenient
Often            60             20
Occasional       25             35
Never             5             60
What is the probability that a randomly selected consumer:
a) visits the centre often and finds it convenient?
b) if it is convenient, visits it occasionally?
c) never visits it?
d) finds the centre not convenient?
32. Students are engaged in various sports in the following proportions:
Rugby – 30% of all boys
Cricket – 20% of all boys
Soccer – 20% of all boys


Both rugby and cricket – 5%


Both rugby and soccer – 10%
Both cricket and soccer – 5%
All three sports – 2%
If a student is randomly chosen, use a Venn diagram to calculate the chance
that:
a) he will play at least one sport
b) he will be a rugby player or a soccer player
c) he does not play any sport.
33. Common sources of caffeine are coffee, tea and cola drinks. Suppose that
55% of students drink coffee, 25% drink tea and 45% drink cola. In
addition, 15% drink both coffee and tea, 5% drink all three, 25% drink both
coffee and cola, and 5% drink only tea. Draw a Venn diagram showing this
information and determine:
a) what percentage of students drink only cola
b) what percentage drink none of these beverages.
34. There are four blood types, A, B, AB and O. Blood can also be Rh+ or Rh–.
Finally, a blood donor can be classified as either male or female. In how
many different ways can a donor’s blood be labelled?
35. Four wires (red, green, blue and yellow) need to be attached to a circuit
board by a robotic device. The wires can be attached in any order and the
supervisor wishes to determine which order would be fastest for the robot to
use. How many possible sequences of assembly must be measured?
36. A research biologist is studying the effects on green beans of fertiliser type,
temperature at time of application and water treatment after application.
She has four fertiliser types, three temperature zones and three water
treatments to test. Determine the number of plots she needs in order to test
each fertiliser type, temperature range and water configuration.
37. The access code for your office’s security system consists of four digits. Each
digit can be 0 through 9. How many access codes are possible if each digit
can be used only once and not repeated? What will the answer be if each
digit can be repeated?
38. A food processing plant packages all its food in clear plastic that is sealed.
The quality control for the packaging process checks that: (1) the mass
shown is correct, (2) the label is correct and (3) the package is properly
sealed. These processes can be done in any order. In how many different
ways can a package be cycled through the three inspection stations?
39. If a coin is tossed four times, how many different outcomes are possible?


40. Space shuttle astronauts each consume an average of 3 000 calories per day.
One meal normally consists of a meat dish, a vegetable dish and a dessert.
The astronauts can choose from 10 meat dishes, eight vegetable dishes and
13 desserts. How many meal combinations are possible?
41. A mail-order company sells eight different books. As part of a special
promotion, customers may select three different books to make up a package.
How many different packages are possible?
42. There are 15 qualified applicants for five trainee positions in a fast food
management programme. How many different groups of trainees can be
selected?
43. A food technologist must select three tests to perform on ice cream. He has a
choice of seven tests. In how many ways can he perform three different tests?
44. A new drug is in the test phase: the first phase involves five volunteers and
the objective is to test the safety of the drug. If eight volunteers are available
and five of them are to be selected, how many different combinations of five
volunteers are possible?
45. The CEO of a research centre has to reduce the management staff from
10 to seven. He wants to get rid of the oldest three. How many possible
arrangements are there of the management staff in order of age?
46. The Big Triple at the local race track consists of picking, in the correct
order, the first three horses in the ninth race. How many possible Big Triple
outcomes are there if the ninth race is run by 12 horses? What is the
probability that your ticket will be a winning ticket?
47. A rugby team must schedule a game with each of three other teams. There are
five dates available for games. How many different schedules can be arranged?
What is the probability that it will be scheduled on one specific day?
48. Suppose that 60 of 200 students have Statistics as a subject, 40 have
Accountancy as a subject and 25 have both subjects. Portray the given data
using a Venn diagram and answer the following questions:
a) How many students have only Statistics as a subject? Calculate the
probability of having only Statistics as a subject.
b) How many have only Accountancy? What is the probability of taking
only Accountancy as a subject?
c) How many have Statistics or Accountancy or both? Calculate the
probability attached to this answer.
d) How many students have neither of the two subjects? What is the
probability of taking neither of the two subjects?



UNIT 9
Probability distributions

In Unit 8 you studied the computation of the probability of an event. This unit
introduces probability distributions and shows how to calculate probabilities
using a probability distribution.

After completion of this unit you will be able to:


• define a probability distribution
• distinguish between discrete and continuous random variables
• find the probability for a binomial investigation
• find the probability for a Poisson investigation
• find the probabilities for a normally distributed variable by transforming it into a
standard random variable.

In statistical experiments involving chance it is difficult to predict the exact
outcome of a variable and so it is known as a random variable. However, a
listing of all the possible outcomes of this random variable (x) can be done. Each
possible outcome has a particular chance or probability of occurring.
A probability distribution is a listing of all the possible outcomes a random
variable (x) can take, together with the probability of its occurrence. The sum of
the probabilities of all the possible outcomes is 1.
Probability distributions are classified as either discrete or continuous,
depending on the type of random variable.
• A random variable is discrete if it can assume a countable number of possible
values, such as 0, 1, 2, 3, etc.
• A random variable is continuous if it can take on any value over a given
interval of values, so that its possible values cannot be counted.

It is important that you can distinguish between discrete and continuous random
variables because different statistical techniques are used to analyse each.


9.1  Discrete probability distributions

Constructing a discrete probability distribution

Steps
1. List all the possible outcomes of an investigation in a frequency distribution.
2. Add the frequencies of each possible outcome.
3. Find the probability of each possible outcome by dividing the frequency of
the outcome by the sum of the frequencies.
4. Make sure each probability is between 0 and 1 and the total of all the
probabilities is 1.
5. The mean of a probability distribution is referred to as its expected value
and is denoted by E(x) = Σ[x · P(x)].
6. In this text we will cover the binomial and Poisson discrete probability
distributions.

Example 9.1

A survey asked a sample of 200 people how many times they donate blood each
year. The results are summarised as a probability distribution. The random
variable (x) represents the number of donations for one year.
x         f     P(x)    x · P(x)
0        60     0.30      0.00
1        50     0.25      0.25
2        50     0.25      0.50
3        20     0.10      0.30
4        10     0.05      0.20
5         6     0.03      0.15
6         4     0.02      0.12
Total   200     1.00      1.52

Interpretation: 30% of the people did not donate blood and 25% of the people
donated once.
The expected number of times someone will donate blood per year is 1.52.
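
The construction steps and the expected value for Example 9.1 can be sketched in Python. The variable names are illustrative assumptions; the frequencies are those in the table.

```python
# Frequency distribution from Example 9.1: x = number of donations per year
frequencies = {0: 60, 1: 50, 2: 50, 3: 20, 4: 10, 5: 6, 6: 4}

total = sum(frequencies.values())                            # step 2: add the frequencies
distribution = {x: f / total for x, f in frequencies.items()}  # step 3: P(x) = f / total

# Step 4: every probability lies between 0 and 1 and together they add up to 1
assert all(0 <= p <= 1 for p in distribution.values())
assert abs(sum(distribution.values()) - 1) < 1e-9

# Step 5: expected value, the sum of x times P(x) over all outcomes
expected_value = sum(x * p for x, p in distribution.items())
print(round(expected_value, 2))  # 1.52
```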


9.1.1 The binomial distribution


Many investigations result in responses for which there are only two possible
outcomes, such as yes or no, accept or decline, pass or fail, male or female, etc.
These investigations are known as binomial investigations.
The binomial distribution describes the probability distribution resulting
from the outcome of a binomial investigation. A binomial investigation usually
involves repeating the basic investigation a fixed number of times (n), called
trials, and observing the number of times that the outcome of interest (success)
(x) occurs.
A binomial probability distribution allows us to calculate the probability of a
specific number of successes (x = 0, 1, 2, ..., n) in a certain number of
trials (n).
A binomial experiment has the following properties:
• The experiment must consist of n identical trials.
• The trials are independent. That is, the outcome of one trial does not affect the
outcome of any other trial.
• Each trial has one of two possible mutually exclusive outcomes: success or
failure.
• Each trial has the same probability (p) of a ‘success’.
• The probability of a failure is denoted by (1 − p).
• The random variable (x) is the number of successes in the n trials of the
investigation.
• The probability distribution of x is given by:

$$P(x) = \frac{n!}{x!\,(n - x)!} \cdot p^{x} \cdot (1 - p)^{n - x}$$

Where:
x = the number of successes, 0, 1, 2, ..., n
n = number of trials or sample size
p = probability of success on each trial

Steps
1. Find the probability (p) of a success in each trial.
2. Find the number of trials (n).
3. Decide on the number of successes (x) for which you want to determine the
probability.
4. Substitute the values into the formula.
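
As a rough sketch, the four steps translate directly into a small Python function. The name `binomial_probability` and the values used (10 trials, p = 0.5, 3 successes) are hypothetical and serve only to illustrate the substitution; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_probability(x, n, p):
    """P(x) = n! / (x! (n - x)!) * p**x * (1 - p)**(n - x)."""
    # comb(n, x) evaluates n! / (x! (n - x)!) without computing large factorials.
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical values: step 1 gives p, step 2 gives n, step 3 gives x.
p = 0.5
n = 10
x = 3

# Step 4: substitute the values into the formula.
print(binomial_probability(x, n, p))

# The probabilities for x = 0, 1, ..., n together sum to 1.
print(round(sum(binomial_probability(k, n, p) for k in range(n + 1)), 10))
```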


Example 9.2

Suppose the probability is 0.2 that any given avocado will show measurable
damage when the temperature falls to 15 °C. Construct the binomial distribution for
a sample of five avocados if the temperature does drop to 15 °C.
p = 0.2
n = 5
x = 0, 1, 2, 3, 4, 5 (damage can occur in none of the five, one of the five, ...
up to all five)

x        P(x)
0        0.3277
1        0.4096
2        0.2048
3        0.0512
4        0.0064
5        0.0003
Total    1.0000

$$P(x = 0) = \frac{5!}{0!\,(5 - 0)!} \cdot 0.2^{0} \cdot (1 - 0.2)^{5 - 0} = 0.3277$$

$$P(x = 1) = \frac{5!}{1!\,(5 - 1)!} \cdot 0.2^{1} \cdot (1 - 0.2)^{5 - 1} = 0.4096$$

$$P(x = 2) = \frac{5!}{2!\,(5 - 2)!} \cdot 0.2^{2} \cdot (1 - 0.2)^{5 - 2} = 0.2048$$

$$P(x = 3) = \frac{5!}{3!\,(5 - 3)!} \cdot 0.2^{3} \cdot (1 - 0.2)^{5 - 3} = 0.0512$$

$$P(x = 4) = \frac{5!}{4!\,(5 - 4)!} \cdot 0.2^{4} \cdot (1 - 0.2)^{5 - 4} = 0.0064$$

$$P(x = 5) = \frac{5!}{5!\,(5 - 5)!} \cdot 0.2^{5} \cdot (1 - 0.2)^{5 - 5} = 0.0003$$

• The probability that none are damaged: P(x = 0) = 0.3277
• The probability that all five are damaged: P(x = 5) = 0.0003
• The probability that less than two are damaged:
Calculate P(x = 0) and P(x = 1) and add the answers.
P(x = 0 or 1) = P(x = 0) + P(x = 1) = 0.3277 + 0.4096 = 0.7373

Note: The probability of an event is the sum of the probabilities of all the
outcomes that make up the event.


• The probability that at least two will be damaged:
P(x ≥ 2) = P(x = 2, 3, 4 or 5) = 1 − P(x = 0 or 1)
= 1 − (0.3277 + 0.4096)
= 0.2627

Note: You can apply the complement rule or you can calculate P(x = 2, 3, 4 or
5) directly. Both give the same answer. Use the complement rule when it is the
shorter route to your answer.
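
The following sketch recomputes the distribution of Example 9.2 and checks that the complement rule and the direct sum agree. The variable names are illustrative only.

```python
from math import comb

n, p = 5, 0.2  # five avocados, probability of damage 0.2 (Example 9.2)

# P(x) for x = 0, 1, ..., 5
probabilities = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

less_than_two = probabilities[0] + probabilities[1]

# "At least two" computed in two ways: directly and with the complement rule.
at_least_two_direct = sum(probabilities[2:])
at_least_two_complement = 1 - less_than_two

for x, prob in enumerate(probabilities):
    print(x, round(prob, 4))
print(round(less_than_two, 4))
print(round(at_least_two_direct, 4), round(at_least_two_complement, 4))
```

With only six possible outcomes both routes are short; the complement rule saves more work as n grows.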

Activity 9.1

A shoe store’s records show that 30% of customers making a purchase use a
credit card to make payment. This morning seven customers purchased shoes
from the store. What is the probability that
1. three customers will pay using a credit card?
2. at least two customers will pay using a credit card?
3. more than five customers will pay using a credit card?
4. exactly three customers will not use a credit card to pay?

9.1.2 Poisson distribution


The Poisson distribution is a discrete distribution and is useful for calculating
the probability that a certain number of successes (x = 0, 1, 2, ...) will occur
over a specific interval of time, area or other measurement. There is no specific
upper limit to the count (n is unknown), although a finite count is expected.

Characteristics
1. The number of successes that occur in one interval is independent of the
number of successes that occur in any other interval.
2. The probability that a success will occur in an interval is the same for all
intervals of equal size and is proportional to the size of the interval.
3. x is the count of the number of successes that occur in a given interval of
time or other measurement and may take on any value from 0 to infinity.
4. If x is a Poisson random variable, the probability distribution of x is given by

$$P(x) = \frac{\lambda^{x} \cdot e^{-\lambda}}{x!}$$


Where:
x = 0, 1, 2, ..., ∞
λ (pronounced lambda) = average number of successes in the given unit of
measurement
e = the base of natural logarithms (use the e^x key on the calculator)

Example 9.3

If a company receives an average of three calls per five-minute period of the
working day, what is the probability that
1. no calls will be received during a randomly selected five minutes?
λ = 3 per 5 minutes

$$P(x = 0) = \frac{3^{0} \cdot e^{-3}}{0!} = 0.0498$$

2. five calls will be received during the next 10 minutes?
λ = 3 per 5 minutes, therefore λ = 6 for 10 minutes

$$P(x = 5) = \frac{6^{5} \cdot e^{-6}}{5!} = 0.1606$$

3. at least two calls will be received during the next 2.5 minutes?
λ = 3 per 5 minutes, therefore λ = 1.5 for 2.5 minutes

$$P(x = 0) = \frac{1.5^{0} \cdot e^{-1.5}}{0!} = 0.2231$$

$$P(x = 1) = \frac{1.5^{1} \cdot e^{-1.5}}{1!} = 0.3347$$

P(x ≥ 2) = P(x = 2 or 3 or 4 or ...)
= 1 − P(x = 0 or 1)
= 1 − (0.2231 + 0.3347)
= 0.4422

Note: There is no n value available and therefore no upper limit to the x counts.
If you are required to find a probability involving > or ≥, you have to apply the
complement rule.
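
The three parts of Example 9.3 can be checked with a short Python sketch. The function name `poisson_probability` is illustrative; the key point is that λ is rescaled to the length of the interval before it is substituted into the formula.

```python
from math import exp, factorial


def poisson_probability(x, lam):
    """P(x) = lam**x * e**(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)


rate_per_5_min = 3  # average of three calls per five-minute period

# 1. No calls in five minutes: lambda = 3
p_no_calls = poisson_probability(0, rate_per_5_min)

# 2. Five calls in 10 minutes: lambda = 3 * 2 = 6
p_five_calls = poisson_probability(5, rate_per_5_min * 2)

# 3. At least two calls in 2.5 minutes: lambda = 3 * 0.5 = 1.5.
#    There is no upper limit on x, so the complement rule is used.
lam = rate_per_5_min * 0.5
p_at_least_two = 1 - (poisson_probability(0, lam) + poisson_probability(1, lam))

print(round(p_no_calls, 4), round(p_five_calls, 4), round(p_at_least_two, 4))
```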


Activity 9.2

A tollgate operator has observed that cars arrive randomly at an average rate of
360 cars per hour. Calculate the probability that
1. only two cars will arrive during a specified one-minute period
2. at least three cars will arrive during a specified two-minute period.

9.2 Probability distributions for continuous random variables

A continuous distribution is a distribution in which the x variable may assume
any value within a given range or interval. The most widely used continuous
probability distribution is the normal distribution.

9.2.1 The normal distribution


Many variables follow what we call a normal distribution. That is, the majority
of the occurrences are in the middle of the distribution, with relatively few at
each end. For example, there are relatively few tall people and relatively few
short people, but many people of average height in the middle. That means the
chance (or probability) that a person will be of average height, or near the
middle of the distribution, is much greater than the chance that any one person
will be either very tall or very short.
Events that occur near the middle of the normal curve have a higher probability
of occurring than those in the extremes, or tails.
The normal distribution plays a very important role in statistical inference.

Characteristics
• The graph for a continuous random x variable is a smooth curve.
• The curve is unimodal (single mode).

[Figure: normal curve with an area of 0.5 on each side of the mean]


• The distribution is symmetrical around its mean and bell-shaped in appearance.
• The left- and right-hand tails of the distribution extend indefinitely.
• The x axis represents all the possible values of the x variable, of which there
are infinitely many.
• Two parameters describe the normal distribution: the mean (µ) describes
where the distribution is centred and the standard deviation (σ) describes how
much the curve spreads out around the centre.
• Probabilities for continuous variables correspond to areas under the normal
curve.
• The total area under the normal curve is equal to 1 or 100% (the sum of all
the probabilities of a probability distribution = 1). This means that the areas
to the left and the right of the mean each comprise 50% of the total area.

Calculating probabilities under the normal curve


The normal distribution with a mean (µ) = 0 and a standard deviation (σ) = 1
is called the standard normal distribution.

Steps
1. Any normal random variable can be converted to a standard normal random
variable by calculating the corresponding z score. The z score expresses the
difference between the value of interest (x) and the mean (µ) in units of the
standard deviation (σ).

$$z = \frac{x - \mu}{\sigma}$$

2. This z score shows the number of standard deviations that a specific value lies
to the right or left of the mean. Any x value smaller than the mean will have
a negative z score and any x value greater than the mean will have a positive z
score. A negative z score only indicates that the area is to the left of the mean.
3. The x axis of the graph becomes the z axis with z = 0 in the middle.
4. The z score is used to find the area under the standard normal curve in
published tables.
5. Because the distribution is symmetrical, the normal table deals only with
positive z scores. The negative z scores will give exactly the same areas as the
positive ones; therefore we use the absolute value of the z score.
Note: Absolute value means that you ignore the sign of the value.
6. The area is always positive because all probabilities are positive.
7. Within the same number of standard deviations (z) from the mean all
normal distributions will have the same area or probability.
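
A minimal sketch of steps 1 to 7 in Python follows. Instead of a published table, the area between the mean and z is computed with the error function from the standard library, which gives the same values to four decimals. The function names and the example values (mean 50, standard deviation 10) are hypothetical.

```python
from math import erf, sqrt


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def area_mean_to_z(z):
    """Area under the standard normal curve between 0 and |z|.

    The sign is ignored because the curve is symmetrical, which mirrors
    the use of the absolute value of z with the table.
    """
    return 0.5 * erf(abs(z) / sqrt(2))


# Hypothetical normal variable with mean 50 and standard deviation 10.
z_right = z_score(52.3, 50, 10)   # positive z: x lies to the right of the mean
z_left = z_score(47.7, 50, 10)    # negative z: x lies to the left of the mean

print(round(z_right, 2), round(area_mean_to_z(z_right), 4))
print(round(z_left, 2), round(area_mean_to_z(z_left), 4))
```

Both x values lie 0.23 standard deviations from the mean, so they give the same area, as step 5 describes.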


How to read an area from the Normal table using a z score

Steps
1. The first column in the table gives the z score to the first decimal place, and
the top row gives the second decimal for a z score.
2. For example, to find the area of a z score of 0.23:
a) find 0.2 in the first column
b) go across with this row up to the column headed with the second decimal
of 0.03
c) where the corresponding row and column intersect, the area is 0.0910
d) this is the area between a z score of 0 and 0.23 and is denoted as
P(0 ≤ z ≤ 0.23)
e) a z score of 1.02 or P(0 ≤ z ≤ 1.02) will correspond to an area of 0.3461.
3. This table always gives the area between the mean and the required z score.

Sample of the standard Normal table from Appendix 1

z      0.00      0.01      0.02      0.03
0.0    0.0000    0.0040    0.0080    0.0120
0.1    0.0398    0.0438    0.0478    0.0517
0.2    0.0793    0.0832    0.0871    0.0910
0.3    0.1179    0.1217    0.1255    0.1293
...
1.0    0.3413    0.3438    0.3461    0.3485

9.2.2 Different areas under the normal curve


Steps
1. Draw the normal curve and indicate the mean in the middle.
2. Find the value of the x variable on the x axis and shade the desired area.
3. Calculate the z score using the z formula.
4. Use the standard normal table and the absolute value of the calculated z
score to find corresponding areas and probabilities.


Area between μ and any x-value


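[Figure: three sketches of the normal curve: (1) shaded area between µ and an x
value to the right of µ; (2) shaded area between µ and an x value to the left of
µ; (3) shaded area between x1 to the left of µ and x2 to the right of µ]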

Steps
1. Sketch (1): Calculate the z score and look up the value in the Normal table.
2. Sketch (2): Calculate the z score. The answer for z is negative; therefore use
the absolute value of z and look up the value in the Normal table.
3. Sketch (3): If the area to be determined falls on both sides of the mean:
• calculate the z score for the area between µ and the x value to the
right of µ
• determine the z score for the area between µ and the x value to the left
of µ
• look up the areas for the two z scores from the Normal table
• add the two areas.

Area between two x values on the same side of the mean


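[Figure: two sketches of the normal curve: (1) shaded area between x1 and x2,
both to the left of µ; (2) shaded area between x2 and x1, both to the right of µ]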

Steps
1. Calculate the z score for the area between µ and the larger x value.
2. Calculate the z score for the area between µ and the smaller x value.
3. Look up the two z scores in the Normal table to obtain the two areas.
4. Subtract the smaller area from the larger area.


Area in any tail


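[Figure: two sketches of the normal curve: (1) shaded tail to the left of x1,
which lies to the left of µ; (2) shaded tail to the right of x2, which lies to
the right of µ]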

Steps
1. Calculate the z score for the area between µ and the x value.
2. Use the absolute z score to look up the area in the Normal table.
3. Subtract the area from 0.5 (the area from the mean to the end of the
distribution is 0.5).

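Area to the right of any x-value, where x is less than the mean, and the area to
the left of any x-value, where x is greater than the mean

[Figure: two sketches of the normal curve, each with an area of 0.5 on one side
of µ: (1) shaded area to the right of x1, which lies to the left of µ; (2) shaded
area to the left of x2, which lies to the right of µ]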

Steps
1. Calculate the z score for the area between µ and the x value.
2. Look up the z score to get the area.
3. Add 0.5 to the area.
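
The four cases above (adding two areas, subtracting two areas, subtracting from 0.5 and adding 0.5) can be sketched as follows. The helper `table_area` stands in for the Normal table, and all function names and the example values (mean 50, standard deviation 10) are hypothetical.

```python
from math import erf, sqrt


def table_area(z):
    """Area between the mean and |z|, as listed in the Normal table."""
    return 0.5 * erf(abs(z) / sqrt(2))


def area_between(x1, x2, mean, std_dev):
    """P(x1 <= x <= x2) for x1 < x2."""
    z1 = (x1 - mean) / std_dev
    z2 = (x2 - mean) / std_dev
    a1, a2 = table_area(z1), table_area(z2)
    if z1 < 0 < z2:
        return a1 + a2      # x values on both sides of the mean: add
    return abs(a1 - a2)     # same side of the mean: larger minus smaller


def area_left_of(x, mean, std_dev):
    """P(x' <= x): a tail if x is below the mean, more than half otherwise."""
    z = (x - mean) / std_dev
    area = table_area(z)
    return 0.5 - area if z < 0 else 0.5 + area


def area_right_of(x, mean, std_dev):
    """P(x' >= x), using the fact that the total area is 1."""
    return 1 - area_left_of(x, mean, std_dev)


print(round(area_between(40, 60, 50, 10), 4))   # both sides of the mean
print(round(area_between(60, 70, 50, 10), 4))   # same side of the mean
print(round(area_left_of(30, 50, 10), 4))       # left-hand tail
print(round(area_right_of(40, 50, 10), 4))      # 0.5 plus the table area
```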

Find an unknown x value

Steps
If a probability or area is given and you are required to determine an unknown
x value:
1. Calculate the area between µ and the unknown x value.


2. If the known area falls in the tail of the curve, subtract the tail area from
0.5.
3. Compare this area with the areas in the body of the Normal table.
4. If the exact area is not listed, use the closest value.
5. Read the z score from the first column and top row.
6. Use this z score in the z formula to obtain the unknown x value.

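Note: If the area falls to the left of µ, z is negative, and if the area falls to
the right of µ, z is positive.

[Figure: two sketches of the normal curve, each with an area of 0.5 on one side
of µ: (1) an unknown x value to the left of µ; (2) an unknown x value to the
right of µ]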

For example, if the area between µ and the unknown x value is 0.08, find the
closest area to 0.08 in the table (0.0793) and read the z score from the first
column and top row: 0.2 from the first column and 0.00 from the top row.
Therefore z = 0.20.

Sample of the standard normal table

z      0.00      0.01      0.02      0.03
0.0    0.0000    0.0040    0.0080    0.0120
0.1    0.0398    0.0438    0.0478    0.0517
0.2    0.0793    0.0832    0.0871    0.0910
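
The reverse lookup can also be sketched in Python. Here `statistics.NormalDist().inv_cdf` (Python 3.8 or later) replaces the search through the body of the table, so the z score is not rounded to the nearest listed value. The function names and the mean of 50 and standard deviation of 10 are hypothetical.

```python
from statistics import NormalDist


def z_for_area_from_mean(area, right_of_mean=True):
    """z score such that the area between the mean and z equals `area`."""
    # inv_cdf works with the area to the LEFT of z, so add the 0.5 below the mean.
    z = NormalDist().inv_cdf(0.5 + area)
    return z if right_of_mean else -z


def x_from_z(z, mean, std_dev):
    """Rearranged z formula: x = mean + z * std_dev."""
    return mean + z * std_dev


# Area of 0.08 between the mean and the unknown x value, to the right of the mean.
z = z_for_area_from_mean(0.08)
print(round(z, 2))                    # close to the table value of 0.20
print(round(x_from_z(z, 50, 10), 1))  # unknown x for the hypothetical mean and SD
```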

Example 9.4

The time it takes a randomly selected employee to perform a task is normally
distributed with a mean value of 120 sec and a standard deviation of 20 sec.
1. The probability that a randomly selected employee will complete the task
between 100 and 130 sec:


$$z = \frac{100 - 120}{20} = -1.00$$   Area: 0.3413

$$z = \frac{130 - 120}{20} = 0.50$$   Area: 0.1915

P(100 ≤ x ≤ 130) = 0.3413 + 0.1915 = 0.5328

2. The probability that a randomly selected employee will complete the task
between 75 and 100 sec:

$$z = \frac{75 - 120}{20} = -2.25$$   Area: 0.4878

$$z = \frac{100 - 120}{20} = -1.00$$   Area: 0.3413

Both x values lie to the left of the mean, so the smaller area is subtracted
from the larger area:
P(75 ≤ x ≤ 100) = 0.4878 − 0.3413 = 0.1465

3. The probability that a randomly selected employee will complete the task
within 75 sec:
P(x ≤ 75) = 0.50 − 0.4878 = 0.0122

4. The probability that a randomly selected employee will complete the task in
more than 75 sec:
P(x > 75) = 0.50 + 0.4878 = 0.9878


5. The 10% of the employees who complete the task within the shortest time
are to be given advanced training. What task times qualify individuals for
such training?
• The fastest 10% will fall in the left-hand tail of the distribution because
those times will be the shortest.
• The area between the mean and the 10% in the tail is: 0.5 − 0.1 = 0.4
• Look up an area of 0.40 in the Normal table and obtain the z value. The
area you are interested in falls on the left of µ, resulting in a negative z
score.

$$z = \frac{x - \mu}{\sigma} \quad\Rightarrow\quad -1.28 = \frac{x - 120}{20} \quad\Rightarrow\quad x = 94.4$$

This means that the employees who complete the task in 94.4 seconds or less
will qualify for advanced training.

[Figure: normal curve with a mean of 120; the left-hand tail below z = −1.28 has
an area of 0.1 and the area between z = −1.28 and the mean is 0.5 − 0.1 = 0.4]
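
The answers to Example 9.4 can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch for checking purposes, not the table method of the text; because it does not round z scores or areas to table precision, the results may differ from the table-based answers in the fourth decimal.

```python
from statistics import NormalDist

# Task time: mean 120 sec, standard deviation 20 sec (Example 9.4).
task_time = NormalDist(mu=120, sigma=20)

# cdf(x) is the area to the left of x.
p_100_to_130 = task_time.cdf(130) - task_time.cdf(100)   # part 1
p_75_to_100 = task_time.cdf(100) - task_time.cdf(75)     # part 2
p_within_75 = task_time.cdf(75)                          # part 3
p_more_than_75 = 1 - task_time.cdf(75)                   # part 4

# Part 5: the task time below which the fastest 10% of employees fall.
cutoff_fastest_10 = task_time.inv_cdf(0.10)

print(round(p_100_to_130, 4), round(p_75_to_100, 4))
print(round(p_within_75, 4), round(p_more_than_75, 4))
print(round(cutoff_fastest_10, 1))
```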

Activity 9.3

The lifetimes of a certain kind of battery have a mean of 300 hours and a
standard deviation of 35 hours. Assume that the lifetimes, measured to the
nearest hour, follow a normal distribution, and determine
1. the percentage of batteries that have a lifetime of more than 320 hours
2. the value above which the best 30% of the batteries lie
3. the proportion of batteries that have a lifetime from 250 to 350 hours
4. the proportion of batteries with a lifetime between 250 and 280 hours
5. the maximum lifetime below which the weakest 20% of the batteries will fall
6. the minimum lifetime above which the 15% of the batteries with the longest
life will fall.


TEST YOURSELF 9
1. A textile firm has found from experience that only 20% of the people applying
for a certain stitching-machine job are qualified for the work.
a) Construct the probability distribution for this investigation if five persons
are interviewed to find qualified persons.
b) What is the probability that at least two are qualified for the job?
2. Testing blood for HIV, the virus that causes AIDS, gives a positive result with
probability of about 0.004 when a person who is free of HIV antibodies is
tested. A clinic tests three people who are all free of HIV antibodies.
a) Construct the probability distribution for this investigation.
b) What is the probability that you will get one false-positive result?
c) What is the probability that you will get more than one false-positive
result?
3. You read that one out of four eggs contains salmonella bacteria. If you use
six eggs in your chocolate cake, what is the probability that:
a) one of the eggs contains salmonella?
b) at most, two of the eggs contain salmonella?
4. If 40% of all patients have medical aid, what is the probability that, in a
sample of 10 patients:
a) exactly four will have medical aid?
b) at least four will have medical aid?
c) at most, four will have medical aid?
5. Shortly after being put into service, some buses of a certain type develop
cracks on the underside of the mainframe. A particular city has 20 buses of
this type, eight of which have cracks. If five buses are randomly selected for
inspection, determine the probability of finding:
a) exactly two buses with cracks
b) at most, two buses with cracks.
6. About 15% of the population is left-handed. Fifteen individuals are randomly
selected. What is the probability that:
a) three or fewer are left-handed?
b) one or more are right-handed?
7. According to the National Environmental Programme, air pollution
standards for particulate matter are exceeded an average of 5.6 days in
every three-week period. What is the probability that the standard is:
a) not exceeded on any day during a three-week period?
b) exceeded two days or more of a two-week period?


8. SA Flawless Steel Co produces stainless steel plating that has, on average,
one defect every 10 m². What is the probability that:
a) 1 m² of stainless steel plating will have no defects?
b) 2 m² of stainless steel plating will have exactly one defect?
c) 5 m² of stainless steel plating will have two or more defects?
9. In a local bakery the manager monitored customer arrivals for Saturdays.
She estimated the average number of customer arrivals per 10-minute
period to be 6.2. What is the probability of:
a) 10 customers entering during a half-hour interval?
b) six customers entering during a five-minute interval?
c) 15 customers entering during an hour?
d) at least one customer entering during a 10-minute interval?
10. A welding machine breaks down occasionally as a result of a particular
part that wears out, and these breakdowns occur on average four times per
eight-hour day. Find the probability that:
a) no breakdown will occur during a given day
b) at most, two breakdowns will occur during the first hour of the day.
11. The photocopier repair department of Papermate receives an average of two
calls for service per hour. What is the probability of:
a) receiving no service calls per hour?
b) receiving exactly two service calls in two hours?
c) receiving more than three service calls in one and a half hours?
d) receiving no service calls in the next half-hour?
12. An environmental study has shown that the daily average noise level on a
busy street follows a normal distribution with a mean of 37 decibels and
standard deviation of six decibels.
a) What is the probability that the noise level exceeds 46 decibels?
b) What decibel range contains the middle 95% of the distribution?
c) What is the probability that the noise level will be between 20 and 30
decibels?
13. Accurate labelling of packaged meat is difficult because of mass decrease
owing to moisture loss. Suppose that moisture loss for a package of chicken
breasts is normally distributed with a mean value of 4% and standard
deviation of 1%.
a) What is the probability that moisture loss is between 3% and 5%?
b) What is the probability that moisture loss is at most 4%?
c) What is the probability that moisture loss is at least 7%?
d) 90% of all packages have moisture losses below what number?


14. Manufactured items are sold in boxes that are stated to contain a mass of at
least 40 kg. The actual mass in a box varies with a mean of 41.2 kg and a
standard deviation of 0.8 kg.
a) Calculate the proportion of boxes whose mass is between 40  kg and
42 kg.
b) Calculate the mass below which 20% of the lightest boxes fall.
c) All boxes containing less than 40 kg are scrapped at a cost of R100 per
box. Calculate the scrapping cost associated with the packing of 50
boxes.
d) To what mean mass should the box contents be adjusted, with the
standard deviation unchanged, if only 1% of the boxes are to be
scrapped?
15. The production foreman of the Oros Fruit Company estimates that the
average sales of oranges are 4 700 and the standard deviation, 500 oranges.
Calculate the probability that sales will be:
a) more than 5 500 oranges
b) more than 4 500 oranges
c) less than 4 900 oranges
d) between 4 500 and 4 900 oranges
e) between 4 900 and 5 500 oranges.
16. Birth masses are normally distributed with a mean of 3 579 g and a standard
deviation of 500 g. What is the cut-off point for the lightest 2% of babies?
17. The Faber Co produces a pencil called Ultra-Light. Sales follow a normal
distribution with a mean of 457  000 pencils each year. Furthermore, 90%
of the time sales have been between 460 000 and 454 000 pencils. Estimate
the standard deviation of this distribution.
18. The average number of calories in a 50  g chocolate bar is 225. If the
distribution of calories is approximately normal, with a standard deviation
of 10, find the probability that a randomly selected chocolate bar will have
between 200 and 220 calories.
19. The thickness of bolts (mm) manufactured by a certain process follows a
normal distribution with a mean of 10  mm and a standard deviation of
1 mm.
a) What proportion of the bolts in the long run are at most 11 mm?
b) What proportion of the bolts will have thickness values between 7.5 mm
and 12.5 mm?
c) What proportion of bolts will have thicknesses that exceed 11.5 mm?


20. The amount of distilled water dispensed by a certain machine has a normal
distribution with a mean of 64 litres and a standard deviation of 0.78 litres.
What container size will ensure that overflow occurs only 0.5% of the time?


UNIT 10
Statistical inference: estimation

The objective of most statistical studies is inference. Inferential methods
presented in the following two units use information contained in a sample to
reach conclusions about the characteristics of the population from which the
sample was drawn. There are two general procedures for making inferences
about populations: estimation and hypothesis testing.

After completion of this unit you will be able to:


• describe the purpose of sampling distributions
• find the confidence interval for the mean
• find the confidence interval for the proportion
• determine the size of the sample required in order to make estimates to a specified
degree of accuracy.

10.1  Statistics and parameters


Calculations based on a sample are known as sample statistics. The values
calculated from population information are referred to as population
parameters. Sample statistics are used to estimate population parameters.
To distinguish between the two, Greek letters will be used to refer to population
parameters and Roman letters will refer to sample statistics.
                      Statistic    Parameter
Mean                  x̄            µ
Standard deviation    s            σ
Proportion            p            π
Size                  n            N


10.2  Sampling distribution of the means


A sampling distribution is a distribution of a statistic (such as the mean) of all
the possible samples of the same size selected from the population.
The sampling distribution theorem forms the foundation of inferential
statistics. The theorem states that if you draw all the possible different samples
of the same size (n) from a population, you can calculate the mean of each
sample and these means will most probably differ. If you list all these possible
samples together with their means, it is known as the sampling distribution of
the means.
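
To make the idea concrete, the sketch below lists every possible sample of size 2 from a small hypothetical population of five values and computes each sample mean. It assumes sampling without replacement, which is one reading of "all the possible different samples"; the population values are invented for illustration.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]   # hypothetical population, N = 5
n = 2                           # sample size

# All possible different samples of size n (assumption: without replacement).
samples = list(combinations(population, n))
sample_means = [mean(sample) for sample in samples]

# The sampling distribution of the means: each sample with its mean.
for sample, sample_mean in zip(samples, sample_means):
    print(sample, sample_mean)

# The sample means differ from one another, but their average equals the
# population mean.
print(mean(sample_means), mean(population))
```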
Properties of the sampling distribution of the means
• If you calculate the mean of all the different sample means, it is equal to the
population mean:

$$\mu_{\bar{x}} = \mu$$

• The differences between the means are known as the variability in the
sampling distribution of the means and can be measured by the standard
error of the mean:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

The larger the sample size, the smaller the standard error of the mean and the
better the estimate of the population mean because of the lesser dispersion.
• In most cases the population standard deviation (σ) is unknown and the
sample standard deviation (s) is used as an approximation to the population
standard deviation (σ).
The standard error of the mean then becomes:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

• If the population is normally distributed with a known population σ, the
sampling distribution of the mean is also normal, regardless of the sample size.
• The central limit theorem states that if the population from which the sample
is drawn is not normal, the distribution of the sample means will become
more and more normal as the sample size increases. A sample of n ≥ 30 is
considered by most as being large enough to assume a normal distribution.
• In a population that is normal or close to normal with an unknown population
σ, the distribution of sample means is referred to as the Student's t distribution.
The sample standard deviation (s) is used in the calculation of the standard
error of estimate.

The t distributions are more dispersed than the normal distribution and
are distinguished by a positive whole number, called degrees of freedom:
df = n − 1


Degrees of freedom (df) are defined as 1 less than the sample size (n – 1) and
represent the number of observations that are ‘free to vary’ around the mean
of the sample.

There is a different t distribution for each sample size. But as the sample
size gets larger, the t distribution becomes more and more normal. Once the
number of degrees of freedom reaches 30, the t distribution is so close to the
normal distribution that we can use the normal distribution to approximate
the t. That means that the t distribution will only be used for samples with
sizes of less than 30.

• Sometimes the parameter of interest is not the arithmetic mean, but a
proportion (π). The sample proportion (p) is used to estimate the population
proportion (π). Sample proportions will vary from sample to sample from a
given population in the same way that sample means vary.

The sampling distribution of proportions is assumed to be normal if np ≥ 5 and
n(1 − p) ≥ 5. The mean is:

$$\mu_{p} = \pi$$

The standard error is:

$$\sigma_{p} = \sqrt{\frac{\pi(1 - \pi)}{n}}$$

If π is unknown, p can be substituted for π:

$$s_{p} = \sqrt{\frac{p(1 - p)}{n}}$$

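The standard error formulas for the mean and for the proportion can be sketched as small Python functions. The function names and all the numbers used (a standard deviation of 12, sample sizes of 36 and 144, a sample proportion of 0.4) are hypothetical and serve only to show the effect of the sample size.

```python
from math import sqrt


def standard_error_mean(std_dev, n):
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    return std_dev / sqrt(n)


def standard_error_proportion(p, n):
    """Standard error of the proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


def proportion_is_approximately_normal(p, n):
    """Rule of thumb from the text: np >= 5 and n(1 - p) >= 5."""
    return n * p >= 5 and n * (1 - p) >= 5


# A larger sample gives a smaller standard error.
print(standard_error_mean(12, 36))    # n = 36
print(standard_error_mean(12, 144))   # n = 144

print(round(standard_error_proportion(0.4, 100), 4))
print(proportion_is_approximately_normal(0.4, 100))
print(proportion_is_approximately_normal(0.02, 100))
```

Quadrupling the sample size only halves the standard error, because n enters the formula under a square root.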
10.3 Estimating population parameters

10.3.1 Point estimation


A point estimate of a population parameter is done by making use of a single
value of a statistic that is based on sample data.
• The best estimator for the population mean µ is the sample mean x̄.
• The best estimator for the population standard deviation σ is the sample
standard deviation s.
• The sample proportion p is the best estimator for the population proportion π.
• The discrepancy between a sample statistic and its population parameter
is called sampling error. Measuring sampling error forms a large part of
inferential statistics.
These point estimates will almost never provide the exact value of the
population parameter, because the sample mean (x̄) depends on which sample,
out of all the possible samples, was selected. Each possible sample mean (x̄) will
result in a different estimate for the population parameter.
Because of this sampling variability, a point estimate is much more useful if
it is accompanied by a probability to measure how close the sample mean is to
the population mean. This can be done by making use of a confidence interval
estimate. A confidence interval estimate is a range of values within which the
population parameter probably lies.

Example 10.1

The average number of hours per week spent on the internet was calculated for
a sample of 20 students as x̄ = 7.
The point estimate for the population mean will be: µ = 7 hours per week.

10.3.2 Confidence interval estimates


A confidence interval is an estimate of the range of values within which the
population parameter may, with some confidence, be expected to lie. This means
that the value of the parameter will be captured between a lower boundary and
an upper boundary at a chosen level of success.
The probability that the interval estimate contains the population parameter
is known as the confidence level (1 − α). A high probability, say 99%, means
more confidence and hence a wider interval. However, if the level of confidence
is too high, the confidence interval will be too wide and the estimate may have
little real value. If the confidence interval is too narrow, on the other hand, the
estimate is associated with such low confidence that its value is questioned.

Components of a confidence interval


The confidence interval formula used to estimate the population mean (µ)
associated with the normal z distribution:


[Figure: standard normal curve with a central area of (1 − α) between −z and +z
and an area of α/2 in each tail]

$$\mu = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$$

• x̄ is the point estimator for µ.
• z is the critical value associated with the chosen level of confidence.
• σ/√n is the standard error of estimate.
• z · σ/√n is known as the margin of error (E).
• The lower boundary of the interval is x̄ − z · σ/√n.
• The upper boundary of the interval is x̄ + z · σ/√n.

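The components listed above can be put together in a short Python sketch. The critical value z is obtained from `statistics.NormalDist().inv_cdf` (Python 3.8 or later) rather than from a table. The function name and the input values (a sample mean of 7, a known population standard deviation of 2 and a sample of 64) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, std_dev, n, confidence=0.95):
    """Confidence interval for the population mean, based on the z distribution."""
    alpha = 1 - confidence
    # Critical value: the central area is (1 - alpha), leaving alpha / 2 per tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = std_dev / sqrt(n)
    margin_of_error = z * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error


lower, upper = z_confidence_interval(sample_mean=7, std_dev=2, n=64)
print(round(lower, 2), round(upper, 2))

# A higher confidence level gives a wider interval.
lower_99, upper_99 = z_confidence_interval(sample_mean=7, std_dev=2, n=64,
                                           confidence=0.99)
print(round(lower_99, 2), round(upper_99, 2))
```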
Calculating a confidence interval

Steps
1. Collect a sample of an adequate size (n).
2. Compute the sample mean (x̄) and standard deviation (s) or proportion (p).
3. Determine the type of sampling distribution:
a) normal (z) if population is normally distributed with known σ
b) normal (z) via the central limit theorem with (n ≥ 30) and known σ
c) normal (z) via the central limit theorem (n ≥ 30) if σ is unknown
d) Student t distribution for n < 30 and unknown σ
e) normal (z) if dealing with proportions with np ≥ 5 and n(1 − p) ≥ 5.
4. Identify the level of confidence (1 − α): a 95% level implies that if 100
different confidence intervals are constructed, each based on a different
sample from the same population, we expect 95 of the intervals to include
the parameter and five not to include the parameter. We are capturing the
middle 95% between the two critical values and 2.5% in each tail.
5. Find the critical value z or t that corresponds to the level of confidence by
making use of the appropriate table (normal z table or student t table) and
the level of confidence. A critical value is the cut-off point between the
sample statistics that are likely to occur and those that are unlikely to occur.

[Figure: standard normal curve. 95% of all sample means lie between z-scores of −1.96 and +1.96; 2.5% of all sample means lie in each tail.]

For a confidence level of 95%: the standard normal table is used to determine
a value z such that a central area of 0.95 falls between −z = −1.96 and +z
= +1.96.
Identify the critical z value for a 95% confidence level:
• 95% or 0.95 covers the middle area of the curve. Do not look up 0.95
in the body of the Normal table because the Normal table contains
probabilities only for one half of the normal curve. Divide 0.95 by 2 to
obtain the area to the left or right of the mean (0.95 ÷ 2 = 0.475).
• Look up 0.475 or the closest to this area in the body of the table to find
the corresponding z as ±1.96 (± because the area to the left of the mean
will result in a −z and, to the right of the mean, a +z).
6. Find the margin of error (E) which is the critical value multiplied by the
standard error of estimate:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \sqrt{\frac{p(1-p)}{n}}$$

7. Calculate the upper and lower confidence limits by making use of the
appropriate formula:
• $\mu = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$ or $\mu = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$
• $\mu = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$
• $\pi = p \pm z \sqrt{\frac{p(1-p)}{n}}$

8. Briefly state the meaning of your confidence interval: a confidence interval
indicates that, if we obtain many samples of size n from the population
whose mean (μ) or proportion (π) is unknown, then approximately
(1 − α) · 100% of the intervals will contain the population mean or proportion.
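
The table lookup in step 5 can be mirrored in a few lines of Python. This is only a sketch: it relies on `statistics.NormalDist` from the standard library instead of the printed Normal table, and the helper name `z_critical` is an assumption made for illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical z for a confidence level given as a fraction."""
    # The printed table lists the area between the mean and z, so the
    # lookup area is confidence / 2 (0.475 for 95%). inv_cdf expects the
    # cumulative area from the far left, hence the added 0.5.
    return NormalDist().inv_cdf(0.5 + confidence / 2)


print(round(z_critical(0.95), 2))  # 1.96
```

Computed values may differ from table values in the third decimal because the table is read at the closest listed area.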


Activity 10.1

Identify the critical z values associated with confidence levels of 90%, 98% and
99%.

10.3.3 Confidence interval estimate for the population mean (μ) for
data obtained from a population that is normally distributed
or from large samples (n ≥ 30)
The central limit theorem states that if n is large (n ≥ 30), the sampling
distribution of the mean will be approximately normal. It does not matter
whether σ is known or unknown or if the distribution is normal or not. If σ is
unknown, substitute σ with the sample standard deviation s.

Example 10.2

The Pappi Paper Company wanted to estimate the average time required for a
new machine to produce a ream of paper. A sample of 36 reams required an
average production time of 1.5 minutes for each ream. The population standard
deviation was 0.30 min and the confidence level was 95%.
1. The population distribution is not known to be normal, but via the central
limit theorem we assume a normal distribution.
2. To obtain the z value from the Normal table, divide the confidence level by
2: (0.95 ÷ 2 = 0.475) and look up the area in the body of the Normal table.
An area of 0.475 corresponds to z = ±1.96.
3. The population standard deviation is given (σ = 0.30), so use it directly.
If σ were unknown, the sample standard deviation (s) would be used to estimate it.
4. Use the normal distribution formula to calculate the interval boundaries.

$$\mu = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$$
$$\mu = 1.5 \pm 1.96 \cdot \frac{0.3}{\sqrt{36}} = 1.5 \pm 0.098$$

1.5 − 0.098 = 1.402 and 1.5 + 0.098 = 1.598
1.402 ≤ μ ≤ 1.598
5. Based on the sample data, the Pappi Paper Company can be 95% confident
that the average time required for a new machine to produce a ream of
paper lies between 1.402 and 1.598 minutes.
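
The arithmetic of Example 10.2 can be checked with a short sketch. The function name `mean_confidence_interval` is an illustrative assumption; the inputs are the example's own values and the critical value is the table value 1.96.

```python
from math import sqrt


def mean_confidence_interval(x_bar, sd, n, critical_value):
    """Return (lower, upper, margin) for x_bar ± critical_value · sd/√n."""
    margin = critical_value * sd / sqrt(n)
    return x_bar - margin, x_bar + margin, margin


# Values from Example 10.2; z = 1.96 is the table value for 95% confidence.
lower, upper, margin = mean_confidence_interval(1.5, 0.30, 36, 1.96)
print(f"E = {margin:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

Passing the critical value in as an argument keeps the helper usable for either a z or a t table value.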


Activity 10.2

In 36 randomly selected seawater samples, the mean sodium chloride
concentration was 23 cm³ per m³ and the standard deviation was 6.7 cm³ per
m³. Construct a 98% confidence interval estimate for the mean sodium chloride
concentration.

10.3.4 Confidence interval estimate for the population mean using
small samples (n < 30) with σ unknown: t distribution

Example 10.3

The number of home fires that were started by candles in low-cost housing areas
was recorded for a sample of seven years. The mean number of fires was 7 046
with a standard deviation of 1 605. Calculate the 99% confidence interval for
the average number of home fires started by candles.
• Find the critical values for 99% confidence and n = 7 using the t table from
Appendix 2:
df = n − 1 = 7 − 1 = 6

Use the t table to look up the critical values. The top rows of the table indicate a
one-tail test or a two-tail test at a specified significance level (α). All confidence
interval estimates are two-tail tests. If the level of confidence is 0.99, the α value
is: (1 − 0.99) = 0.01. Choose the 0.01 column under the two-tail row and go
down in this column to where it corresponds with the desired degrees of freedom
(df), which is 6. The df column is the first column in the table. This t table value
is 3.707.

$$\mu = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$$
$$\mu = 7\,046 \pm 3.707 \cdot \frac{1\,605}{\sqrt{7}} = 7\,046 \pm 2\,248.8$$

4 797.2 ≤ μ ≤ 9 294.8

• Based on the sample data, we can be 99% confident that the mean number of
home fires per year will be between 4 797 and 9 295.
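
As a check on Example 10.3, the sketch below repeats the calculation with the t table value supplied by hand. The standard library has no t distribution, so the critical value 3.707 is taken from the example and not computed; the variable names are illustrative.

```python
from math import sqrt

# Two-tail critical value read from the t table, as in Example 10.3:
# alpha = 0.01 and df = n - 1 = 6 give t = 3.707.
n, x_bar, s, t_crit = 7, 7046, 1605, 3.707

df = n - 1
margin = t_crit * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"df = {df}, E = {margin:.1f}, interval = ({lower:.1f}, {upper:.1f})")
```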


Activity 10.3

The time taken to complete the same task (in minutes) was recorded for nine
participants in a training exercise as follows:
8 7 8 9 7 7 9 10 9

Construct a 95% confidence interval for the average time taken to complete the
task.

10.3.5 Confidence interval estimate for the population proportion (π)

Steps
1. Identify the sample statistics n and x (or p).
2. Find the point estimate if not given: p = x/n
3. Verify that the sampling distribution of p can be approximated by the normal
distribution: np ≥ 5 and n(1 − p) ≥ 5
4. Find the critical value z that corresponds to the given level of confidence.
5. Use the formula to calculate the margin of error and the confidence interval
boundaries.

$$\pi = p \pm z \sqrt{\frac{p(1-p)}{n}}$$

Example 10.4

A survey found that, out of 200 workers, 168 said they were interrupted three
or more times an hour by phone calls. Find the 90% confidence interval of the
population proportion of workers who are interrupted three or more times an
hour.

$$p = \frac{168}{200} = 0.84$$

Use the normal (z) distribution because np = 168 > 5 and n(1 − p) = 32 > 5.

To obtain z: look up an area of (0.90 ÷ 2) = 0.45, which gives z = 1.64.

$$\pi = p \pm z \sqrt{\frac{p(1-p)}{n}}$$
$$\pi = 0.84 \pm 1.64 \sqrt{\frac{0.84(1-0.84)}{200}} = 0.84 \pm 0.04$$

0.80 ≤ π ≤ 0.88
80% ≤ π ≤ 88%

You can be 90% confident that the true proportion of workers that are interrupted
three or more times per hour is between 80% and 88%.
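
The proportion steps can be sketched as a small function that also performs the normal-approximation check. The name `proportion_confidence_interval` and the choice to raise an error when the check fails are assumptions made for illustration; the inputs are those of Example 10.4.

```python
from math import sqrt


def proportion_confidence_interval(x, n, z):
    """Return (p, lower, upper) for p ± z · sqrt(p(1 - p)/n)."""
    p = x / n
    # Normal approximation check from step 3: np >= 5 and n(1 - p) >= 5.
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("normal approximation not appropriate")
    margin = z * sqrt(p * (1 - p) / n)
    return p, p - margin, p + margin


# Values from Example 10.4; z = 1.64 is the table value used in the text.
p, lower, upper = proportion_confidence_interval(168, 200, 1.64)
print(f"p = {p:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```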

Activity 10.4

A medical researcher wished to determine the percentage of females who take
vitamins. A study of 180 females showed that 25% took vitamins. Construct a
99% confidence interval for the percentage of females who take vitamins.

10.4  Sample size (n)


An important question that needs to be answered in statistical investigation is:
how large a sample size is needed to guarantee a certain level of confidence for
a given margin of error?
Sample size varies inversely to the interval length: the larger the sample, the
shorter the interval length for a given confidence level. If the sample is too large,
the extra data collected will be a waste of money and effort because the same
results would have been obtained by a smaller sample. Similarly, if the sample is
too small, the resulting conclusions will be uncertain.
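
The inverse relationship between sample size and interval length can be seen numerically. The sketch assumes the z-based margin of error for a mean and uses hypothetical inputs (σ = 12, z = 1.96) that are not taken from the text.

```python
from math import sqrt


def margin_of_error(z, sigma, n):
    """Margin of error E = z · sigma / sqrt(n) for estimating a mean."""
    return z * sigma / sqrt(n)


# Hypothetical inputs: sigma = 12 at 95% confidence (z = 1.96).
for n in (25, 100, 400):
    print(n, round(margin_of_error(1.96, 12, n), 3))
```

Because n sits under a square root, the sample must be four times as large to halve the margin of error, which is why very large samples bring little extra precision.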
The correct sample size depends on three factors:
1. The level of confidence desired – as selected by the researcher.
2. The variability in the population being studied – if the population is widely
dispersed, a large sample is required, while a small dispersion would require
a smaller sample.
3. The maximum allowable error (E) – this is the maximum amount a point
estimate should, in the opinion of the researcher, differ above or below the
parameter being estimated, i.e. the difference between the sample mean and
the population mean.

Formula for determining the sample size when estimating the population mean:

$$n = \left(\frac{z \cdot \sigma}{E}\right)^2$$

If σ is unknown, you can estimate it using s.

Formula for determining the sample size when estimating the population
proportion:

$$n = \frac{p(1-p) \cdot z^2}{E^2}$$

Example 10.5

Consider a machine that is filling cans with tomato paste. Experience has shown
that in this process the population of fill masses is normally distributed with
a standard deviation of 0.31 g. The production supervisor wants to collect a
sample just large enough to provide a sample mean within 0.25 g of the true
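process mean at a 99% confidence level.

$$n = \left(\frac{z \cdot \sigma}{E}\right)^2 = \left(\frac{2.58 \cdot 0.31}{0.25}\right)^2 = 10.23 \approx 11$$

The result is rounded up to the next whole number so that the margin of error
is not exceeded.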
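
A minimal sketch of the sample size formula for a mean, using the values of Example 10.5. The helper name `sample_size_for_mean` is illustrative; `ceil` performs the rounding up.

```python
from math import ceil


def sample_size_for_mean(z, sigma, max_error):
    """Smallest whole n with z · sigma / sqrt(n) <= max_error."""
    return ceil((z * sigma / max_error) ** 2)


# Values from Example 10.5 (z = 2.58 for 99% confidence).
print(sample_size_for_mean(2.58, 0.31, 0.25))  # 11
```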

Activity 10.5

A cheese processing company wants to estimate the mean cholesterol content
of all 50 g servings of cheese. The estimate must be within 0.5 mg of the
population mean. Determine the minimum required sample size to construct
a 95% confidence interval for the population mean. Assume the population
standard deviation is 2.8 mg.

Example 10.6

In a sample of 100 invoices randomly selected from the debtors file, 12 were
found to be incorrect. How large a sample is necessary if we want the percentage
of incorrect invoices to be within 3% of the true population proportion at a 95%
confidence level?

$$n = \frac{p(1-p) \cdot z^2}{E^2} = \frac{0.12(1-0.12) \cdot 1.96^2}{0.03^2} = 450.75 \approx 451$$
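
The same check for the proportion formula, using the values of Example 10.6. The function name `sample_size_for_proportion` is an illustrative assumption.

```python
from math import ceil


def sample_size_for_proportion(p, z, max_error):
    """Round p(1 - p) · z² / E² up to the next whole number."""
    return ceil(p * (1 - p) * z ** 2 / max_error ** 2)


# Values from Example 10.6: p = 12/100, z = 1.96, E = 0.03.
print(sample_size_for_proportion(0.12, 1.96, 0.03))  # 451
```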

Activity 10.6

A daily newspaper is investigating the reading habits of the home delivery
customers. If a previous survey indicated that 50% of readers read the editorial
page, what sample size should be used to estimate, within 4%, the proportion of
people who read the editorial page if α = 0.10?

TEST YOURSELF 10

1. A survey of 100 customers passing through the check-out line of a
supermarket revealed a mean check-out time of 190 seconds and a standard
deviation of 60 seconds. Construct a 90% confidence interval for the true
mean check-out time.
2. The calories per 125 g serving of ice cream are recorded for a sample of
16 popular chocolate ice cream brands. The mean calories were found to
be 190 calories with a standard deviation of 40 calories. Construct a 98%
confidence interval estimate for the mean calorie content of chocolate ice
cream.
3. Noise levels at various hospitals in Gauteng are measured. The mean of the
noise level in 84 corridors was 61.2 decibels, and the standard deviation
was 7.9. Find the 99% confidence interval of the true noise level mean.
4. The Department of Health has been concerned about lead levels in South
African wines. In a previous testing of 40 wine specimens, lead levels of
600 parts per billion were recorded with a standard deviation of 50 parts
per billion. Estimate the true lead level for the wine using a 90% level of
confidence.
5. The tear strength of a particular paper product is known to be normally
distributed. If a random sample of nine rolls yielded a mean tear strength
of 225 kg/m² with a standard deviation of 15 kg/m², construct a 90%
confidence interval estimate for the average tear strength.


6. A sample of 12 households in Johannesburg showed a mean of R50
expenditure per day with a standard deviation of R20. If household
expenditure follows a normal distribution, construct a 99% confidence
interval estimate for daily household expenditure for all households in
Johannesburg.
7. Not all the town’s electricity accounts have been paid on time. The town
clerk takes a random sample of 16 from the outstanding accounts file and
finds the mean amount owed to be R230 with a standard deviation of R40.
If there are 100 outstanding accounts, find a 99% confidence interval
estimate for the mean amount in outstanding bills.
8. Fat content (in %) for 10 randomly selected hot dogs are listed below:
25.2 21.3 22.8 17.0 29.8 21.0 25.5 16.0 20.9 19.5
Construct a 90% confidence interval for the true mean fat percentage of hot
dogs.
9. For a group of 10 men subjected to a stress situation, the mean number of
heart beats per minute was 126 and the standard deviation was 4. Find the
95% confidence interval for the true mean.
10. A sample of 150 Gauteng residents found 65% to be in favour of fluoridation
of drinking water. Construct a 95% confidence interval for the true
proportion of Gauteng residents who favour fluoridation.
11. A sample of 200 manufactured items contains 40 defectives. Construct a
90% confidence interval for the true percentage of defectives.
12. You want to determine, with 98% confidence, the proportion of adults aged
20 to 29 that have high blood pressure. If a sample of 60 adults in this
age group showed 4% with high blood pressure, what will your confidence
boundaries be?
13. A cereal manufacturer has recently redesigned its product packaging. In a
random sample of 1 000 households prior to the change, the manufacturer
found that 220 were satisfied with the packaging. Estimate with a 98%
confidence the proportion of customers that were satisfied with the old
container.
14. In a survey of 80 adults it was found that 72 ate the recommended amount
of fruits and vegetables each day. Construct a 99% confidence interval for
the proportion of this population that follows these recommendations.
15. In a study to determine the proportion of adult males who have hypertension,
what sample size would be needed for the estimate to be within 3% at a
95% confidence level? A previous study showed 9% of adult males had
hypertension.


16. To study the proportion of residents who live in neighbourhoods with
acceptable levels of carbon monoxide, what sample size would be needed
for the estimate to be within 1.5% of the proportion with 90% confidence?
A previous study showed that 90% of the residents live in neighbourhoods
with acceptable levels of carbon monoxide.
17. Motorola wishes to estimate the mean talk time for its V505 camera phone
before the battery must be recharged. If s = 31 minutes, how many phones
would Motorola need to test to estimate the mean talk time within five
minutes with 95% confidence?
18. The lives of light bulbs are normally distributed, with a known standard
deviation of 60 hours. If you want the sample mean to be within
three hours of the true population mean at 95% confidence, what sample
size is needed?
19. A meat packer is investigating the marked mass shown on Vienna sausages.
A pilot study showed a mean mass of 11.8 kg per pack and a standard
deviation of 0.7 kg. How many packs should be sampled in order to be 98%
confident that the sample mean will differ from the population mean by at most 0.2 kg?
20. A study is planned to determine the average annual family medical expenses
of government employees. A previously known standard deviation is R400
and the analyst wants to be 95% confident that the sample average is within
R50 of the true family expenses. How large a sample is necessary?
21. A publishing company wants to estimate the proportion of its customers
that would purchase TV programme guides. A 95% confidence is required
that the estimate is correct within 5% of the true proportion. If past
experience in other areas indicates that 30% of the customers will purchase
the programme guide, what is the sample size needed?



UNIT

11 Hypothesis testing

Methods for making inferences about population parameters fall into one of two
categories. In Unit 10 you studied how to estimate the value of the population
parameter of interest and in this unit you will learn how to test a claim or
hypothesis about a population parameter.

After completion of this unit you will be able to:


• explain the reasoning behind hypothesis testing
• explain the steps in the hypothesis testing procedure
• distinguish between a one-tailed and two-tailed test
• conduct tests of hypotheses concerning values of the following parameters:
–– population mean; large and small samples
–– population proportion
• conduct chi-square tests.

A hypothesis is a claim or statement about a population characteristic.


Hypothesis testing is a decision-making process to determine whether
enough statistical evidence exists to enable us to conclude that a belief or
hypothesis about a population parameter is reasonable. The fundamental idea
behind hypothesis testing procedures is this: the null hypothesis (H0) is rejected if
the observed sample is very unlikely to have occurred when H0 is true.
To make this decision it is necessary to decide whether the difference that
exists between the hypothesised population parameter and the sample result
is significant and therefore not supportive of the hypothesis, or whether the
difference is a chance difference and therefore supportive of the claim.
In the sampling process any one of the possible samples in the sampling
distribution might be selected. Most of the time this sample mean would not
equal the population mean. Such a difference is due to the sampling process and
is known as a chance difference. This difference is not large enough to cause
concern.


Results are statistically significant if the difference between the sample result
and the statement made in the null hypothesis is unlikely to occur due to chance
alone. It indicates that the sample came from a population with a mean other
than the hypothesised mean.

11.1 A single sample classical hypothesis test

Steps
Understand the problem
1. Set up the null and alternative hypothesis.

Define the test procedure and decision rule


2. Select the significance level (α) for the test.
3. Determine the type of sampling distribution (z or t).
4. Determine the critical value(s) and corresponding rejection region.

Collect and analyse the data


5. Collect the data, and calculate the test statistic.

Draw conclusions and make recommendations


6. Make the statistical decision by comparing the test statistic with the rejection
region.
7. Interpret the statistical decision.

11.1.1 Stating the hypotheses


To test a population parameter you should identify a pair of hypotheses; one
that represents the claim (H0) and the other, its complement, you want to test
for (HA).

H0: population characteristic (μ) = hypothesised value


• A null hypothesis, denoted by H0, is a statement of equality or no difference.
The ‘null’ implies that there has been no change or no difference in the value
of the parameter.
• The null hypothesis is assumed true until evidence indicates otherwise.


HA: state the alternative hypothesis


• The HA (or test hypothesis) states what the case will be if the null hypothesis
is not true.
• The value of the parameter appearing in HA must be identical to the one used
in H0.
• Either hypothesis – the H0 or the HA – may represent the original claim.
Different ways to set up the hypothesis
• A test in which we want to find out whether a population parameter (μ or π)
has changed (≠), regardless of the direction of change, is referred to as a
two-tailed test.
H0: population characteristic = hypothesised value (will remain unchanged)
HA: population characteristic ≠ hypothesised value (will be different)
• If we wish to determine whether the sample came from a population that
has a parameter (μ or π) less than or more than a hypothesised value, the
attention is focused on the direction of change, and the test is referred to as a
one-tailed test. It can be a left-tailed test or a right-tailed test.
H0: population characteristic = hypothesised value
HA: population characteristic < hypothesised value
or
H0: population characteristic = hypothesised value
HA: population characteristic > hypothesised value

Some common phrases that indicate the direction of test

> (right-tailed)      < (left-tailed)            = or ≠ (two-tailed)
Is greater than       Is less than               Is equal to or not equal to
Is above              Is below                   Is the same as or different from
Is higher than        Is lower than              Has not changed or has changed
Is longer than        Is shorter than
Is bigger than        Is smaller than
Is increased          Is decreased or reduced
Is at least           Is at most
Is not less than      Is not more than
An incline            A decline

11.1.2 Select a level of significance (α) to be used


When you perform a hypothesis test you can make one of two decisions: reject
H0 or do not reject H0. Because this decision is based on a sample and not on

231

Statistics_Method_BOOK.indb 231 2014/12/18 3:01 PM


Statistical Methods and Calculation Skills

the population, there is always a possibility that the decision could be wrong.
The probability associated with this uncertainty is your level of significance.
Just as we place a level of confidence in the construction of an interval, we can
determine the probability of making errors. The significance level is chosen by
the researcher before the sample data are collected.
• A type I error occurs if H0 is rejected when it is true; α is the probability
of this error.
• A type II error occurs if H0 is not rejected when it is false.
For example, when α = 0.10, there is a 10% chance of rejecting a true H0. You
can decrease the probability of rejecting H0 when it is actually true by lowering
the significance level.
The significance level is the maximum probability of making a type I error
and is denoted by α.
• The purpose of the level of significance is to provide a probability basis for deciding
whether an observed difference between a sample statistic and a hypothesised
parameter is a chance difference or a statistically significant difference, since α is
the probability that the test statistic will fall in the rejection area when H0 is true.
• Usually tests are performed at an α value of 0.01, 0.02, 0.05 or 0.10.
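
The meaning of α as the probability of a type I error can be illustrated with a simulation. This is only a sketch under stated assumptions: the population (μ = 50, σ = 8), the sample size and the number of trials are hypothetical, H0 is true by construction, and a two-tailed z test at α = 0.05 is applied to each sample.

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the run is repeatable

MU_0, SIGMA, N = 50.0, 8.0, 25   # hypothetical population; H0 is true
Z_CRIT = 1.96                    # two-tailed critical value for alpha = 0.05
TRIALS = 10_000

rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU_0, SIGMA) for _ in range(N)]
    x_bar = sum(sample) / N
    z = (x_bar - MU_0) / (SIGMA / sqrt(N))
    if abs(z) > Z_CRIT:
        rejections += 1          # a type I error: a true H0 is rejected

type_one_rate = rejections / TRIALS
print(f"share of true H0 rejected: {type_one_rate:.3f}")
```

The printed share should lie close to 0.05, since roughly α of all samples drawn under a true H0 fall in the rejection area by chance alone.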

11.1.3 Determine the type of sampling distribution


This step will enable you to know whether to use the normal z distribution or the
t distribution in determining the rejection area.
Use the normal z distribution:
• when the distribution is approximately normal with a known population
standard deviation (σ).
• when the sample size n ≥ 30 with an unknown or known population standard
deviation (σ).
Use the t distribution:
• when the population standard deviation (σ) is unknown and the sample size
n < 30. If n ≥ 30, the distribution approximates the normal curve via the
Central Limit Theorem and you use the normal z table.

11.1.4 Determine the critical value(s) and identify the rejection region
• The critical value represents the maximum number of standard deviations
that the sample mean or proportion can differ from the hypothesised value
before the null hypothesis (H0) is rejected.
• The critical value separates the area under the curve into two regions: the
non-rejection region and the rejection region. The rejection area(s) falls in the
tail(s) of the distribution.

• The rejection region (or decision rule) is a range of values such that, if the test
statistic falls into that range, we reject the null hypothesis (H0).

Steps
1. Specify the level of significance (α).
2. Decide whether the test is two-tailed, left-tailed or right-tailed.
The HA indicates whether to perform a one-tailed test or a two-tailed test.
• If HA: μ ≠ hypothesised value: two-tailed test
• If HA: μ < hypothesised value: left-tailed test
• If HA: μ > hypothesised value: right-tailed test
3. Find the critical value(s). One or two critical values are established on the
horizontal axis of the distribution, which serve as cut-off points between
the non-rejection and rejection areas.
• Critical values are expressed in the same measurement units as the test
statistic (z or t).
• A two-tailed test will have two critical values close to the two tails of the
curve. Each tail contains α/2 of the sample means farthest from
the hypothesised mean. The critical values (expressed as z or t scores) will
correspond to an area (0.5 − α/2) from the mean. The critical value in the left
tail will take on a negative sign and in the right tail, a positive sign.
• A one-tailed test will have one critical value placed close to one side of the
curve. This one tail contains α of the sample means farthest
from the hypothesised mean. The z or t score (from the corresponding table)
will correspond to an area of (0.5 − α) from the mean.
4. Sketch the normal curve. Draw a vertical line at the critical value(s) and
shade the rejection region(s).
5. State the rejection region in words.

Example 11.1

[Figure: three standard normal curves with shaded rejection regions. Two-tailed (0.5 − 0.025 = 0.475): 2.5% in each tail, critical values −1.96 and +1.96. Left-tailed (0.5 − 0.05 = 0.45): 5% in the left tail, critical value −1.64. Right-tailed (0.5 − 0.05 = 0.45): 5% in the right tail, critical value +1.64.]

1. A two-tailed hypothesis test at the 5% level of significance for a normal
distribution places α/2 = 2.5% in each tail. The area to look up in the
Normal table is (0.5 − 0.025) = 0.475. (Remember you can only look up
an area from the middle of the distribution (μ) to the cut-off point or critical
value.) If you look up 0.475 the corresponding z value is −1.96 to the left
and +1.96 to the right. These two values are your critical values.
The rejection region is given as:
reject H0 if the z test > 1.96 or if the z test < −1.96
2. A one-tailed test to the right at a 5% level of significance means all 5%
will go into the right-hand tail. The area to look up in the Normal table is
(0.5 − 0.05) = 0.45. The corresponding z value for this area is +1.64. This
1.64 is your critical value.
The rejection region is given as: reject H0 if the z test > 1.64
3. A one-tailed test to the left at a 5% level of significance means all 5%
will go into the left-hand tail. The area to look up in the Normal table is
(0.5 − 0.05) = 0.45. The corresponding z value for this area is −1.64. This
−1.64 is your critical value. The rejection region is given as: reject H0 if the
z test < −1.64.
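
The three cases above can be condensed into one helper. This is a sketch, not the book's procedure: it uses `statistics.NormalDist` in place of the Normal table, and the name `rejection_region` and its string output are illustrative choices.

```python
from statistics import NormalDist


def rejection_region(alpha, tail):
    """Describe the z rejection region; tail is 'two', 'left' or 'right'."""
    std_normal = NormalDist()
    if tail == "two":
        # alpha is split equally between the two tails.
        z = std_normal.inv_cdf(1 - alpha / 2)
        return f"reject H0 if z < {-z:.2f} or z > {z:.2f}"
    # One-tailed: all of alpha goes into a single tail.
    z = std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return f"reject H0 if z < {-z:.2f}"
    if tail == "right":
        return f"reject H0 if z > {z:.2f}"
    raise ValueError("tail must be 'two', 'left' or 'right'")


print(rejection_region(0.05, "two"))
```

For a t test the same logic applies, but the critical value would come from the t table at the appropriate degrees of freedom.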

Activity 11.1

Find the critical value(s) and rejection region for a two-tailed test, a left-tailed
test and a right-tailed test for an approximately normal distribution at levels of
significance of 1%, 2% and 10%.

11.1.5 Conduct the statistical test


By making use of the appropriate formula for the sampling distribution we
calculate how many standard errors (z or t) the sample result is away from the
assumed population value.

Normal z distribution:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Normal z distribution (if your parameter is a proportion):

$$z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}}$$

t distribution:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

11.1.6 Make the decision


• The decision rule is a statement that indicates the action to be taken, that is,
do not reject H0 or reject H0.
• Compare the test statistic with the critical value(s) to see if it falls within the
limits of the rejection area or outside the limits.
• If the test statistic falls within the rejection region, the sample evidence does
not support the null hypothesis (H0) that the parameter was the specified
value and we say reject H0.
• If the test statistic falls in the non-rejection region, the sample evidence does
support the H0 that the parameter was the specified value and we say do not
reject H0.
• We never accept H0. Sample evidence can never prove the null hypothesis to
be true. There is just no statistical evidence to reject it.
• As long as conclusions are based on sample data, there is a chance that an
error could be made.

11.1.7 Interpret the decision


The conclusion should be stated in the context of the original claim. The level of
significance should be included and you would say there is enough evidence to
support the claim or there is not enough evidence to support the claim.

Example 11.2

A machine is set to release 30 g of dried fruit into a box of cereal moving along
the production line. A sample of 36 boxes revealed that the average mass of fruit
inserted was 30.3 g with a standard deviation of 0.5 g. Is the increase in the
amount of fruit inserted significant at the 0.01 level of significance?
1. H0: μ = 30
2. HA: μ > 30 (indication of ‘more than’)
3. α = 0.01
The alternative hypothesis uses >, implying a right-tailed test.
The central limit theorem applies; therefore we use the z distribution.
The area is (0.5 − 0.01 = 0.49) and corresponds with a critical value of
2.33.
The rejection area is: reject H0 if the z test > 2.33
4. Test:

$$z = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{30.3 - 30}{0.5 / \sqrt{36}} = 3.6$$

5. [Figure: standard normal curve with an area of 0.5 − 0.01 = 0.49 between the mean and the critical value 2.33; "Do not reject H0" lies to the left of 2.33 and "Reject H0" in the 1% right tail.]
6. Since 3.6 > 2.33 we reject H0 at the 0.01 level of significance.
7. There is enough evidence to suggest that at a 1% level of significance there
is a significant increase in the amount of fruit inserted in a box of cereal.
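
The test statistic and decision of Example 11.2 can be reproduced with a short sketch. The function name `one_sample_statistic` is illustrative; the critical value 2.33 is the table value from the example.

```python
from math import sqrt


def one_sample_statistic(x_bar, mu_0, sd, n):
    """Number of standard errors between x_bar and the hypothesised mean."""
    return (x_bar - mu_0) / (sd / sqrt(n))


# Example 11.2: right-tailed test, critical z = 2.33 at alpha = 0.01.
z = one_sample_statistic(30.3, 30, 0.5, 36)
decision = "reject H0" if z > 2.33 else "do not reject H0"
print(f"z = {z:.1f}: {decision}")
```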

Activity 11.2

A sample of 100 healthy adult males has a systolic blood pressure of 125 mmHg
with a standard deviation of 15. Test at a 2% level of significance whether the
mean systolic blood pressure is different from the generally accepted level of
130 mmHg.

Example 11.3

An assembling process takes an average time of 35 minutes. It is thought that


a certain modification would reduce this time, and after being modified, the
process is repeated 13 times, giving an average time of 33.3 minutes with a

236

Statistics_Method_BOOK.indb 236 2014/12/18 3:01 PM


  Hypothesis testing

standard deviation of 2.4 minutes. Is there any significant reduction in the time
at a level of significance of 0.05?
1. H0: m 5 35
HA: m < 35 (reduction in time is an indication of less than)

2.  5 0.05
The alternative hypothesis uses , so the test is a one-tail test to the left.
Use the t distribution because  is unknown and the sample size is small.

Reject HO
Do not reject HO

5%

21.782

3. To look up the critical t value you need to know the direction of the test and
the α value. Find the α in the one-tail test row of the t table if the test is
one-tailed, and in the two-tail test row if two-tailed. Move down in the chosen
column to the required number of df. (Remember df is n − 1.) The critical t
value is where the df value corresponds with the α.

This example is a one-tail test at a 5% level of significance with 12 degrees
of freedom. In the one-tail test row, find the column with heading 0.05. Go
down that column to 12 degrees of freedom. The t value that corresponds
with this position is 1.782.

Reject H0 if the t test < −1.782

4. Test: $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} = \dfrac{33.3 - 35}{2.4/\sqrt{13}} = -2.55$
5. Since −2.55 < −1.782 we reject H0 at the 0.05 level of significance.

6. The sample evidence does suggest that there is a significant reduction of the
process time.
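
The arithmetic in Example 11.3 can be checked with a short script. This is only a sketch: the function name `one_sample_t` is an illustrative choice, and the critical value is typed in by hand from the t table rather than computed.

```python
import math


def one_sample_t(sample_mean, hypothesised_mean, sample_sd, n):
    """Return the t test statistic and its degrees of freedom (n - 1)."""
    standard_error = sample_sd / math.sqrt(n)
    t_stat = (sample_mean - hypothesised_mean) / standard_error
    return t_stat, n - 1


# Figures from Example 11.3: sample mean 33.3, H0 mean 35, s = 2.4, n = 13.
t_stat, df = one_sample_t(33.3, 35, 2.4, 13)

# Critical value read by hand from the t table (one tail, alpha = 0.05, df = 12).
T_CRITICAL = 1.782

# Left-tail test: reject H0 when the statistic lies below -T_CRITICAL.
reject_h0 = t_stat < -T_CRITICAL

print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```

For a right-tail test the comparison would be `t_stat > T_CRITICAL`, and for a two-tail test `abs(t_stat) > T_CRITICAL` with the critical value taken from the two-tail row of the table.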



Statistical Methods and Calculation Skills

Activity 11.3

From past records we know that the average unbroken sleep periods of patients
with a certain kind of insomnia is 2.8 hours. A new drug is tested on a sample
of 25 patients and this yields an average of three hours of unbroken sleep with
a standard deviation of 0.8 hours. Is there a significant improvement in the
number of hours of unbroken sleep? Test at α = 2.5%.

Example 11.4

Directors of a company claim that 90% of the workforce supports a new
shift pattern that they have suggested. A random survey of 100 people in the
workforce finds 85 in favour of the new scheme. Test at a 5% level if there is a
significant difference between the survey results and the directors’ claim.
1. H0: p = 0.9
HA: p ≠ 0.9 (no indication of ‘more than’ or ‘less than’)
2. α = 0.05
3. The central limit theorem applies; therefore we use the normal z distribution.
The alternative hypothesis uses ≠, indicating a two-tail test.
4. Reject H0 if the z test > 1.96 or if z test < −1.96
5. Test: $z = \dfrac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} = \dfrac{0.85 - 0.9}{\sqrt{\dfrac{0.9(1 - 0.9)}{100}}} = -1.67$
where $\hat{p}$ is the sample proportion (85/100 = 0.85) and p is the proportion stated in H0.
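[Figure: standard normal curve for the two-tailed test. Each tail holds 2.5%; the area between 0 and each critical value is 0.5 − 0.025 = 0.475; the critical values are −1.96 and 1.96.]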

6. Since −1.96 < −1.67 < 1.96 we do not reject H0 at the 0.05 level of significance.
7. At the 5% level of significance there is not enough evidence to conclude that
the proportion of the workforce supporting the new shift pattern differs from
the claimed 90%.
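
The calculation in Example 11.4 can be sketched in a few lines. The function name `one_proportion_z` is an illustrative assumption, and the critical value is taken from Python's `statistics.NormalDist` instead of the printed normal table.

```python
import math
from statistics import NormalDist


def one_proportion_z(successes, n, claimed_p):
    """z statistic for a sample proportion against the proportion in H0."""
    sample_p = successes / n
    standard_error = math.sqrt(claimed_p * (1 - claimed_p) / n)
    return (sample_p - claimed_p) / standard_error


# Figures from Example 11.4: 85 of 100 in favour, claimed proportion 0.9.
z_stat = one_proportion_z(85, 100, 0.9)

# Two-tail test at alpha = 0.05: half of alpha goes into each tail.
alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z_stat) > z_critical

print(f"z = {z_stat:.2f}, critical = ±{z_critical:.2f}, reject H0: {reject_h0}")
```

Using `inv_cdf` reproduces the familiar 1.96 without a table lookup; for a one-tail test the argument would be `1 - alpha` instead.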


Activity 11.4

To determine if new flavours of ice cream must be introduced into the market, a
random sample of 320 people was asked to taste and choose their favourite ice
cream flavour. Of the 320 people surveyed, 58 responded that they preferred the
chocolate flake flavour. If less than 25% of the sample prefers the new flavour,
it will not be used. Test the claim that less than 25% of people prefer chocolate
flake flavour ice cream at the α = 0.01 level of significance.

11.2 Hypothesis testing using the P value approach


In the hypothesis testing procedure there are two approaches to reaching a
conclusion: the classical approach, which was dealt with in 11.1, and the
probability or P value approach.
The P value of the test statistic is the probability, when the null hypothesis
is true, that a sample statistic takes a value equal to, or more extreme than,
the one actually observed in the sample.

Steps
(Note: The first three steps and the last step are identical to the classical
approach.)
1. State the null hypothesis, H0, and the alternative hypothesis, HA.
2. Choose the level of significance (α).
3. Determine the type of sampling distribution (normal z or t) and calculate the
test statistic.
4. Find the P value (or area) by looking up your test statistic in the z table or t
table.
• For a one-tailed test, the P value is the area in the tail beyond the test
statistic. (The normal table gives the area between 0 and z, so the tail
area is 0.5 minus the tabulated area.)
• For a two-tailed test, find the tail area in the same way and multiply it
by 2 to obtain the P value.
5. Make the decision.
Compare the P value to α to determine whether or not to reject the null
hypothesis.
P value ≤ α: reject H0
P value > α: do not reject H0


• A large test statistic, one that is larger than the critical value, is associated
with a small P value, one that is smaller than α.
• Small P values will result in the rejection of H0 and large P values will result
in failing to reject H0.
• You can conduct the hypothesis test at different levels of significance. If, for
example, the P value of a sample is 0.025, then you reject the null hypothesis
at the 5% level of significance, but not at the 1% level of significance.
• The conclusion may be based upon the P value alone, which is the lowest level
of significance at which H0 can be rejected.
If the P value < 0.01: strong evidence against H0
If the P value is 0.01 to 0.05: some evidence against H0
If the P value > 0.05: insufficient evidence against H0
6. Interpret the decision in the context of the original claim.
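
The P value steps above can be sketched for a z test statistic as follows. The helper names `z_p_value` and `decide` are illustrative assumptions; the tail areas come from `statistics.NormalDist` rather than the printed normal table, so small rounding differences from table values are possible.

```python
from statistics import NormalDist


def z_p_value(z_stat, tail):
    """P value for a z statistic; tail is 'left', 'right' or 'two'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "right":
        return 1 - std_normal.cdf(z_stat)
    if tail == "left":
        return std_normal.cdf(z_stat)
    if tail == "two":
        # Area beyond |z| in one tail, doubled for both tails.
        return 2 * (1 - std_normal.cdf(abs(z_stat)))
    raise ValueError("tail must be 'left', 'right' or 'two'")


def decide(p_value, alpha):
    """Decision rule: reject H0 when the P value is at most alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


# A P value of 0.025 leads to different decisions at different levels.
print(decide(0.025, 0.05))  # 5% level
print(decide(0.025, 0.01))  # 1% level
```

The t distribution is not available in the Python standard library, so for t tests the P value range would still be read from the t table.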

Example 11.5

A cell phone operator’s manager believes that customer monthly cell phone bills
average more than R85 per month. To test this claim, a sample of 64 customer
cell phone accounts was randomly selected. The mean of the sample is found to
be R88 and the standard deviation R16.
1. H0: μ = 85
HA: μ > 85
2. α = 0.05
3. Since our hypothesis involves the population mean and the sample size is
more than 30, the test statistic is z.
4. $z = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} = \dfrac{88 - 85}{16/\sqrt{64}} = 1.5$

Use the Normal table to look up the area of the z test of 1.5. The area is 0.4332.
The HA is an indication to conduct a one-tailed test to the right of the
distribution.
The P value of z test = 1.5 is 0.5 − 0.4332 = 0.0668. This means that the
probability of obtaining a sample whose mean is R88 or more when H0 is
true is 0.0668.
5. Decision: Since 0.0668 is greater than the level of significance, α = 0.05,
we do not reject H0.
6. Conclusion: At the 5% level of significance there is insufficient evidence that
the mean customer cell phone account is more than R85 per month.


Example 11.6

Finding the P value using the t distribution table in Appendix 2:
1. Locate the row in the table for df = n − 1.
2. Look for the absolute value of the test statistic in that row. Your observed
t test will most likely fall between two values of t.
3. Find the probabilities in the top row in either the one-tail or two-tail row.
The P value will be a range that falls between two probabilities.

The director of a private ambulance service company wants to perform a
hypothesis test, using a 0.05 level of significance, to determine whether their
goal of 15 minutes or less reaction time after a call is being achieved. A random
sample of 20 response times for medical emergencies was selected. The sample
mean is 14 minutes with a standard deviation of 3 minutes. What should he
conclude?
1. H0: μ = 15
HA: μ < 15
2. α = 0.05
3. Since our hypothesis involves the population mean with an unknown
standard deviation and a sample size of less than 30, the test statistic is t.
4. $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} = \dfrac{14 - 15}{3/\sqrt{20}} = -1.49$
5. Use the t table to look up the P value for the absolute t test value of 1.49
with df = 19. In the one-tail row it falls between the columns for 0.10 and
0.05, so 0.05 < P value < 0.10. This means that the probability of obtaining
a sample whose mean is 14 minutes or less when H0 is true is between
0.05 and 0.10.
6. Decision: Since the P value range is greater than the level of significance,
α = 0.05, we do not reject H0.
7. Conclusion: At the 5% level of significance there is insufficient evidence that
the mean reaction time is less than 15 minutes per emergency call.

Example 11.7

The National Road Safety Council claimed that 50% of the accidents that occur
over the Easter weekend would be caused by drunk driving. A sample of 130
accidents over the Easter weekend showed that 70 were caused by drunk driving.


Use these data to test the Council’s claim at a 0.05 significance level.
1. H0: p = 0.50
HA: p ≠ 0.50
2. α = 0.05
3. Since our hypothesis involves the population proportion and the sample size
is more than 30, the test statistic is z.
4. $z = \dfrac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}} = \dfrac{0.54 - 0.5}{\sqrt{\dfrac{0.5(1 - 0.5)}{130}}} = 0.91$
where $\hat{p}$ = 70/130 ≈ 0.54 is the sample proportion.

Use the Normal table to look up the area of the z test = 0.91. The area is 0.3186.
The HA is an indication to conduct a two-tailed test.
The P value of z test = 0.91 is: 0.5 − 0.3186 = 0.1814 per tail.
For both tails: 0.1814 × 2 = 0.3628
This means that the probability of obtaining a sample proportion at least as
far from 50% as the observed 54% when H0 is true is 0.3628.
5. Decision: Since 0.3628 is greater than the level of significance, α = 0.05,
we do not reject H0.
6. Conclusion: At the 5% level of significance there is insufficient evidence that
the proportion of accidents caused by drunk driving differs from the
claimed 50%.

11.3 Testing the difference among means and proportions

In this section we expand the idea of hypothesis testing to two independent
samples. We select random samples from two different populations to determine
whether the two population means or proportions are equal. The same basic
steps for single-sample hypothesis testing are used, with a slight difference in the
null and alternative hypotheses and the formulae. The test is based on selecting
pairs of samples and comparing the means or proportions of the pairs.
State the H0 against one of the following HA:
If H0: μ1 = μ2 then
HA: μ1 ≠ μ2 or
HA: μ1 > μ2 or
HA: μ1 < μ2
If H0: p1 = p2 then
HA: p1 ≠ p2 or
HA: p1 > p2 or
HA: p1 < p2


Example 11.8

The management of a mine wishes to investigate the effect of the four-day work
week on absenteeism. Two random samples of 40 were selected; employees of
group A worked 10-hour days (four-day week) and group B worked eight-hour
days (five-day week). If group A averaged four hours of absenteeism per week
with a standard deviation of 1.2 and group B averaged 4.4 hours of absenteeism
per week with a standard deviation of 1.5, should we conclude that the shorter
work week reduces absenteeism? Set α = 0.05.
1. State the null hypothesis and the alternative hypothesis:
H0: μ1 = μ2
HA: μ1 < μ2 (indication of ‘less than’)
2. Select the level of significance:
α = 0.05
3. Formulate the decision rule:
The alternative hypothesis uses <, so the test is a one-tail test to the left. The
central limit theorem applies and we use the z distribution.
Reject H0 if the z test < −1.64
4. Determine the value of the test statistic:
$z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{4 - 4.4}{\sqrt{\dfrac{1.2^2}{40} + \dfrac{1.5^2}{40}}} = -1.32$
5. Since −1.32 > −1.64 we do not reject H0 at the 0.05 level of significance.
6. There is not enough sample evidence to conclude that the shorter work
week reduces absenteeism.
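
The two-sample z statistic of Example 11.8 can be reproduced with the sketch below. The function name `two_sample_z` is an illustrative assumption, and the critical value 1.64 is the table value used in the text.

```python
import math


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent sample means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Figures from Example 11.8: group A (four-day week) against group B (five-day week).
z_stat = two_sample_z(4.0, 1.2, 40, 4.4, 1.5, 40)

# One-tail critical value at alpha = 0.05, as used in the text.
Z_CRITICAL = 1.64

# Left-tail test: reject H0 only when z falls below -Z_CRITICAL.
reject_h0 = z_stat < -Z_CRITICAL

print(f"z = {z_stat:.2f}, reject H0: {reject_h0}")
```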

Activity 11.5

A report on personal savings of 240 citizens of the Gauteng region showed that
the average annual savings was R9 300 with a standard deviation of R3 600.
The data for a sample of 150 citizens in the Western Cape region showed annual
savings of R8 400 with a standard deviation of R2 100. Test at a 5% significance


level whether there is a significant difference in the annual savings between the
two regions.

Example 11.9

To test whether the performance of two training methods is the same,
samples of individuals using each of the methods were checked. For the six
individuals from method one, the mean efficiency score was 35 with a standard
deviation of six. For the eight individuals in method two, the mean efficiency
score was 27 with a standard deviation of seven. Set α = 0.01.
1. H0: μ1 = μ2
HA: μ1 ≠ μ2 (indication of two-tailed test)
2. α = 0.01
3. The alternative hypothesis sign is ≠, so the test is a two-tailed test.
Both samples are small, therefore we use the t distribution.
If two samples are used, the number of degrees of freedom will be:
n1 + n2 − 2 = 6 + 8 − 2 = 12
In the two-tail test row of the t table, the column for 0.01 at 12 degrees of
freedom gives 3.055.
Reject H0 if the t test > 3.055 or if t test < −3.055
4. Test:
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \cdot \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \dfrac{35 - 27}{\sqrt{\dfrac{(6 - 1)6^2 + (8 - 1)7^2}{6 + 8 - 2}} \cdot \sqrt{\dfrac{1}{6} + \dfrac{1}{8}}} = 2.24$
5. Since 2.24 falls in the acceptance area, we do not reject H0 at the 0.01 level
of significance.
6. The sample evidence suggests that there is no significant difference in the
performance of individuals using the two training methods.
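
The pooled-variance t statistic in Example 11.9 involves several intermediate steps, so a short sketch may help when checking it. The function name `pooled_t` is an illustrative assumption; the critical value is left to the t table.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and degrees of freedom for two small samples."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_variance) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / standard_error, df


# Figures from Example 11.9: method one (n = 6) against method two (n = 8).
t_stat, df = pooled_t(35, 6, 6, 27, 7, 8)

print(f"t = {t_stat:.2f}, df = {df}")
```

The resulting statistic is then compared with the two-tail critical value for `df` degrees of freedom from the t table.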

Activity 11.6

The manufacturer of two styles of shoes (A and B) wishes to test the hypothesis
that the average retail price of style A is less than the average price of style B.
A random sample of 12 retailers who stock style A yielded an average price of


R146 with a standard deviation of R12. A random sample of 10 retailers who
stock style B yielded a mean price of R160 with a standard deviation of R15.
Assume that the two samples come from two normally distributed populations
and test the hypothesis at a 5% level of significance.

Example 11.10

Workers in two different mining groups were asked what they considered to be the
most important problem they have with management. In group A, 200 out of a
random sample of 400 workers felt that a fair adjustment of grievances was the
most important problem. In group B, 60 out of a random sample of 100 workers
felt that this was the most important problem. Would you conclude that these two
groups differed with respect to the proportion of workers who believed that a fair
adjustment of grievances was the most important problem? Set α = 0.1.
1. H0: pA = pB
HA: pA ≠ pB
2. α = 0.10
3. Reject H0 if the z test > 1.64 or if the z test < −1.64
4. z test:
$z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
where $\hat{p}_1$ = 200/400 = 0.5 and $\hat{p}_2$ = 60/100 = 0.6 are the sample proportions,
$\hat{p} = \dfrac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \dfrac{400(0.5) + 100(0.6)}{400 + 100} = 0.52$
and $\hat{q} = 1 - \hat{p} = 1 - 0.52 = 0.48$
$z = \dfrac{0.5 - 0.6}{\sqrt{0.52 \times 0.48\left(\dfrac{1}{400} + \dfrac{1}{100}\right)}} = -1.79$
5. Since −1.79 < −1.64 we reject H0 at the 0.10 level of significance.
6. There is significant evidence to conclude that the two groups differ in their
beliefs.
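
The pooled two-proportion z test of Example 11.10 can be sketched as follows. The function name `two_proportion_z` is an illustrative assumption, and the critical value 1.64 is the table value used in the text.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for two sample proportions using the pooled proportion."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled_p = (x1 + x2) / (n1 + n2)  # same as (n1*p1 + n2*p2) / (n1 + n2)
    pooled_q = 1 - pooled_p
    standard_error = math.sqrt(pooled_p * pooled_q * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error, pooled_p


# Figures from Example 11.10: 200 of 400 in group A, 60 of 100 in group B.
z_stat, pooled_p = two_proportion_z(200, 400, 60, 100)

# Two-tail test at alpha = 0.10 with the critical value used in the text.
Z_CRITICAL = 1.64
reject_h0 = abs(z_stat) > Z_CRITICAL

print(f"pooled p = {pooled_p:.2f}, z = {z_stat:.2f}, reject H0: {reject_h0}")
```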

Activity 11.7

The manufacturer of Munchy Breakfast Bars believes that her product will be
more popular in region Y than in region X, where it is currently being produced


and distributed. In order to check this hypothesis, a random sample was taken
from each region. The sample in region X contained 700 people, 560 of whom
claimed to prefer the taste of Munchy. In region Y, 525 of the 750 people sampled
responded favourably. Based on these results, does it appear that Munchy will be
more popular in region Y than in region X at a 2% significance level?

11.4 Tests using the chi-square distribution (χ²)

The chi-square hypothesis tests covered in this unit are used for qualitative data
and involve counting the data items that fall into the testing categories.
1. The chi-square distribution (χ²) is a continuous distribution.
2. The distribution is positively skewed but approaches the normal distribution
as the number of degrees of freedom increases.
3. It is used to test whether the observed frequency corresponds with the
expected frequency.
4. The χ² value is always positive.

Two basic tests that will be discussed here are the test for independence of
variables and the goodness-of-fit test.

11.4.1 Test for independence


This test is used to determine whether relationships exist between different
variables from a contingency table, or whether the variables may be considered
independent. The dimensions of such a table are described by the number of
rows (r) and the number of columns (k), written as (r × k).

Steps
1. State H0 and HA: the null hypothesis states that the two variables are
statistically independent. This means that knowledge of the one variable
does not help in predicting the other variable.
H0: the variables are independent (there is no relationship)
HA: the variables are dependent (there is a relationship)
2. Select the level of significance (α).
3. State the decision rule by defining the rejection region.

You must know the level of significance (α) and the degrees of freedom (df) to
find the critical value from the χ² table in Appendix 3.


df = (number of rows − 1) × (number of columns − 1) from the contingency
table = (r − 1)(k − 1)
The top row of the χ² table shows the significance level and the first column
contains the number of degrees of freedom. Because the χ² distribution is
positively skewed, the critical value will always be positive and in the right-hand
tail of the curve.
The acceptance region for H0 goes from the left tail of the curve to the χ²
critical value. To the right lies the rejection area.
You will reject H0 if the χ² test > χ² table value.

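[Figure: positively skewed χ² curve. The acceptance region lies to the left of the critical value and the rejection region lies in the right-hand tail.]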

4. Calculate the value of the chi-square test by substituting, cell by cell, the
values from the fo and fe table into the formula:

$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$

• Construct a table with columns showing the fo, fe and χ² value for each
entry.
• The observed frequencies (fo) are obtained from the sample data given in
the contingency table.
• In order to perform the chi-square test, expected frequencies (fe) are needed.
The fe for any given cell in the contingency table is the product of the total
of the frequencies observed in that row and the total of the frequencies
observed in that column, divided by the overall size of the sample.

$f_e = \dfrac{\text{row total} \times \text{column total}}{\text{overall total}}$


Example 11.11

A random sample of adults was selected from each of four ethnic groups in Cape
Town. They were asked to specify their primary source of news. The results were
as follows:
                 Ethnic group
Source         A     B     C     D    Total
TV            30    20    25    20     95
Radio         25    25    20    20     90
Newspaper     10    10     5    30     55
Total         65    55    50    70    240

Is there a relationship between ethnic groups and the source of news at a 2.5%
level of significance?
1. H0: there is no relationship between ethnic group and source of news
HA: there is a relationship between ethnic group and source of news
2. α = 0.025
3. Reject H0 if the χ² test value > 14.449
(use df = (r − 1)(k − 1) = (3 − 1)(4 − 1) = 6)
4. Test: $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$

Cell             fo      fe      (fo − fe)²/fe
TV, A            30    25.73      0.71
Radio, A         25    24.38      0.02
Newspaper, A     10    14.90      1.61
TV, B            20    21.77      0.14
Radio, B         25    20.62      0.93
Newspaper, B     10    12.60      0.54
TV, C            25    19.79      1.37
Radio, C         20    18.75      0.08
Newspaper, C      5    11.46      3.64
TV, D            20    27.71      2.15
Radio, D         20    26.25      1.49
Newspaper, D     30    16.04     12.15
Total           240   240        24.83


5. Decision: Because the test statistic (24.83) is greater than the critical value
(14.449), it falls in the rejection region and H0 should be rejected at a 0.025
significance level.
6. Conclusion: There is evidence to suggest that there is a relationship between
ethnic group and source of news.
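
The cell-by-cell calculation in Example 11.11 is repetitive, so a short sketch of it may be useful. The function name `chi_square_independence` is an illustrative assumption. Because the script does not round each expected frequency to two decimals, its total can differ slightly in the second decimal place from the hand-calculated 24.83.

```python
def chi_square_independence(observed):
    """Chi-square statistic and df for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    overall_total = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected frequency: row total x column total / overall total.
            f_e = row_totals[i] * column_totals[j] / overall_total
            chi_square += (f_o - f_e) ** 2 / f_e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, df


# Observed frequencies from Example 11.11 (rows: TV, Radio, Newspaper).
news_source = [
    [30, 20, 25, 20],
    [25, 25, 20, 20],
    [10, 10, 5, 30],
]
chi_square, df = chi_square_independence(news_source)

# Critical value from the chi-square table (alpha = 0.025, df = 6).
CHI_CRITICAL = 14.449
reject_h0 = chi_square > CHI_CRITICAL

print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_h0}")
```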

Activity 11.8

A manufacturer of women’s clothing is interested to know if age is a factor in
whether women would buy a particular garment, depending on its quality. A
researcher sampled three age groups and each woman was asked to rate the
garment as excellent, average or poor. Test the hypothesis, at a 5% level of
significance, that rating is not related to age group.

                 Age group
Rating        15–20   21–30   31–60
Excellent       40      47      46
Average         51      74      57
Poor            29      19      37

11.4.2 Goodness-of-fit tests


The χ² goodness-of-fit test for uniform distributions is used to determine whether
a set of sample data differs significantly from what is expected.

Steps
1. State the null and alternative hypotheses:
H0: The population under investigation fits some specified or expected
distribution.
HA: The population under investigation does not fit the specified or
expected distribution.
2. Select the level of significance: α is the criterion used to formulate the
rejection area for H0.
3. Define the rejection region: to find the critical values from the chi-square
distribution table you need the level of significance (α) and the degrees of
freedom: df = k − 1 where k is the number of possible outcomes in the
investigation.
Reject H0 if the χ² test statistic > χ² critical value.


4. Calculate the χ² test statistic: $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$
Where:
fo is the observed frequency from the sample data, and
fe is the expected frequency that is calculated to conform to the null
hypothesis that is being tested.
If the calculated χ² test statistic is zero, it means that the observed frequencies
and expected frequencies are identical, or exactly what we had expected.
5. Make the decision.
6. Interpret the decision.

Example 11.12

A manufacturer of soap wishes to know if consumers have a preference for
bath soap fragrances. To answer their question, a random sample of 200 adult
shoppers is offered a free bar of soap. The recipients may choose from among
four fragrances. The choices are as follows:

Rose   Lavender   Sandalwood   Lemon
 66       53          45         36

1. H0: there is no preference, i.e. all fragrances are equally popular.
HA: there is a preference in respect of fragrance, i.e. the fragrances are not
equally popular.
2. Level of significance: α = 0.01
3. Critical value: χ² = 11.345 using df = k − 1 = 4 − 1 = 3 (k = number of
groups or intervals)
Reject H0 if the χ² test > 11.345
4. Test: $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$

Fragrance      fo     fe    (fo − fe)²/fe
Rose           66     50      5.12
Lavender       53     50      0.18
Sandalwood     45     50      0.50
Lemon          36     50      3.92
Total         200    200      9.72

• fe: if we expect no preference, the expected frequency for all fragrances
should be equal (200 ÷ 4 = 50).


• The fo column total should be the same as the fe column total.
• There is a rule that no fe should be less than 5. When this happens,
combine adjacent classes.
5. Decision: The χ² test statistic 9.72 < 11.345, so it falls in the acceptance
region. There is no evidence to reject the H0 at a 0.01 significance level.
6. Conclusion: There is no evidence to suggest that there is a preference with
respect to fragrance.

Example 11.13

The respective car manufacturers’ shares of the national market are as follows:

Manufacturer    % of market share
Volkswagen             37
Toyota                 30
Delta                  15
BMW                    10
Mercedes                8

A random sample of 2 000 car owners in Pretoria revealed the following
ownership pattern: Volkswagen 758, Toyota 680, Delta 300, BMW 162 and
Mercedes 100.
Does the ownership pattern in Pretoria differ significantly from the national
pattern?
1. H0: the ownership pattern in Pretoria is the same as the national pattern.
HA: the ownership pattern in Pretoria differs from the national pattern.
2. α = 0.05
3. Reject H0 if the χ² test > 9.488 (df = k − 1 = 5 − 1 = 4)
4. Test: $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$

Manufacturer    National pattern (fe)    Pretoria pattern (fo)    (fo − fe)²/fe
Volkswagen          37% = 740                    758                   0.44
Toyota              30% = 600                    680                  10.67
Delta               15% = 300                    300                   0
BMW                 10% = 200                    162                   7.22
Mercedes             8% = 160                    100                  22.5
Total                   2 000                  2 000                  40.83


5. Decision: Because the test statistic (40.83) is greater than the critical value
(9.488), it falls in the rejection region and H0 should be rejected at a 0.05
significance level.
6. Conclusion: There is evidence to suggest that the pattern in Pretoria differs
from the national pattern.
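
The goodness-of-fit calculation in Example 11.13 can be sketched as follows. The function name `chi_square_goodness_of_fit` is an illustrative assumption, and the expected pattern is passed in as percentages. Because each term is not rounded before adding, the total can differ slightly in the second decimal place from the hand-calculated 40.83.

```python
def chi_square_goodness_of_fit(observed, expected_percentages):
    """Chi-square statistic, df and expected frequencies for a goodness-of-fit test."""
    sample_size = sum(observed)
    # Expected frequency for each category under H0.
    expected = [pct * sample_size / 100 for pct in expected_percentages]
    chi_square = sum(
        (f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected)
    )
    return chi_square, len(observed) - 1, expected


# Example 11.13: Volkswagen, Toyota, Delta, BMW, Mercedes.
pretoria_counts = [758, 680, 300, 162, 100]
national_shares = [37, 30, 15, 10, 8]

chi_square, df, expected = chi_square_goodness_of_fit(pretoria_counts, national_shares)

# Critical value from the chi-square table (alpha = 0.05, df = 4).
CHI_CRITICAL = 9.488
reject_h0 = chi_square > CHI_CRITICAL

print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_h0}")
```

For a test of no preference, such as four equally likely categories, the percentages would simply be `[25, 25, 25, 25]`.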

Activity 11.9

A delivery of assorted nuts is labelled as having 45% walnuts, 20% hazelnuts,
20% almonds and 15% brazil nuts. By randomly picking several scoops of nuts
from the bag delivered, the following count was obtained:

Walnuts   Hazelnuts   Almonds   Brazil nuts
   92         69         32          42

Could these findings be a basis for an accusation of mislabelling at a 2.5% level
of significance?

TEST YOURSELF 11

1. Frequent checks were made of the spending patterns of tourists returning
from countries in Asia. Results indicated that travellers spent an average
of R1 010 per day. In order to determine whether there has been a change
in the average amount spent, a sample of 70 travellers was selected and the
mean was determined as R1 090 per day with a standard deviation of R300.
Is there evidence of a significant increase in the mean amount spent per day
at the 0.01 level of significance?
2. The desired percentage of silicon dioxide in a certain type of cement is 5.0.
A random sample of 36 specimens gave a sample average percentage of
5.21 and a sample standard deviation of 0.38. Use a significance level of
0.01 and test whether the sample result indicates a change in the average
percentage.
3. A nutritionist claims that the mean tuna consumption by a person is 1.55 kg
per year. A sample of 60 people shows that the mean tuna consumption by a
person is 1.45 kg per year with a standard deviation of 0.51 kg. At α = 0.02,
can you reject the nutritionist’s claim?
4. A machine is set to fire 30 g of dried fruit into a box of cereal moving along
the production line. A sample of 36 boxes revealed that the average mass of


fruit inserted was 30.3 g with a standard deviation of 0.5 g. Is the increase
in the amount of fruit inserted significant at the 0.05 level of significance?
5. A company that makes cola drinks states that the mean caffeine content per
bottle of cola is 40 mg. The quality controller is convinced that it is lower. A
sample of 30 bottles of cola has a mean caffeine content of 39.2 mg with a
standard deviation of 7.5 mg. At α = 0.01, can the quality controller reject
the claim?
6. Hyperactive children are often disruptive in the typical classroom setting
because they find it difficult to remain seated for extended periods of time. The
typical number of ‘out-of-seat’ behaviours was 12.40 per hour. Treatment
was applied to a group of 25 hyperactive children and after treatment the
‘out-of-seat’ behaviours reduced to 11.60 per hour with a standard deviation
of 3.5. Using α = 0.01, can we conclude that this decline is significant?
7. Medical research has shown that repeated wrist extension beyond 20°
increases the risk of wrist and hand injuries. In each of 24 randomly selected
students in the information technology field, wrist extension was recorded
while using a mouse with a proposed new design. The sample mean was found
to be 24° with a standard deviation of 5°. Test the hypothesis that the mean
wrist extension for people using the new mouse design is greater than 20°.
8. You are involved in an environmental awareness programme and want to
test the claim that the mean waste generated by adults is more than 1.8 kg
per day. In a random sample of 15 adults you find that the mean waste
generated per person per day is 1.9 kg with a standard deviation of 0.54 kg.
At a 5% level of significance, is the claim justified?
9. A sample of 16 unflavoured ice cream tubs were selected at random and
subjected to chocolate flavouring. The sample mean time required to flavour
the ice cream was 13 minutes with a standard deviation of 2 minutes. Perform
a hypothesis test at the 1% level of significance to test that the population
mean time required to flavour ice cream is greater than 10 minutes.
10. A chicken producer claims that the average mass of a particular group of
chickens is 1 kg. Before agreeing to purchase, a customer selected a sample of
25 chickens, which yielded a sample mean of 1.12 kg and standard deviation
of 0.1 kg. If the masses can be considered to be normally distributed, should
the claim be rejected at the 1% level of significance?
11. A personnel manager claims that 60% of all single women hired for
secretarial jobs leave to get married within two years. An analysis shows
that of a random sample of 120 single women, 64 left to get married. Is this
evidence consistent with the company’s claim, at a 1% level of significance?


12. A plant is producing large numbers of water testing equipment of which, on
average, 2% are defective. In a random sample of 1 000, 3% are found to be
defective. Does this indicate a significant deterioration in the process? Test at
a level of significance of 0.02.

13. A company manufacturing salad dressings claimed that 85% of households
eat salad at least once a week. A nutritionist suspects that the percentage is
higher than this. She sampled 200 households and found that 170 of them
eat salad at least once a week. Conduct a test to address the nutritionist’s
suspicions. Use α = 0.10.
14. Club 60 claim that senior citizens participating in some sort of exercise have
a blood pressure lower than the average of 160 mmHg. To test this claim, 20
active senior citizens were selected at random and their blood pressure was
found to average 151 with a standard deviation of 12. Is Club 60’s claim
valid at a 10% level of significance?
15. A manufacturer claims that his market share is 60%. However, a random
sample of 500 customers reveals that only 275 are users of his product. Test
the claim at the 2% level of significance.
16. The sales manager wants to determine if the average size of orders received
by the company’s eastern branch differs significantly from the average size
of orders received from the western branch at a 2% level of significance.
A random sample of 90 orders from the eastern branch had a mean value
of R131.60 with a standard deviation of R25.80. A random sample of 55
orders received by the western branch had a mean value of R115.70 with a
standard deviation of R32.23.
17. A large bank is affiliated with both the Mastercard and Visa credit cards. For
a sample of 100 Mastercard holders it is observed that the average month-
end account balance is R680 with a standard deviation of R300. For a
random sample of 100 Visa cardholders the average month-end account
balance is R550 with a standard deviation of R265. Is the average for the
Visa cardholders significantly lower than the Mastercard average? Test at
α = 0.10.
18. A consumer testing service compared gas ovens to electric ovens by baking
one type of bread in five ovens of each type. The gas ovens had an average
baking time of 0.9 hours with a standard deviation of 0.09 hours and the
electric ovens had an average baking time of 0.7  hours with a standard
deviation of 0.16 hours. Test the hypothesis that the baking times are the
same in both kinds of ovens at the 5% level of significance. Assume the
baking times are normally distributed.


19. In order to conduct a consumer behaviour survey, a sample of 500 residents
was selected in a metropolitan area. One of the questions asked was ‘Do
you enjoy shopping for clothing?’ Of 240 males, 136 answered yes. Of 260
females, 224 answered yes. Determine whether there is evidence that the
proportion of females who enjoy shopping for clothing is higher than the
proportion of males, using a 5% level of significance.
20. In an experiment to compare the fracture toughness of high-purity steel
with commercial-purity steel of the same type, 32 specimens were selected
from each type. The sample mean and standard deviation toughness for
the high-purity steel specimens were 65.6 and 1.4 respectively. The sample
mean and standard deviation toughness for the commercial-purity steel
specimens were 59.2 and 1.1 respectively. Test at a 5% level of significance
whether a significant difference exists between the two types.
21. A supermarket chain is interested in determining whether a difference
exists between the mean shelf life (in days) of two different brands of bread.
Random samples of 50 freshly baked loaves of each brand were tested with
the results shown below:
                             Brand A   Brand B
Sample mean                    4.1       5.2
Sample standard deviation      1.2       1.4

Is there sufficient evidence to conclude that brand B has a longer shelf life
than brand A at a 2% level of significance?
22. In a public opinion survey, 60 out of a sample of 100 high-income voters
and 40 out of a sample of 75 low-income voters supported a decrease in
VAT. Can we conclude at a 5% level of significance that the proportion of
voters favouring a decrease differs between high- and low-income voters?
23. In an Aids awareness programme, it was found that 110 males in a random
sample of 310 males were aware of Aids. In another similar programme, it was
found that 87 women in a random sample of 290 women were aware of Aids. Test
at the 2% level of significance whether the first campaign was more successful.
24. Tests have been carried out on the effects of three fertilisers on sugar cane
growth. Each fertiliser was tried on several different plots of land. Each
value is a number of plots of land.


                Fertiliser
                A     B     C
Strong growth   94    124   44
Weak growth     50    96    38

Test for an association between the choice of fertiliser and plant growth at a
1% level of significance.
25. A car manufacturer is interested in predicting purchase patterns for a new
small capacity car they are producing. The car comes in four colours and
the manufacturer wants to relate colour preference to the gender of the
purchaser. Use the following sample data to test the hypothesis at a
10% level of significance.
         White   Green   Red   Silver
Male     260     240     175   420
Female   130     200     240   340

26. Two different manufacturers supply parts for a production process. Each
part is tested for six possible defects. The following table shows the number
of each type of defect by each supplier:
           Defect
Supplier   1    2    3    4    5    6
A          35   10   10   2    5    10
B          45   20   0    10   15   20

Would you conclude that the defect is independent of the supplier, using a
2.5% level of significance?
27. A sales manager has become interested in the number of sales calls made
by each of the employees. He reasoned that if all the employees are working
equally hard, they should make the same number of calls during a set period
of time. In order to investigate this hypothesis, the manager used a sample
of five employees and recorded the number of calls they made during a set
time period:
Employee      A    B    C    D    E
No of calls   31   62   59   40   58

At a 1% level of significance, is the manager’s idea supported?


28. An accountant for a department store knows from past experience that
23% of customers pay cash for their purchases, 35% write cheques and
the remaining 42% use credit cards. A random sample of 200 sales receipts
during a month-end week was examined and the following results were
obtained:
                      Cash   Cheque   Credit card   Total
Number of customers   37     47       116           200

Are the customers’ payment methods still the same as before? Use α = 0.05.
29. The manager of a local Spar counted the number of customers using the
store’s five checkout lanes during Friday and Saturday of a certain week.
The results were as follows:
                      Checkout lane
                      1     2     3     4     5
Number of customers   160   200   300   120   100

Lane 5 is closed much of the time because it is used during busy times only.
The manager suspects, prior to taking the actual count, that lane 5 will be
used half as often as lanes 1, 2 and 4. Checkout lane 3 is the express lane
and is used by twice as many people as lanes 1, 2 and 4. Test the manager’s
belief that certain lanes are used more than others, using α = 0.01.
30. Two companies have recently conducted aggressive advertising campaigns
to maintain or increase their respective market shares for a particular
product. Before the campaigns the market share of company A was 45%,
while company B had a share of 40%. Other competitors accounted for the
remaining 15%. To determine whether these market shares changed after
the campaigns, a market analyst determined the preferences of a random
sample of 200 customers of this product. Of the sample, 100 indicated a
preference for company A’s product, 85 preferred company B’s product and
the remainder preferred another competitor’s product. Conduct a test to
determine, at a 2.5% level of significance, whether the market shares have
changed from the previous levels.
31. A computer store owner feels that 50% of her customers purchase word-
processing programs, 25% purchase spreadsheet programs and 25%
purchase games. A sample of purchases shows the following distribution:


Word-processing   Spreadsheet   Games
38                23            19

Test her assumption at a 10% level of significance.


32. Supermarket chains often carry products with their own brand labels and
usually price them lower than the other brands. A supermarket conducted
a taste test to determine whether there was a difference in taste between the
four brands of ice cream it carries: own brand (A) and three other brands
B, C and D. A sample of 200 people participated and they indicated their
preference as shown in the table below:
A    B    C    D
39   57   55   49

Test at a 5% level of significance if there is a difference in preference for the
four brands.



PART 2
Calculation Skills

UNIT 12 Elementary calculations

In this unit you will revise your calculation skills and learn how to use your calculator.

After completion of this unit you will be able to:


• classify the numbers you are dealing with
• understand common mathematical notation
• apply rounding rules to the results of calculations
• deal with additions, subtractions, multiplications and divisions
• deal with signed numbers
• understand and use exponents, square roots, logarithms, factorials and summation
• deal with fractions, decimals and the metric system.

The purpose of this course is to provide you with the numeracy skills to
understand the basic principles of business calculations and make sound
decisions based on them.
These skills will benefit you in other subjects, in a business career, and even in
the everyday business of living.

12.1 The electronic calculator


1. Power switch: all calculators need to be switched ON; the power supplied
by the batteries or electricity allows the user to enter, display, calculate and
store values. When a calculator is switched on the display screen should
light up and display a set of numbers, usually a zero, or a zero and a number
of zeros after the decimal point. A feature of most modern calculators is that
the display becomes blank after several minutes of non-use. This is due to an
automatic ‘power-off’ function, which is designed to prolong the life of the
battery by turning off the display. Once this occurs, power can be restored
either by switching the power of the unit OFF and then ON again, or by
pressing the AC key.


2. The face of a calculator normally consists of two parts:


• The display screen, where numbers and/or letters are digitally displayed
for you to read. The display window shows calculation values and results.
Some calculators have a two-line display showing the calculation formula
on the first line and its answer on the second line.
• The keypad, where there are a number of keys (buttons) through which
the calculations are performed. Depending on the type of calculator used,
every key has an inscription on it, indicating the function it performs.
There may also be other inscriptions on the keypad itself, above or below
the different buttons. These inscriptions are usually coloured differently,
indicating that more than one function is assigned to that key (the same
button can perform another function).
• To invoke the other function, the SHIFT, 2nd, or INV key has to be pressed
prior to pressing the key itself. This method of multiple functions assigned
to individual keys helps to keep a calculator small, compact and versatile.

3. The clear or cancel key:


• Always make sure that you ‘clear’ the display screen before starting a new
calculation by touching the ‘cancel’ (C , CLR or AC) key on the calculator.
• If you enter a series of values for a calculation and have typed an incorrect
value, you correct it without destroying the previous intermediate
calculations by pressing the C key or, on some calculators, an arrow:
◀ or ▶. It clears the last entry and waits for you to continue.
• The AC key clears any pending calculations and resets the display to ‘0’.
Any values in the other memory locations are unaffected by it.
• Data held in the independent memory (M), as well as MODE specifications,
are held in memory even when power is turned OFF. Pressing the SHIFT,
2nd, or INV key followed by the C or AC key clears the storage memory
locations.

4. The MODE key: the keys on a calculator perform different operations
depending on the mode entered. Pressing the MODE key followed by a
number (from 1 up to the number of modes available on the calculator)
sets the specific mode required. The different calculation modes
are entirely independent and cannot be used in combination with each other.
Some modes available on calculators are: normal calculations, standard
deviation calculations, regression calculations and interest calculations.


5. The number of digits that can be entered into a calculator depends on the
size of the display, normally 10 digits. When the resulting answer exceeds
the display limit, the value is displayed in scientific notation. The display
reads as follows in these cases:
• 1.4 05 is 140 000. The 05 to the right of the value means that the decimal
point must move five places to the right.
• 1.4 –04 is 0.00014. The decimal point is moved four places to the left.
6. The different keys to use for specific calculations will be mentioned when
the operation is dealt with in this unit. It is also recommended that you
keep your calculator’s owner’s manual or user’s guide accessible because
different calculators use different methods, which your trainer may not be
familiar with.
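
The scientific-notation display described in point 5 can be checked with a short Python sketch. The function name display_to_value is hypothetical and is used here only for illustration; it takes the value shown on the display and the two-digit exponent shown to its right.

```python
def display_to_value(mantissa: float, exponent: int) -> float:
    """Convert a calculator display such as '1.4 05' to an ordinary number.

    The exponent says how many places the decimal point moves:
    to the right when it is positive, to the left when it is negative.
    """
    return mantissa * 10 ** exponent


large = display_to_value(1.4, 5)    # display reads 1.4 05
small = display_to_value(1.4, -4)   # display reads 1.4 -04

print(round(large))      # 140000
print(round(small, 5))   # 0.00014
```

The results are rounded before printing because a computer stores most decimal values only approximately.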

12.2 The number system


The number system we use today is known as the Hindu-Arabic number system.

12.2.1 Classification of numbers


Number system
    Real numbers
        Rational numbers
            Integers
            Fractions
        Irrational numbers
    Imaginary numbers

12.2.2 Real numbers and imaginary numbers


Numerical values can be classified as real numbers and imaginary numbers.

Imaginary numbers are the square roots of negative numbers, such as √(−2),
and it is difficult to attribute any real or practical significance to them. They form
part of the complex number system, which is beyond the scope of this text.
The number system that enables us to assign a number to every possible point
on a number line is called real numbers. A number line is a scaled straight
line with a zero (0) point, on which we can indicate the position of numbers


and their interrelations. Negative numbers are indicated to the left of the 0 and
positive numbers to the right of the 0. The number line can extend to infinity
with fractions or decimals between the whole numbers.
−5   −4   −3   −2   −1   0   1   2   3   4   5

Real numbers consist of rational numbers and irrational numbers.

Activity 12.1

Fill in the missing numbers in each of the following rows:


66 67 71 73
3 9 27
30 40 60
100 120
36 48

12.2.3 Rational numbers and irrational numbers


A rational number can be expressed as the ratio or fraction a/b, in which both
numerator and denominator are whole numbers and b ≠ 0, or as a decimal that
repeats or terminates.
In the fraction a/b, a is called the numerator and b is called the denominator.

Division by 0 is undefined.
The three groups of rational numbers are:
1. Integers or whole numbers. For example, 6 is a rational number because it
can be written as 6/1 (both numerator and denominator are whole numbers).
2. Finite or terminating fractions. For example: 1/4 is a rational number
because the decimal expression 0.25 is terminating or finite.
3. Recurring or repeated fractions. For example: 2/3 is a rational number
because the decimal expression 0.666… has a pattern that repeats and
carries on forever. This recurring decimal can also be written as 0.6̇ (a 6 with a
dot above it).

Irrational numbers are real numbers which cannot be expressed as the ratio of
two integers. The decimal value cannot be expressed with either a finite number


of decimal places or a repeating pattern. The never-ending string of digits will
not form a pattern that continues to repeat itself. Any irrational number falls
between two rational numbers.

Example 12.1

Non-recurring decimals
√5 = 2.2360679775… is an irrational number because it cannot be written as a/b
and this never-ending string of digits will not form a pattern that continues to
repeat itself.
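
As a rough illustration of the difference between rational and irrational numbers, the following Python sketch uses the standard fractions and math modules. The variable names are chosen for illustration only.

```python
from fractions import Fraction
import math

quarter = Fraction(1, 4)       # a terminating decimal
two_thirds = Fraction(2, 3)    # a recurring decimal
six = Fraction(6, 1)           # an integer written as a ratio

print(float(quarter))                # 0.25
print(f"{float(two_thirds):.6f}")    # 0.666667 (the 6s repeat; the last digit is rounded)
print(six == 6)                      # True

# The square root of 5 has no exact ratio; the computer stores only an approximation.
root5 = math.sqrt(5)
print(f"{root5:.10f}")               # 2.2360679775
```

The rational numbers are held exactly as a numerator and a denominator, while the square root of 5 can only be shown to a chosen number of decimal places.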

Activity 12.2

Classify the following numbers as rational or irrational.

1. √2

2. 5/7

3. 3/13

4. 13/19

Note: The distinction between rational and irrational numbers is of very little
significance as far as practical applications are concerned. This is due to the fact
that any irrational number can be approximated (rounded) to any desired degree
of accuracy by means of a rational number.

12.2.4 Whole numbers and fractions


An integer is a positive whole number (1, 2, 3, …, also known as a natural or
counting number), a negative whole number (−1, −2, −3, …) or the value
zero (0). The number zero is neither negative nor positive, and in that sense is
unique. The zero point is called the origin of the number system. Integers are
rational numbers, because an integer n can be considered as the ratio n/1.
Fractions can be classified as proper fractions, such as 1/2, 3/7 or 3/4, or improper
fractions, such as 4/2, 13/7 or 13/4, where the numerator is always bigger than the
denominator.


A mixed number, such as 1 1/2, 1 6/7 or 5 3/4, is a whole number with a fraction.
Any mixed number can be turned into an improper fraction.
To change a fraction into a decimal, divide the numerator by the denominator.
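
The two conversions just described can be sketched in Python as follows. The helper name mixed_to_improper is hypothetical, and the sketch assumes a positive mixed number.

```python
from fractions import Fraction


def mixed_to_improper(whole: int, numerator: int, denominator: int) -> Fraction:
    """Turn a positive mixed number such as 1 1/2 into an improper fraction (3/2)."""
    return Fraction(whole * denominator + numerator, denominator)


one_and_a_half = mixed_to_improper(1, 1, 2)
five_and_three_quarters = mixed_to_improper(5, 3, 4)

print(one_and_a_half)                   # 3/2
print(five_and_three_quarters)          # 23/4
print(float(five_and_three_quarters))   # 5.75 (numerator divided by denominator)
```

The whole number is multiplied by the denominator and added to the numerator; dividing the resulting numerator by the denominator gives the decimal.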

Note: A natural number is a counting number beginning with 1.

Activity 12.3

In the table below tick (✓) in the column(s) that correctly describe the number as
real, irrational, rational, integer, whole or natural. Use a calculator to help you.

Number    Real     Irrational   Rational   Natural   Whole
          number   number       number     number    number
√5
54 876
−16
8/5
1.25
(1/2)³
3 1/7

12.3 Common notation
Mathematical ‘shorthand’, or symbols, is often used in analysing and presenting
results rather than descriptive text.

Arithmetic symbols
+   add             −        subtract
×   multiply        ÷ or /   divide
<   less than       ≤        less than or equal to
>   more than       ≥        more than or equal to
=   equal to        ≠        not equal to
±   plus/minus      Σ        sum of
≈   rounded as      n!       factorial


12.4 Basic operations

12.4.1 Hierarchy in calculations


The order of operations is a set of rules that mathematicians have agreed to
follow to avoid mass confusion when simplifying mathematical expressions or
equations. When more than one calculation has to be made to find the solution
of a mathematical expression, you must follow a specific order of operations.
The priority order of operations is as follows:
1. Exponents and roots are all treated as functions. Turn the functions into
numbers first.
2. Perform any calculations inside brackets ( ).
3. Do all powers and roots.
4. Complete all multiplication and division, working from left to right.
5. Perform additions and subtractions, working from left to right.

Note: To change the order of priority, brackets or the calculator can be used. Note
that the multiplication symbol ‘×’ is frequently omitted in some expressions. For
example: 6 × (5 − 2) will normally be shown as 6(5 − 2).
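
A programming language such as Python follows the same priority order, so a few short expressions can serve as a sketch of the rules. The numbers are chosen for illustration and are not taken from the activities.

```python
# Multiplication is done before addition.
no_brackets = 2 + 3 * 4
print(no_brackets)        # 14

# Brackets change the order of priority.
with_brackets = (2 + 3) * 4
print(with_brackets)      # 20

# The multiplication sign may not be omitted in Python:
# 6(5 - 2) has to be typed as 6 * (5 - 2).
implied_product = 6 * (5 - 2)
print(implied_product)    # 18

# Powers are done before multiplication.
power_first = 2 * 3 ** 2
print(power_first)        # 18
```

Typing an expression with and without brackets is a quick way to see how much the order of operations affects the answer.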

Activity 12.4

Do the following calculations:


1. 2 × 6 + 3 − 4/2 − 5 + 20/5 × 3 + 50
2. (3 + 3 − 5)(15 − 5)10 − log 99
3. 4 + 5 − 7 + 8 × 5 − 12 × 2/8 + 6 − 3 + 20 ÷ 2
4. 9 × 9 − 30 ÷ 3 + 5 − 6 + 7 − 2 + 9 × 9
5. [10 × 4 − 6 + 7 − 8/2 + 3 × 3 + (4 + 5 − 6/3) + 1]/2
6. [(3 + 4 − 6/2 + 2) + (9/3 + 6 × 5)/11] × [(4 + 5 − 6) + (18 − 3 × 4)]/9
7. 2 + 3 × 15 − 10/2 + (5 + 1)/3
8. 30/15 − 10
9. 30/(15 − 10)
10. (40 × 2 − 50/2) × (9/3)/100

12.4.2 Adding and subtracting (+ and −)


Addition and subtraction are the most common of the fundamental operations
and are easy to perform. Adding and subtracting with speed and accuracy can


be achieved only through practice. Adding is to add more (or calculate the sum
of) and subtracting is to take away (or calculate the difference).

Note: When a calculation contains only additions and subtractions, the order in
which they are done does not matter, as long as each number keeps its own sign.

Activity 12.5

1. Do the following calculations:


a) 28 + 5 + 3 + 6 =
b) 16 − 7 − 2 =
c) 7 + 46 − 15 + 3 =
2. The distance from Johannesburg to Polokwane is 340 km. The distance from
Johannesburg to Pretoria is 59 km. How far is it from Pretoria to Polokwane?
3. The following number of items had to be scrapped at the end of the day due
to damage: 21 shirts, 32 pairs of pants, 10 pairs of shoes and 53 jerseys.
How many items had to be scrapped?
4. Harry worked 15 hours of overtime in January, 12 hours of overtime in
February and four hours of overtime in March. What is the total number of
hours Harry has worked overtime?
5. There are 11 people in a taxi. At the next stop five people get off and nine get
on. How many people are now in the taxi?
6. You have 57 items in stock at the beginning of a shift. During the shift you
sell 33 items and receive a new delivery of 25 items. How many items will
you have in stock at the end of the shift?
7. If you want to buy a new suit for R599.99 and you have only R315 available
in your account, how much must you pay into your account to be able to
buy the suit?
8. If you left at 07h00 in the morning and arrived at 12h45 that afternoon,
how long did the journey take?

12.4.3 Multiplying and dividing (× and ÷)


The arithmetic operation to determine the product of numbers is called
multiplication (×).
Division (÷) is the arithmetic operation that finds how many times one
number goes into another.

Note: The order in which you multiply and divide does not matter, but, if the


calculation includes addition and subtraction, you must first calculate values
inside brackets, or multiply and divide, before you add and subtract.

Note: An alternative to the multiplication sign (×) is the multiplication point (·),
a point that is set above the line and not to be confused with the decimal point.
An alternative to the division sign (÷) is the right oblique (/), as used in
writing fractions.

Example 12.2

1. You have four boxes with 24 bars of soap in each and you want to know how
many bars of soap you have in total:
4 × 24 = 96
2. You must pack 100 items in boxes containing five items each and you want
to know how many boxes you need:
100 ÷ 5 = 20 boxes

Activity 12.6

1. Do the following calculations:


a) 379 × (−15) =
b) 69 ÷ 13 =
c) 36 ÷ 6 × 5 =
2. If a shop is open from 08h00 in the morning till 21h00 at night, how many
hours must each employee work if there are two shifts during the day?
3. Abel takes 10 minutes to unpack one box of shirts. If there are 25 boxes,
how long will it take him to unpack all the boxes? If there are eight shirts per
box, how much time does he spend to unpack each shirt?
4. Deon’s wage is R25 per hour. If he works eight hours per day for five days
and only four hours on Saturday, what is his weekly wage before tax?
5. If you buy nine pairs of socks at R14.99 per pair, how much change should
you get if you pay with a R200 note?
6. You need 250 g of flour to bake a loaf of bread. If you have a 5 kg packet of
flour, how many loaves of bread can you bake?
7. Your truck can take, at most, 4 tonnes at a time. If you want to move 38
tonnes of clothing to the warehouse, how many trips will you have to make?


If each trip takes 2 hours and 15 minutes, how many hours will you take to
move the clothing? How many tonnes will you take on each trip?
8. You want to put a carpet in the staff room. You have to pay R55 per m². The
room is 11 m long and 6 m wide. How many square metres of carpet do you
need? How much will it cost you?

12.5  Signed numbers


1. When adding numbers of the same sign, find the sum of the numbers and
use the sign common to all terms.
2 + 3 + 4 = +9
(−2) + (−3) + (−4) = −9
2. When adding numbers of different signs, find the sum of the positive
numbers and the sum of the negative numbers, and then subtract the
smaller sum from the bigger one and designate the sign of the bigger sum.
(−2) + 3 + 4 + (−1) = +7 − 3 = +4
2 + (−3) + (−4) + 1 = +3 − 7 = −4
3. When subtracting a negative number, change the sign of the negative
number being subtracted and add the number to the rest.
2 + 3 − (−4) = +9
4. When multiplying or dividing numbers with the same sign, the answer will
always be positive.
−2 × −2 = +4
2 × 2 = +4
−2 ÷ −4 = +0.5
5. When multiplying or dividing numbers with unlike signs, the answer is
always negative.
−2 × 2 = −4
3 × −4 = −12
−6 ÷ 2 = −3

Note: Use the (−) or ± key on the calculator to change the sign of a number.
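
The five rules can be checked with a brief Python sketch. The variable names are illustrative only; the numbers are those of the rules above.

```python
# Rule 1: same signs, so add and keep the common sign.
same_sign_sum = (-2) + (-3) + (-4)
print(same_sign_sum)                 # -9

# Rule 2: different signs, so total the positives and negatives separately.
positives = 3 + 4                    # 7
negatives = 2 + 1                    # 3
mixed_sum = (-2) + 3 + 4 + (-1)
print(mixed_sum, positives - negatives)   # 4 4

# Rule 3: subtracting a negative number is the same as adding it.
minus_negative = 2 + 3 - (-4)
print(minus_negative)                # 9

# Rule 4: like signs give a positive answer.
like_product = -2 * -2
like_quotient = -2 / -4
print(like_product, like_quotient)   # 4 0.5

# Rule 5: unlike signs give a negative answer.
unlike_product = 3 * -4
unlike_quotient = -6 / 2
print(unlike_product, unlike_quotient)   # -12 -3.0
```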

Activity 12.7

1. (+3) + (−2) =
2. (−2) − (+3) =
3. (−4) + (−7) =
4. 0 − (−3a) =


5. (−3) + (+1) =
6. (+5) × (+1) =
7. (−xy)(−1) =
8. (−12xy) ÷ (−4) =
9. (+2t)(−t) =
10. (−5p) × (+2p) =

12.6 Exponents (powers) (xʸ)


When a value is multiplied by itself some number of times, a superscript number
can be placed at the upper right-hand side of the value.
For example: 2 × 2 × 2 × 2 = 2⁴ (read as two to the power of four). The
superscript number (4) is known as the exponent and 2 as the base. If a number
is raised to the power of 2, it is known as the square of the number.

Note: Use the power key xʸ or ^ on the calculator to calculate the answer. Enter a
value for x, press the power key, and then enter a value for y.
2 xʸ 4 = 16
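
The same calculation can be sketched in Python, where the ** operator plays the role of the power key. The variable names are illustrative only.

```python
base = 2
exponent = 4

repeated = base * base * base * base   # 2 x 2 x 2 x 2
power = base ** exponent               # two to the power of four

print(repeated, power)   # 16 16

square_of_seven = 7 ** 2
print(square_of_seven)   # 49
```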

12.7 Square roots (√)
The square root is the inverse of squaring. The root of a number is that quantity
which, when multiplied by itself, equals the number.

For example: √25 = 5 (read as the square root of 25 equals 5) and 5² = 25.

Other roots – the third, fourth etc – are possible, and all roots can also be written
as fractional exponents. To convert a root to its exponential form, the index of the
root is first converted to its reciprocal and the quantity is then raised to that
reciprocal power.

For example: ³√16 = 16^(1/3) or ⁴√5 = 5^(1/4) or √25 = 25^(1/2).

Note: Use the √x or ⁿ√x key on the calculator. If an ⁿ√x key is not available on
your calculator, convert the root to a fractional exponent and use the power key
together with the reciprocal key 1/x or x⁻¹.
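
The link between roots and fractional exponents can be sketched in Python as follows, using the three roots from the example above. The variable names are illustrative only, and the results are rounded to four decimal places for display.

```python
import math

square_root = math.sqrt(25)
as_power = 25 ** (1 / 2)          # the same root written as a fractional exponent
print(square_root, as_power)      # 5.0 5.0

cube_root_16 = 16 ** (1 / 3)      # third root of 16
fourth_root_5 = 5 ** (1 / 4)      # fourth root of 5
print(round(cube_root_16, 4))     # 2.5198
print(round(fourth_root_5, 4))    # 1.4953
```

Raising each answer back to the power of the root (for example 2.5198 to the power of 3) returns approximately the original number, which is a useful check.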

Activity 12.8

Calculate the following expressions accurate to two decimal places:



1. √8.4 =


2. √[(7.51 − 2.76)/3.25] =

3. ³√65 =

12.8 Logarithms (log)
The logarithm of a number is the power (exponent) to which a base number
must be raised to produce that number.
The common logarithm of a number is the power to which 10 must be raised to
give that number. For example: the log of 1 000 is 3 because 10³ = 1 000. All numbers
can be converted to a common base of 10. When the base is not indicated, it is
understood to be 10.
Logarithms which use e = 2.718… as their base are called natural logarithms
and can be denoted by ln x. The anti-logarithm of the natural log ln x is eˣ.

For example: ln 5 = 1.6094

e^1.6094 = 5

Note: To calculate a log on the calculator press the log key and then the value.
On some calculators the value is entered first, followed by the log key. To obtain
the eˣ value press the eˣ key followed by the exponent. On some calculators the
exponent is entered first, then the eˣ key.
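
The examples in this section can be reproduced with Python's standard math module, as a rough check of the calculator results. The variable names are illustrative only.

```python
import math

common_log = math.log10(1000)      # power to which 10 must be raised to give 1 000
print(common_log)                  # 3.0

natural_log = math.log(5)          # ln 5
print(round(natural_log, 4))       # 1.6094

anti_log = math.exp(1.6094)        # e to the power 1.6094
print(round(anti_log, 2))          # 5.0
```

Note that math.log without a second argument gives the natural logarithm (ln), whereas the log key on a calculator gives the common logarithm.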

Activity 12.9

Calculate the following expressions accurate to the nearest hundredth:


1. log 340 =
2. 1 + 3.3 log 50 =
3. e³ =
4. e⁻¹² =

12.9 Factorial notation (!)


The symbol n! is read as n factorial, and is a shorthand way of identifying the
product of all the positive whole numbers from 1 up to n.
For example: 5! = 5 × 4 × 3 × 2 × 1 = 120
10! = 10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 3 628 800


Notes:
• The symbol n! has no meaning if n represents anything other than a positive
whole number or zero. ((−3)! is undefined.)
• The value of 0! is defined to be 1. (0! = 1)
• To obtain the factorial value from the calculator, enter the value followed by
the x! or n! key.
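
These notes can be illustrated with Python's math.factorial function. The sketch below is only a check of the definitions; the ratio 6!/4! is an illustrative example that is not taken from the activities.

```python
import math

five_factorial = math.factorial(5)
zero_factorial = math.factorial(0)
print(five_factorial)    # 120
print(zero_factorial)    # 1

# A factorial of a negative number is undefined, so Python raises an error.
try:
    math.factorial(-3)
    negative_is_defined = True
except ValueError:
    negative_is_defined = False
print(negative_is_defined)   # False

# In a ratio of factorials the common terms cancel: 6!/4! = 6 x 5.
ratio = math.factorial(6) // math.factorial(4)
print(ratio)             # 30
```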

Activity 12.10

Do the following calculations:


1. 9! =

2. 13!/10! =

3. 20!/(8! 12!) =

12.10 Sigma notation (Σ)


The Greek capital letter sigma (Σ) stands for ‘sum the appropriate values’.

Thus we write x₁ + x₂ + x₃ + x₄ + … + xₙ as Σ xᵢ, with i = 1 written below the Σ
and n written above it.

This means the sum of all the x values from the first through the nth. This index
system must be used whenever only part of the available information is to be used. In
statistics, however, we will usually use all the available information and the
notation will be adjusted by doing away with the index system.

Σ xᵢ (for i = 1 to n) will become Σx if all the data is used.

Activity 12.11

Use the x values 5 3 2 6 3 to calculate:


1. Σx =
2. Σx² =
3. (Σx)² =
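
The three sums in this kind of exercise are easily confused, so the following Python sketch shows how they differ. The data values are hypothetical and are deliberately not those of the activity.

```python
x = [1, 2, 3, 4]   # hypothetical data values

sum_x = sum(x)                                   # sum of x: add the values
sum_x_squared = sum(value ** 2 for value in x)   # sum of x squared: square each value, then add
square_of_sum = sum(x) ** 2                      # (sum of x) squared: add first, then square

print(sum_x, sum_x_squared, square_of_sum)       # 10 30 100

# With the index system, only part of the data can be summed,
# for example the first three values (i = 1 to 3).
first_three = sum(x[0:3])
print(first_three)                               # 6
```

The sum of the squares (30) and the square of the sum (100) are in general different quantities.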


12.11 Fractions
A fraction is a number that can represent part of a whole and is denoted by
a/b, where a and b are integers and b ≠ 0.

• The numerator (a) at the top tells you how many parts of the whole are
actually used.
• The denominator (b) at the bottom tells you how many parts the whole has
been divided into. If the numerator is larger than the denominator, we have
an improper fraction. If the numerator is smaller than the denominator, we
have a proper fraction.
• A horizontal line drawn between the numerator and denominator separates
them. This horizontal line indicates that the numerator value must be divided
by the denominator value to find a single numerical result. A decimal can
thus be found by dividing a numerator by a denominator.
• Converting the fraction to a decimal before performing the arithmetic
operation can save computational time.

Example 12.3
The fraction 3/4 indicates 3 is to be divided by 4 as follows:
3 ÷ 4 = 0.75 or 3/4 = 0.75

Rules governing fractions


• Any fraction in which the denominator equals the numerator has a value of one.
{8/8 = 1}
• Division by 0 is mathematically undefined, whereas a fraction with a numerator
of 0 and a non-zero denominator always equals 0.
• When multiplying or dividing both the numerator and the denominator by
the same value, the fraction value does not change.
{1/2 × 5/5 = 1/2}
• A fraction can be reduced to its lowest terms by dividing both the numerator
and denominator by a common factor.
{(5 ÷ 5)/(10 ÷ 5) = 1/2}
• To add or subtract fractions with the same denominator add or subtract the
numerators and write the result over the common denominator.
{3/5 + 1/5 = 4/5}


• To add or subtract fractions with different denominators, a common
denominator is found by multiplying the denominators and restating each
fraction with the common denominator.
{5/8 + 1/3 = [5(3) + 1(8)]/24 = 23/24}
• To multiply fractions multiply both the numerators and denominators.
{2/5 × 3/7 = 6/35}
• To multiply a fraction by a whole number multiply the numerator by the whole
number, maintaining the same denominator.
{3 × 1/2 = 3/2 = 1 1/2}
• When dividing by a fraction, invert the fraction in the denominator and then
multiply the numerator by the inverted fraction.
{(5/12) ÷ (3/4) = 5/12 × 4/3 = 20/36}

Note: Fraction calculations can be done on the calculator if the fraction key (a b/c)
is available.
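
Where no fraction key is available, the rules governing fractions can be checked with Python's standard fractions module. The following is a minimal sketch with illustrative variable names; note that Python always reduces an answer to its lowest terms.

```python
from fractions import Fraction

same_denominator = Fraction(3, 5) + Fraction(1, 5)
different_denominators = Fraction(5, 8) + Fraction(1, 3)
product = Fraction(2, 5) * Fraction(3, 7)
times_whole_number = 3 * Fraction(1, 2)
quotient = Fraction(5, 12) / Fraction(3, 4)
reduced = Fraction(5, 10)

print(same_denominator)         # 4/5
print(different_denominators)   # 23/24
print(product)                  # 6/35
print(times_whole_number)       # 3/2
print(quotient)                 # 5/9, which is 20/36 in lowest terms
print(reduced)                  # 1/2
```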

Activity 12.12

1. Convert the following fractions to their decimal equivalents. Round your


answer to two decimal places.
23 3
​  13  ​ ​ 

4
8   ​  3​  19    ​ ​  11    ​
 
2. Convert the following decimals to fractions:
0.11 0.135 0.1567 2.1723 0.07
3. Susan earns an hourly wage of R25 per hour. If she works overtime, she gets 1
1
times her wage for the first five hours and 1​  3  ​after the first five hours of overtime.
a) What is her hourly wage for the first five hours of overtime?
b) What is her hourly wage if she works longer than five hours of overtime?
c) How much does she earn before tax if she works 13 hours overtime?
4. Give the following answers in fractions:
a) 1/3 + 3/4 =

b) 5/20 − 1/5 =

c) 2/6 × 3/9 =


d) 5/24 ÷ 1/6 =

12.12  Decimal numbers


Each digit in a number has a place and a value. The location of a digit within a
number is called its place. Place value is the value of the digit based on its location
within a number group. Our numerical (and monetary) system is based on the
number 10 (the decimal system) and therefore decimal fractions always have
a power of 10 in the denominator (e.g. 10, 100, 1 000).
• A fraction can be converted to a decimal if the numerator is divided by the
denominator.
• Each digit in a decimal has a value determined by its position in relation to the
‘decimal point’ in the number.
• Numbers to the left of the decimal point are whole numbers and numbers to
the right of the decimal point have values of tenths, hundredths, thousandths
etc of a whole number.

Example 12.4

The following example shows the place name of each digit:

Digit   Place name            Group
5       Millions              Millions (M)
4       Hundred thousands     Thousands
3       Ten thousands         Thousands
2       Thousands             Thousands
3       Hundreds              Ones
6       Tens                  Ones
4       Units                 Ones
.       Decimal point
3       Tenths                Decimals
1       Hundredths            Decimals
0       Thousandths           Decimals
8       Ten thousandths       Decimals
7       Hundred thousandths   Decimals
5       Millionths            Decimals

There are 5 millions (5 000 000), 4 hundred thousands (400 000), 3 ten
thousands (30 000), 2 thousands (2 000), 3 hundreds (300), 6 tens (60),
4 units (4), 3 tenths (0.3), 1 hundredth (0.01), 0 thousandths (0.000), 8
ten thousandths (0.0008), 7 hundred thousandths (0.00007), 5 millionths
(0.000005).


Activity 12.13

Complete the following:


832 means: 2 units, 3 tens and 8 hundreds
611 means:
1 093 549 means:
3.026 means:
522.034563 means:

Example 12.5

The number 13.435 is shown in the figure below:


1000s   100s   10s   Units   •   1/10   1/100   1/1000
0       0      1     3       •   4      3       5

The number is interpreted as:

0 thousands plus 0 hundreds plus 1 ten plus 3 units plus 4/10 plus 3/100 plus 5/1 000
= 13 + 435/1 000
or 13.435

Note: The two zeros in the example representing the first two cells are usually
not indicated as leading numbers. They have simply been put into the example
to indicate the positions that the different numbers take up in the value 13.435
from a decimal perspective.

12.13  Scientific notation


There is an accepted notation for writing very large and very small numbers.
This allows you to write down and compare big and small numbers more
conveniently.
• The form of the notation is a product of two factors.
• The one factor has only one digit to the left of the decimal point (or the number
lies between 1 and 10).
• The other factor is a power of 10.
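
As a hedged illustration of the three points above, the following Python sketch splits a number into its two factors and puts them back together. The function names `to_scientific` and `from_scientific` are assumptions made for this example, and the input is assumed to be non-zero.

```python
from decimal import Decimal


def to_scientific(number: str) -> tuple[Decimal, int]:
    """Split a non-zero number into (factor, exponent) with 1 <= |factor| < 10."""
    d = Decimal(number)
    exponent = d.adjusted()  # power of 10 of the leading digit
    factor = d.scaleb(-exponent).normalize()  # move the decimal point, drop trailing zeros
    return factor, exponent


def from_scientific(factor: str, exponent: int) -> Decimal:
    """Multiply the factor by 10 ** exponent to recover the decimal number."""
    return Decimal(factor).scaleb(exponent)


print(to_scientific("76591"))    # factor 7.6591, exponent 4
print(to_scientific("0.0238"))   # factor 2.38, exponent -2
print(from_scientific("3.5", 3) == Decimal("3500"))
```

A positive exponent moves the decimal point to the right and a negative exponent moves it to the left.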



Statistical Methods and Calculation Skills

Activity 12.14

Scientific notation:
1. 76 591 = $7.6591 \times 10^{4}$
2. 7 000 000 = $7 \times 10^{6}$
3. 0.0238 = $2.38 \times 10^{-2}$
4. 0.00006 =
5. 891 245 =
6. 6 million =

Activity 12.15

What are the decimal numbers?


1. $3.12 \times 10^{3}$ =
2. $24 \times 10^{-3}$ =
3. $1.2 \times 10^{-6}$ =
4. $4.789 \times 10^{5}$ =
5. $5 \times 10^{-2}$ =

12.14 Rounding off decimals


It is often desirable to round numbers to make them easier to understand and
use. Round numbers after completing arithmetic operations and not before, and
state clearly that the results have been rounded.
• Draw a line after the digit you want to round – for example, if you want to
round off the number to two decimal places, you draw a line after the second
decimal.
• If the digit to the right of the desired rounding digit has a value of 5 or more,
increase the rounding digit by 1 and replace succeeding digits with zeros if it
is a whole number, or disregard all numbers to the right if decimals.
• Should the value to the right of the desired rounding digit be less than 5, leave
the rounding digit as it is and disregard all numbers to the right of the cut-off
digit, or replace with zeros if it is a whole number.
• If the digit to the right of the desired rounding place is the last digit and exactly
5, increase the rounding digit by 1 if it is odd and leave it as is if it is even.
• Often a calculation results in more decimal places than are necessary. There
may not, however, be more decimals to the right of the decimal point than in
the values being processed; that is, the answer cannot be more accurate than the
least accurate value of the given data. To correct such an instance requires
‘rounding off’ to the appropriate number of decimal places, for example
1.2 × 0.54 = 0.648 ≈ 0.6 (the least accurate value in the original data is 1.2;
therefore the answer must be rounded to the nearest tenth).

Example 12.6

1. Round 169 to the nearest 10:


The desired rounding place is 6 (16|9). The digit to the right of 6 is more
than 5; therefore round up by increasing 6 by 1 and change all digits to the
right of 6 to 0. 169 will become 170.
2. Round 1 819 to the nearest 100:
The desired rounding place is 8 (1 8|19). The digit to the right of 8 is less
than 5; therefore leave 8 as it is and change all digits to the right of 8 to 0.
1 819 will become 1 800.
3. Round 33.215 to the nearest tenth:
The desired rounding place is 2 (33.2|15). The digit to the right is less than
5; therefore leave 2 as it is and disregard all the digits to the right of 2.
33.215 will become 33.2.
4. Round 5.129 to the nearest hundredth:
The desired rounding place is 2 (5.12|9). The digit to the right of 2 is more
than 5; therefore round up by increasing 2 by 1 and disregard all digits to the
right of it. 5.129 will become 5.13.
5. Round 17.5 to the nearest unit (whole number):
The desired rounding place is 7 and the digit to the right of 7 is the last digit
and exactly 5; therefore increase 7 by 1 because 7 is an odd number. 17.5
will become 18.
6. Round 19.985 to the nearest hundredth:
The desired rounding place is 8 and the digit to the right of 8 is the last digit
and exactly 5; therefore leave 8 as is because 8 is an even number. 19.985
will become 19.98.
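
The rounding rules in this section amount to rounding to the nearest value, with a final digit of exactly 5 resolved towards the even digit. As a sketch only (the helper name `round_to` is an assumption made for this illustration), Python's standard `decimal` module provides this behaviour as `ROUND_HALF_EVEN`, and the checks below mirror the six cases of Example 12.6.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to(number: str, place: str) -> Decimal:
    """Round `number` to the place named by `place`, e.g. "0.01", "1" or "1E2"."""
    return Decimal(number).quantize(Decimal(place), rounding=ROUND_HALF_EVEN)


assert round_to("169", "1E1") == Decimal("170")        # nearest 10
assert round_to("1819", "1E2") == Decimal("1800")      # nearest 100
assert round_to("33.215", "0.1") == Decimal("33.2")    # nearest tenth
assert round_to("5.129", "0.01") == Decimal("5.13")    # nearest hundredth
assert round_to("17.5", "1") == Decimal("18")          # exactly 5, 7 is odd: round up
assert round_to("19.985", "0.01") == Decimal("19.98")  # exactly 5, 8 is even: leave as is
```

The numbers are passed as strings because binary floating-point values such as 19.985 are not stored exactly, which could change the outcome of a tie.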


Activity 12.16

1. A company reports a profit figure last year of R1  078  245.67. Round this
figure to the:
a) nearest million
b) nearest thousand
c) nearest unit
d) nearest tenth
2. Round 539.345 to the nearest hundredth.
3. Round 4.2355 to the nearest thousandth.
4. Round 5.009 to the nearest hundredth.

12.15  Significant digits


Significant digits are the digits in a number that are accurate, counting from
left to right. Rounding can also be done to the number of significant figures
we require.
• Start with the first non-zero digit on the left (e.g. the ‘1’ in 1 300, or the
‘2’ in 0.0274).
• Keep the required number of digits and replace the rest with zeros.
• Round up by one if appropriate. For example, if rounding 0.059 to one
significant figure, the result would be 0.06.

General rules for determining the number of significant digits in a number


1. All non-zero numbers are significant. For example, the number 163.45 has
five significant figures: 1, 6, 3, 4 and 5.
2. All zeros between significant numbers are significant. For example, the
number 5  002 has four significant figures and 301.12 has five significant
figures.
3. A zero after the decimal point is significant when bounded by significant
figures to the left. For example, the number 3  002.0 has five significant
figures, 15.2300 has six significant figures, 0.00152300 has six significant
figures, 120.00 has five significant figures. If a result accurate to four
decimal places is given as 12.23, then it might be understood that only
two decimal places of accuracy are available. Stating the result as 12.2300
makes clear that it is accurate to four decimal places.
4. The significance of trailing zeros in a number not containing a decimal point
can be ambiguous. For example, it may not always be clear if a number like
1 300 is accurate to the nearest unit (and just happens coincidentally to be
an exact multiple of a hundred) or if it is only shown to the nearest hundred
due to rounding. One method to address this issue is to underline the last
significant figure of a number; for example, ‘80 000’ with the first zero
underlined has two significant figures.
5. Zeros to the left of a significant figure and not bounded to the left by another
significant figure are not significant. For example, the number 0.01 only has
one significant figure and 0.00012 has two significant figures.
6. A number with all zero digits (e.g. 0.000) has no significant digits.

Example 12.7

1. Round 742.396 to:


a) four significant digits: 742.4
b) three significant digits: 742
c) two significant digits: 740
2. Round 0.06284 to:
a) four significant digits: 0.06284
b) three significant digits: 0.0628
c) two significant digits: 0.063
3. Round 351.45 to:
a) four significant digits: 351.4
b) three significant digits: 351
c) two significant digits: 350
4. Round to two significant figures:
a) 13 300 becomes 13 000
b) 14 stays 14
c) 0.00123 becomes 0.0012
d) 0.4 becomes 0.40 (trailing zero indicates rounding to two significant
figures)
e) 0.01084 becomes 0.011
f ) 0.0325 becomes 0.032
g) 19 800 becomes 20 000 (see the notes about trailing zeros)
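
Rounding to a number of significant digits can be sketched in the same way. The function below is an illustration under stated assumptions, not the book's own procedure: the name `round_sig` is hypothetical, and ties are resolved towards the even digit, as in the rounding rules of the previous section.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_sig(number: str, digits: int) -> Decimal:
    """Round `number` to `digits` significant digits."""
    d = Decimal(number)
    if d == 0:
        return d  # a number with all zero digits has no significant digits
    # adjusted() is the power of 10 of the first non-zero digit.
    exponent = d.adjusted() - digits + 1
    return d.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_EVEN)


assert round_sig("742.396", 4) == Decimal("742.4")
assert round_sig("0.06284", 2) == Decimal("0.063")
assert round_sig("351.45", 4) == Decimal("351.4")  # exactly 5 after an even digit
assert round_sig("19800", 2) == Decimal("20000")
assert str(round_sig("0.4", 2)) == "0.40"          # trailing zero is kept
```

For whole numbers such as 20 000 the result still carries the ambiguity about trailing zeros described in rule 4.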


Activity 12.17

Your expense budget for the year amounts to R125 784.66.


The exact budget figure contains eight significant digits. Round this number
to:
1. one significant digit
2. two significant digits
3. three significant digits
4. six significant digits
5. seven significant digits

Round to the appropriate number of significant digits when you add or subtract
Add (or subtract) the numbers as usual, then round the answer to the same
decimal place as the least-accurate number.

Example 12.8

1. 13.214 + 234.6 + 7.0350 + 6.38 = 261.2290
The second number, 234.6, is only accurate to the tenths place, so the
answer will have to be rounded to the tenths place:
13.214 + 234.6 + 7.0350 + 6.38 = 261.2

2. 1 247 + 134.5 + 450 + 78 = 1 909.5
450 is only accurate to the tens place; therefore round the final answer to
the nearest ten:
1 247 + 134.5 + 450 + 78 = 1 910

Activity 12.18

Calculate the following:


1. 9.812 − 0.13358 + 0.123 =
2. 1.111 − 0.234 + 0.001 =


Round to the appropriate number of significant digits when you multiply (or divide)
Multiply (or divide) the numbers as usual, then round the answer to the same
number of significant digits as the least-accurate number.

Example 12.9

1. If we multiply 3.3 (rounded to two significant digits) by 3.55 (rounded to
three significant digits), the answer is 11.715. This answer appears to have
five significant digits; however, the result cannot be more accurate than the
lowest significant level of the two numbers, which is two. The result should
be 12.
2. 0.00435 × 4.6 = 0.02001
4.6 has only two significant digits, so round 0.02001 to two significant
digits.
0.00435 × 4.6 = 0.020
(The answer is not 0.02 because this has only one significant digit (the ‘2’).
The trailing zero indicates that the answer is accurate to the thousandths place
and is therefore a necessary part of the answer.)

Activity 12.19

Calculate the following:


1. 16.235 × 0.217 × 5 =
2. 0.235 × 0.0070 × 1.333 =

Note: For adding, use ‘least accurate place’.
For multiplying, use ‘least significant digits’.
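
The two rules in the note can be sketched in Python as follows. This is an illustration only: the function names are assumptions, every digit written in the input string is treated as significant, and whole numbers with trailing zeros (such as 450) are therefore not handled in the way rule 4 on significant digits would require.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def decimal_places(number: str) -> int:
    """Number of digits written after the decimal point."""
    return max(0, -Decimal(number).as_tuple().exponent)


def significant_digits(number: str) -> int:
    """Number of digits from the first non-zero digit to the last written digit."""
    return len(Decimal(number).as_tuple().digits)


def add_rounded(numbers: list[str]) -> Decimal:
    """Add, then round to the least accurate decimal place."""
    total = sum(Decimal(n) for n in numbers)
    places = min(decimal_places(n) for n in numbers)
    return total.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_EVEN)


def multiply_rounded(numbers: list[str]) -> Decimal:
    """Multiply, then round to the fewest significant digits."""
    product = Decimal(1)
    for n in numbers:
        product *= Decimal(n)
    digits = min(significant_digits(n) for n in numbers)
    exponent = product.adjusted() - digits + 1
    return product.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_EVEN)


assert add_rounded(["13.214", "234.6", "7.0350", "6.38"]) == Decimal("261.2")
assert multiply_rounded(["3.3", "3.55"]) == Decimal("12")
assert str(multiply_rounded(["0.00435", "4.6"])) == "0.020"
```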

12.16 The metric system


This is the most widely used system of weights and measures in the world; a
standardised system is necessary for computational purposes in international
trade.
The basic units of measurement in the metric system are:
• length, which is measured in metres
• mass, measured in grams
• volume, measured in litres.
The metric system is decimal-orientated and deals with powers of 10. Thus,
multiplying or dividing by 10 gives the next higher or lower unit.

Prefix    Multiple of the base unit (1 metre, 1 litre or 1 gram)
milli     0.001
centi     0.01
deci      0.1
(base)    1
deca      10
hecto     100
kilo      1 000
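
Because each prefix is a power of 10, converting between units only shifts the decimal point. The following Python sketch illustrates this; the `PREFIXES` table, the key `"base"` for the unprefixed unit and the function name `convert` are assumptions made for this example.

```python
from decimal import Decimal

# Power of 10 for each prefix relative to the base unit (metre, litre or gram).
PREFIXES = {"kilo": 3, "hecto": 2, "deca": 1, "base": 0, "deci": -1, "centi": -2, "milli": -3}


def convert(value: str, from_prefix: str, to_prefix: str) -> Decimal:
    """Convert a value between two metric prefixes of the same base unit."""
    shift = PREFIXES[from_prefix] - PREFIXES[to_prefix]
    return Decimal(value).scaleb(shift)  # move the decimal point by `shift` places


assert convert("2.5", "kilo", "base") == Decimal("2500")   # 2.5 km = 2 500 m
assert convert("250", "centi", "base") == Decimal("2.5")   # 250 cm = 2.5 m
assert convert("3", "deca", "deci") == Decimal("300")      # 3 decalitres = 300 decilitres
```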

Activity 12.20

Fill in the missing numbers:


kilo | hecto | deca | metre/litre/gram | deci | centi | milli
1 000 | 100 | 10 | 1 | 0.1 | 0.01 | 0.001
6 335
1 250
50
150
1 025
96



UNIT 13 Percentages and ratios

This unit deals with applying the concept of percentage and ratio calculations
in business.

After completion of this unit you will be able to:


• convert percentages to fractions and decimals
• deal with different types of percentage problems
• identify the base
• understand the concept of ratios
• apply percentages and ratios in business.

‘Percentage’ is derived from the Latin per centum meaning ‘per hundred’ and uses
the symbol %. It is a universal basis for comparison whereby a value is expressed
as a portion of 100. The basis of comparison is therefore always 100, and the
result is indicated as a percentage by adding the percentage sign ‘%’ after it.

13.1  Percentage calculations


Percentages are used widely in business to determine discounts, taxes, interest
and numerous comparisons. To use a percentage in an arithmetic application it
must first be changed to a decimal or a fraction.

Commonly used terms


• Base: the value upon which the percentage is taken.
• Rate (%): the percentage that is taken of the base.

13.1.1 Converting percentages to fractions and decimals


To change a percentage into a fraction or a decimal, divide the number
expressing the percentage by 100 and drop the % sign. The percentage becomes
the numerator and 100 the denominator. The fraction can be converted to a
decimal by dividing the numerator by the denominator.


Example 13.1

Express 75% as a fraction and a decimal:

$75\% = \frac{75}{100} = 0.75$

Activity 13.1

Find the fraction and decimal equivalents of:


a) 44%
b) 83%
c) 126%

13.1.2 Converting a fraction or decimal into a percentage


To change a fraction or decimal into a percentage, multiply by 100 and add the
% sign.

Example 13.2
1. Express $\frac{6}{20}$ as a percentage:
$\frac{6}{20} \times 100 = 30\%$

2. Express 0.14 as a percentage:
$0.14 \times 100 = 14\%$
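
The two conversions can be sketched with Python's standard `fractions` module, which keeps the arithmetic exact. The function names `percent_to_fraction` and `to_percent` are assumptions made for this illustration.

```python
from fractions import Fraction


def percent_to_fraction(percent: str) -> Fraction:
    """Divide the percentage by 100 and drop the % sign."""
    return Fraction(percent) / 100


def to_percent(value) -> Fraction:
    """Multiply a fraction or decimal by 100 to obtain the percentage."""
    return Fraction(value) * 100


assert percent_to_fraction("75") == Fraction(3, 4)   # 75/100 in lowest terms
assert float(percent_to_fraction("75")) == 0.75
assert to_percent(Fraction(6, 20)) == 30
assert to_percent("0.14") == 14
```

`Fraction` reduces 75/100 to 3/4 automatically; the value is the same.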

Activity 13.2
1. Express $\frac{1}{20}$ as a percentage.
2. Express 0.033 as a percentage.


13.1.3 Finding the percentage amount


To find a percentage amount of a value if both the base and the rate are known
you must first convert the rate to a fraction or a decimal number and then
multiply by the value, which is the base.
$\text{percentage amount} = \text{base} \times \frac{\text{rate}}{100}$

Example 13.3

Calculate 5% of R200:
$\frac{5}{100} \times \text{R}200 = \text{R}10$

Activity 13.3

1. Calculate 14% of 430.


2. Calculate 5% of 684.

13.1.4 Finding the rate


To determine the percentage rate when both the base and the percentage amount
are known, you must first construct a fraction using the percentage amount as
numerator and the base as the denominator, and then multiply by 100.
$\text{rate (\%)} = \frac{\text{percentage amount}}{\text{base}} \times 100$

Example 13.4

What % is 8 of 18?
$\frac{8}{18} \times 100 = 44.44\%$

Activity 13.4

1. What percentage is 9 of 18?


2. What percentage is 3 of 67?


13.1.5 Finding the base


If the rate and the percentage amount are known, the base can be determined by
dividing the percentage amount by the rate and multiplying by 100.
$\text{base} = \frac{\text{percentage amount}}{\text{rate}} \times 100$

Example 13.5

If 14% of a value equals 90, what is the value?

$\frac{90}{14} \times 100 = 642.86$
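
The three relationships between base, rate and percentage amount can be sketched as small Python functions. The names `find_amount`, `find_rate` and `find_base` are assumptions made for this illustration; the checks reuse the figures from Examples 13.3 to 13.5.

```python
def find_amount(base: float, rate: float) -> float:
    """Percentage amount = base x rate / 100."""
    return base * rate / 100


def find_rate(amount: float, base: float) -> float:
    """Rate (%) = percentage amount / base x 100."""
    return amount / base * 100


def find_base(amount: float, rate: float) -> float:
    """Base = percentage amount / rate x 100."""
    return amount / rate * 100


assert find_amount(200, 5) == 10                 # 5% of R200
assert round(find_rate(8, 18), 2) == 44.44       # 8 as a percentage of 18
assert round(find_base(90, 14), 2) == 642.86     # 14% of the value equals 90
```

Each function is a rearrangement of the same equation, so knowing any two of the three quantities is enough to find the third.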

Activity 13.5

If you receive R4 600 simple interest on an investment earning 9%, how much
did you invest?

13.1.6 Percentage rate increase or decrease


Percentage rate increase or decrease is the difference between the values, divided
by the base value and multiplied by 100.
• $\text{percentage increase} = \frac{\text{increase}}{\text{base value}} \times 100$
• $\text{percentage decrease} = \frac{\text{decrease}}{\text{base value}} \times 100$

Example 13.6

The daily sales have increased from R5 000 per day to R5 500 per day – that is a
difference of R500. The percentage rate increase is:

$\frac{500}{5\,000} \times 100 = 10\%$
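
A single function can cover both cases if a decrease is reported as a negative change. This is a sketch under that assumption; the function name `percentage_change` is hypothetical, and the second check uses made-up figures (sales falling back from R5 500 to R5 000).

```python
def percentage_change(base_value: float, new_value: float) -> float:
    """Difference between the values, divided by the base value, times 100."""
    return (new_value - base_value) * 100 / base_value


assert percentage_change(5000, 5500) == 10.0              # increase of 10%
assert round(percentage_change(5500, 5000), 2) == -9.09   # decrease of about 9.09%
```

The two results differ in size because the base value differs: the same R500 is a smaller share of R5 500 than of R5 000.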


Activity 13.6

When the tenants decided to vacate the premises they expected to receive their
entire deposit of R2  500, but instead received only R2  000. What percent of
their deposit was kept by the landlord?

13.1.7 Finding the amount


Some problems require the percentage amount to be added to or subtracted from
the base to find the amount.

Example 13.7

1. If you earn R500 a week and you have 15% payroll deductions, how much
do you take home per week?
Your deductions are: $\frac{15}{100} \times 500 = \text{R}75$
You take home: R500 − R75 = R425

Alternatively, you can say if the deductions are 15%, you take home 85%
of R500.

$\frac{85}{100} \times 500 = \text{R}425$

2. This month’s sales exceeded those of last month by 12%. If last month’s
sales were R26 521, calculate this month’s sales.
Percentage amount with last month’s sales as base
$= \frac{12}{100} \times 26\,521 = \text{R}3\,182.52$

This month’s amount = R26 521 + R3 182.52 = R29 703.52

Alternatively we can say that if last month’s sales are the base (100%), then
this month’s sales would be 112% (that is 100% + 12%).

This month’s amount $= \frac{112}{100} \times 26\,521 = \text{R}29\,703.52$
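
The alternative method in both parts of the example (85% of the base, 112% of the base) can be sketched as one Python function in which a deduction is entered as a negative rate. The name `amount_after` is an assumption made for this illustration.

```python
def amount_after(base: float, rate: float) -> float:
    """Base plus (or, for a negative rate, minus) `rate` percent of the base."""
    return base * (100 + rate) / 100


assert amount_after(500, -15) == 425.0                   # 85% of R500
assert round(amount_after(26521, 12), 2) == 29703.52     # 112% of R26 521
```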


Activity 13.7

1. The total mass of a packaged article is 25% greater than its nett mass of
6 kg.
Determine the total packaged mass.
2. If an article weighs 9  kg after it is packaged, and the increase in mass is
20%, how much did the article weigh before packaging?

13.2 Ratio (proportion) calculation


A fraction and a ratio are different ways of expressing the same relationship.
A ratio is a comparison of two quantities expressed in the same measurement
units and may be written as follows: a to b or a : b or $\frac{a}{b}$.
Ratios are also used to express rates such as 120 km/h, 2.3 children/family,
14 km/litre of petrol.
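
As a hedged sketch of how ratios are used in calculations, the Python functions below divide a total in a given ratio and scale a rate to a new quantity. The function names are assumptions made for this illustration; the figures are those of the paint drums (2 : 5, with 70 litres in total) and the failure rate (7 per 200) in Example 13.8.

```python
from fractions import Fraction


def split_in_ratio(total, parts: list[int]) -> list[Fraction]:
    """Divide `total` into shares proportional to `parts`."""
    whole = sum(parts)
    return [Fraction(total) * part / whole for part in parts]


def scale_rate(count: int, per: int, quantity: int) -> Fraction:
    """Apply a rate of `count` per `per` to a new `quantity`."""
    return Fraction(count, per) * quantity


assert split_in_ratio(70, [2, 5]) == [20, 50]   # 70 litres in the ratio 2 : 5
assert scale_rate(7, 200, 800) == 28            # 7 failures per 200, applied to 800 students
```

Using `Fraction` keeps shares exact even when the total does not divide evenly.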

Example 13.8

1. In comparing two drums of paint, one with a volume of 20 litres and the
other a volume of 50 litres, we can say that the ratio between the two drums
is 20 : 50 or 2 : 5. That means that the small drum has $\frac{2}{5}$ the volume of the
big drum, which is also $\frac{2}{5} \times 100 = 40\%$. Alternatively we can say that the
volume of the big drum is $\frac{5}{2} = 2.5$ times that of the small drum, or
$2.5 \times 100 = 250\%$ of it.
2. If examinations normally result in a failure rate of 7 per 200 students,
the number of failures that can be expected if 800 students write the
examination is:
$\frac{7}{200} \times 800 = 28$ students

Activity 13.8

1. Three friends decided to contribute R20, R10 and R5 respectively to buy
tickets from the national lottery. Their agreed division of winnings will be
in the same ratio as their contributions. If their winnings amount to R500,
what sum of money will each one receive?
2. A golf player receives 10% of his income from sponsorships, three times as
much from training, and the rest from tournaments. If his total income for
the year is R340 000, how much did he get from each source?

13.3 Business applications
Although percentages have many applications in many disciplines, in the
manufacturing, retail and wholesale environment they are usually applied to
pricing of goods or services, to determine final prices after adding profit margins
or allowing for discounts. Percentages can also be used in stock control levels.
The cost price is the price a wholesaler or retailer paid for a product or service
excluding the VAT, or the cost to the manufacturer of manufacturing the product
from scratch.
The selling price is the price for which a product or service is sold.

13.3.1 Mark-up on cost price


The manufacturer sells his product to a retailer, wholesaler or final consumer
and will want to do so at a profit. A ‘mark-up’ margin must therefore be added to
the product to determine the selling price of the item.
The mark-up margin is usually expressed as a percentage to be added to the
cost price.
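
Adding a mark-up, and later VAT, is the same operation applied with different rates. The Python sketch below illustrates this with the desk prices used in Example 13.9; the function name `add_percentage` and the choice to round to the nearest cent with `ROUND_HALF_UP` are assumptions, not rules stated in the text.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def add_percentage(amount: Decimal, rate: int) -> Decimal:
    """Return the amount plus `rate` percent of it, rounded to the nearest cent."""
    return (amount * (100 + rate) / 100).quantize(CENT, rounding=ROUND_HALF_UP)


cost = Decimal("120.00")
manufacturer_price = add_percentage(cost, 40)               # 40% mark-up on cost
retailer_price = add_percentage(manufacturer_price, 20)     # 20% mark-up by the reseller
consumer_price = add_percentage(retailer_price, 14)         # 14% VAT on the final price

assert manufacturer_price == Decimal("168.00")
assert retailer_price == Decimal("201.60")
assert consumer_price == Decimal("229.82")
```

Each mark-up is calculated on the seller's own cost price, which is why the rates cannot simply be added together.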

Example 13.9

ABS Manufacturers produce classroom desks used in schools. The cost price of a
desk is R120.00 and ABS adds a mark-up of 40% to this price to determine the
selling price per unit.
The selling price per unit is therefore: $\text{R}120.00 + \left(\frac{40}{100} \times 120\right) = \text{R}168.00$
ABS will now sell these desks to Tablecor (Pty) Ltd at R168.00 + VAT per desk.
A desk will cost Tablecor $\text{R}168.00 + \left(\frac{14}{100} \times 168\right) = \text{R}191.52$
Assume that Tablecor decides to add its own profit or mark-up. Tablecor now
sells the desks to the Education Department for use in schools. The cost price of a
desk for Tablecor is R168.00 (because they reclaim the VAT that they have paid
as input VAT). Tablecor uses a mark-up percentage of 20%.
The price at which Tablecor will then sell each desk is:
$\text{R}168.00 + \left(\frac{20}{100} \times 168\right) = \text{R}201.60$, excluding VAT.
Adding VAT to the selling price means that the final consumer will pay:
$\text{R}201.60 + \left(\frac{14}{100} \times 201.60\right) = \text{R}229.82$ per desk.


Activity 13.9

Calculate the following without taking VAT into account:


1. An article costs R32. It is sold at a profit of 15%. Find the selling price.
2. By selling an article for R63.50, a profit of 25% is made. What is the cost
price?
3. An article costs R25 and is sold at a profit of R3. What is the % profit?
4. An article costs R250 and is sold for R300. What is the % profit?

13.3.2 Mark-downs and discounts


While a mark-up means adding to a base price, mark-down means reducing a
base price. A mark-down differs from a discount in the sense that a discount
is a reduction in selling price because of method of payment (cash) or due to
volumes purchased or early settlement of debtor accounts, while a mark-down
is a reduction in a current selling price to a new and lower selling price.
Mark-downs are used when vendors believe that their selling prices are too
high for customers. To attract more customers and increase sales, vendors will
then reduce or ‘mark down’ the prices of their goods or services.

Example 13.10

A carpenter is eligible for a 15% trade discount on all purchases from a wholesaler.
If R2 400 was the total list price of goods purchased, how much did the goods
cost the carpenter?
$\text{R}2\,400 - \left(\text{R}2\,400 \times \frac{15}{100}\right) = \text{R}2\,040$

Activity 13.10

1. A tracksuit is advertised at R177.99, reduced from R199.99. What is the
mark-down percentage?
2. A department store is selling a suitcase for R153, after a 10% discount.
What was the original selling price?


13.3.3 Value-added tax calculations


The obligation by qualifying vendors to pay value-added tax (VAT) is regulated
by the Value-Added Tax Act No 89 of 1991, as amended. This Act stipulates that
a tax, known as value-added tax, shall be levied:
• on the supply by any vendor of goods or services in the course of the
furtherance of his or her business
• on the importation of any goods into the Republic of South Africa by any
person
• on the supply of any imported services by any person.
The tax rate is currently set at 14% on the value of the goods or services as
applicable.
A vendor is any person (natural or legal) who is required to register as a vendor
under the Act.

Example 13.11

ABC Stores sells 20 cases of soft drinks to a customer. ABC applies the VAT
excluded method to charge VAT. That means that VAT is not part of the marked
price, but is added on at the end of the invoice. The price per case is R120.00 and
there are no discounts.
1. How much does the purchase cost excluding VAT?
20 cases × R120.00 per case = R2 400.00
2. How much VAT is charged on this purchase?
R2 400 purchase amount × 14% VAT = $\text{R}2\,400 \times \frac{14}{100} = \text{R}336.00$
3. How much must the customer pay ABC Stores?
R2 400 purchase amount + R336 VAT = R2 736.00

Example 13.12

Smart Stores advertises garden furniture sets at R3 200.00 per set (VAT inclusive).
1. What is the price of a set exclusive of VAT?
The price quoted includes VAT and therefore equates to 100% + 14% = 114%
Price without VAT $= \frac{3\,200}{114} \times 100 = \text{R}2\,807.02$

2. How much VAT is charged per set?
VAT charged per set = R3 200 − R2 807.02 = R392.98


Example 13.13

Sarah purchases a hair dryer from Nice Appliances for R110.00 excluding VAT.
She negotiates a 10% discount for cash payment.
1. What does she pay for the hair dryer before VAT?
100% less 10% discount = 90%
$110 \times \frac{90}{100} = \text{R}99.00$
2. What is the discount amount?
R110.00 − R99.00 = R11.00
3. How much VAT does she pay?
$\text{R}99.00 \times \frac{14}{100} = \text{R}13.86$
4. What is the total invoice amount?
R99.00 + R13.86 = R112.86
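
The VAT calculations in Examples 13.11 to 13.13 can be sketched in Python as follows. The 14% rate is the one stated in this section; the function names and the rounding to the nearest cent with `ROUND_HALF_UP` are assumptions made for this illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

VAT_RATE = Decimal("14")  # percentage rate given in this section
CENT = Decimal("0.01")


def vat_on(exclusive: Decimal) -> Decimal:
    """VAT to be added to a VAT-exclusive price."""
    return (exclusive * VAT_RATE / 100).quantize(CENT, rounding=ROUND_HALF_UP)


def exclusive_from_inclusive(inclusive: Decimal) -> Decimal:
    """Price before VAT when the quoted price is 114% of it."""
    return (inclusive * 100 / (100 + VAT_RATE)).quantize(CENT, rounding=ROUND_HALF_UP)


# Example 13.11: VAT added to R2 400
assert vat_on(Decimal("2400")) == Decimal("336.00")

# Example 13.12: VAT-inclusive price of R3 200
price = exclusive_from_inclusive(Decimal("3200"))
assert price == Decimal("2807.02")
assert Decimal("3200") - price == Decimal("392.98")

# Example 13.13: 10% discount first, then VAT on the discounted price
discounted = Decimal("110.00") * 90 / 100
assert discounted + vat_on(discounted) == Decimal("112.86")
```

Note that the VAT in an inclusive price is 14/114 of that price, not 14% of it.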

Activity 13.11

Spaza Store receives a delivery of 12 cases of washing powder from Hygiene


Distributors. The tax invoice indicates a total amount of R2 736.00 (inclusive
of VAT) to be paid to Hygiene Distributors. Upon checking the consignment,
the owner of Spaza Store finds that three cases have become wet and therefore
damaged. He returns the three cases to Hygiene and requests a credit for them.
1. How much does Spaza pay per case?
2. How much VAT was charged by Hygiene on the original invoice?
3. How much credit must Hygiene grant Spaza (VAT excluded)?
4. How much VAT is involved in the credit amount?

TEST YOURSELF 13

1. At the end of 2001 there were 101 stores open in South Africa – 39 in
Gauteng, 33 in Cape Town, 19 in Natal and 10 in the Free State. Find the
percentage of each in relation to the total.
2. Each section in a department store is given a target for the year, with Jack’s
section targeted for an increase of 25% over last year’s results. If last year’s
sales were R1.5 million, what was Jack’s targeted sum?
3. A salesman’s commission makes up 13% of his total weekly income. If his
commission is R948 for a particular week, what is his total income?

4. In the past 10 years, employment in a company has fallen by 780 to 3 240.
What is the percentage decline in employment over the decade?
5. Your commission of R475 for the week is equivalent to what rate of
commission if the commission is calculated on weekly sales of R5 000?
6. With a trade discount of 30%, an electrician paid R83.30 for materials.
What would the material have cost without the trade discount?
7. Mr Hammer, a carpenter, receives a trade discount of 20% and a cash discount
of 3% at the local hardware store. The list price for materials he purchased
totalled R4 532.50. Find the invoice price and the actual amount paid cash
for the material. What percentage of the list price did Mr Hammer save?
8. Giftware often carries a mark-up cost of 50%. If the cost of a vase to a
retailer is R84, what will the vase retail for?
9. A pair of shoes priced at R793 has been marked up on cost by 30%. What
was the cost price to the retailer?
10. The cost to the retailer of a tennis racquet that sold for R165 was R110.
Find the profit as a percentage based on cost, and the profit percentage based
on the list price.
11. The cost of a garment to a boutique was R400. If a profit margin of 17%
was made on the retail price, find the retail price. If a loss of 2.5% was made
on the retail price, what was the retail price?
12. All candles in a particular gift shop are priced at R14.25 after a mark-up on
cost of 25%. What was the cost of the candle to the proprietor?
13. A dealer marks all baby goods 25% above cost. What percentage discount
can he allow during a sale to ensure a profit of at least 10%?
14. Goods are bought for R30. At what price must they be marked in order to
yield 10% profit after a trade discount of 10% has been allowed?
15. A suit is marked at R999.99. A trader allows 2.5% discount and still makes
20% profit. What did the suit cost the trader?
16. You purchase a hi-fi set and a DVD player for R3 400 and R2 800 respectively.
How much VAT in total will you have to pay on these transactions?
17. Sipho buys a television set from Grand Bazaars for R4 600. Because he pays
cash for it, he is given a discount of 15% on the purchase.
a) How much does Sipho pay for the TV before VAT is charged?
b) How much discount did Sipho get?
c) How much VAT is charged on this transaction?
d) What is the final amount that Sipho pays for his TV set?
18. Smart City Appliances receives a delivery of 25 refrigerators from Freezer
Manufacturers together with an invoice for R102 600. Upon inspection
Smart City returns four refrigerators because they are damaged and requests
a credit for the returned goods. Freezer has also granted Smart City 10% trade
discount on the order. The discount has already been included in the invoice.
a) How much did a refrigerator originally cost?
b) How much discount did Smart City get per refrigerator?
c) How much VAT is included in the original invoice?
d) How much credit must Smart City get, excluding VAT?
e) By how much must Freezer adjust the VAT charged?
f) How much will Smart City now have to pay Freezer Manufacturers?
19. Smart Stores sells fashion clothing directly to the public. The prices of all
items are inclusive of VAT. The price reflected on the price tag of a garment
is the price to be paid. Agnes purchases three dresses at R160.00 each and
a track suit for R210.00.
a) How much does she have to pay Smart Stores for the purchases?
b) What were the prices of the dresses and track suit before Smart Stores
added the VAT?
c) How much VAT did she pay on the whole transaction?
d) How much will Agnes have to pay if Smart Stores grants her 10%
discount on the dresses and 8% discount on the track suit?
e) How much VAT will Smart Stores add to the transaction if they grant the
discounts above?
20. Eight slabs of chocolate cost R32. Find the cost of three slabs of chocolate.
21. John takes 30 minutes to walk from his home to school at a speed of 4 km/h.
How long will he take if he cycles at 10 km/h?
22. A lecturer takes three hours to mark the books of all the students in her
class. How long will it take three lecturers to mark the same books if they
work at the same pace?
23. It takes three markers 120 hours to mark students’ examination scripts.
Assuming they all work at the same pace, calculate how long it will take if
there are:
a) 6 markers
b) 10 markers
c) 20 markers.
24. If I travel at 50 km/h, I can do a journey in 6 hours. How long will the
same journey take at 40 km/h?
25. A farmer buys enough chicken feed to last 200 chickens for a week. How
long will the same amount of feed last for 350 chickens? (Each chicken eats
the same amount each day.)



UNIT 14 Equations and graph construction

In this unit we look at solving equations and ways to make this easier.

After completion of this unit you will be:


• familiar with the concept of graph construction
• familiar with business applications using linear equations.
One of the most important concepts of mathematics concerns the relationship
between the elements of sets. In this unit we are mainly concerned with
relationships between two sets of numbers such as numbers representing supply
and demand, age and value of machinery, unit cost and the number of units
produced and so on.

14.1 Graph construction
A graph shows a picture of the trend or relationship between two variables (x
and y), that is, how one quantity changes with respect to another.
The type of graph to be drawn depends on the type of data, the complexity
of the data and the requirements of the user. In this text we deal with the linear
graph only.
Two variable functions are graphed on a set of rectangular coordinate axes.
The plane formed by the coordinate axes is called the Cartesian plane. In order
to set up a Cartesian graph the following steps must be followed:
• Two lines, known as coordinate axes, are drawn at right angles dividing the
plane into four quadrants. The point where the two lines cross is known as the
origin (0). The horizontal line is known as the x axis and the vertical line as
the y axis.
• Indicate units of length or a scale on the two axes (not necessarily the same
for each one). To select a scale determine the maximum and minimum
numbers you will use for each variable and subdivide the axis in multiples
of, for example, 1, 2, 3, 5, 10 and 100 as necessary to accommodate the
maximum and minimum number of each variable. To the right of the y axis,
x is positive and to the left it is negative. Above the x axis, y is positive and
below it is negative.
• All values along the x axis are known as abscissas and are plotted below the
x axis.
• All the values along the y axis are known as ordinates and are shown to the
left of the vertical axis.
• The graph should have a title and both the axes should be labelled.
• Any point in the Cartesian plane is defined by an ordered pair of coordinates
(x, y), with the value of x always given first.
• A mathematical function assigns one value of y to each value of x within its
equation and by arbitrarily selecting values for x a corresponding value for y
can be computed with a resulting set of ordered paired coordinates (x, y).
• Each of these pairs of coordinates corresponds to a point on the Cartesian
plane and if we plot all the points, we obtain the graph of the function.

For example, the coordinate point (1, 4) is exactly 1 unit to the right of the 0
along the horizontal line and 4 units above the 0 on the vertical line.
[Figure: Cartesian plane with the x axis marked from -3 to 3 and the y axis marked from -3 to 7, showing the plotted points (1, 4), (2, 5) and (3, 7).]

Note: The Cartesian plane has four quadrants. While all are mathematically
important, we find that in business the most important of the four is the top
right quadrant, where both x and y have positive values.
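
The idea of selecting values for x and computing the corresponding values of y can be sketched in a few lines of Python. This is only an illustration: the function name `linear_points` and the line y = 3 + x are assumptions chosen for the example and do not come from the text.

```python
def linear_points(a, b, xs):
    """Return the ordered pairs (x, y) for the line y = a + b*x."""
    return [(x, a + b * x) for x in xs]


# Hypothetical line y = 3 + x, evaluated at arbitrarily chosen x values.
pairs = linear_points(a=3, b=1, xs=[-3, -2, -1, 0, 1, 2, 3])

for x, y in pairs:
    # Each pair is one point on the Cartesian plane, with x given first.
    print(f"({x}, {y})")
```

Plotting every pair in `pairs` and joining the points would give the graph of this particular line; the pair (1, 4) is the coordinate point used in the example above.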


  Equations and graph construction

14.2  Solution of equations


An equation is a shorthand way of stating that two algebraic expressions are
equal. It has a left-hand side that is equal to a right-hand side. To solve an equation
the value of the unknown (variable) must be calculated. If it is substituted into
the equation, the value on the left must equal the value on the right.

14.2.1 Linear equations


Any mathematical function that appears as a straight line when plotted in the
coordinate plane is a linear function. One of the most commonly used equations
in business is:
y = a + bx

Where:
a = y intercept, that is, the point where the line cuts the y axis
b = slope or gradient.

The slope can be measured between any two points on the line; it is always the
same and can be defined as:
$$b = \frac{\text{increase in } y}{\text{increase in } x}$$

You can interpret it as follows: it is the number of units the line rises or falls
vertically (y axis) for each unit of horizontal (x) change from left to right.
When the slope (b) is positive, the line has an increasing trend and when b is
negative, the line has a decreasing trend. In business the slope is seen as the ratio
of change in y to the change in x or the marginal value.
Some examples of linear functions to measure profitability in business are:
• the linear cost function
• the linear income function
• the linear profit function.
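
The claim that the slope is the same between any two points on a line can be checked numerically. The sketch below is an illustration only: the helper `slope` and the line y = 2 + 1.5x are hypothetical and are not taken from the text.

```python
def slope(p, q):
    """Slope of the line through p = (x1, y1) and q = (x2, y2)."""
    (x1, y1), (x2, y2) = p, q
    if x2 == x1:
        raise ValueError("vertical line: the slope is undefined")
    return (y2 - y1) / (x2 - x1)


# Four points that lie on the hypothetical line y = 2 + 1.5x.
points = [(0, 2.0), (1, 3.5), (2, 5.0), (4, 8.0)]

# Measure the slope from the first point to each of the others.
slopes = [slope(points[0], p) for p in points[1:]]
print(slopes)
```

Every entry in `slopes` equals b = 1.5, whichever pair of points is used, and the positive sign indicates an increasing trend.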
14.2.2 Linear cost function: C(x)
Organisations are concerned about costs because they reflect money flowing out
of the business. These costs are usually to pay for salaries, raw materials, rent,
municipal charges and so forth.
Cost is defined in terms of two components: total variable cost and total fixed
cost. These two components must be added to obtain the total cost. Variable
costs vary with the level of output. The linear cost function is:
C = F + Vx

Where:
C = total cost
V = variable cost per unit
F = fixed cost per period
x = number of units produced (level of output)

Example 14.1

A company which produces a single product wants to determine the function
that expresses total cost (C) as a function of the number of units produced (x).
The fixed expenditure each year is R50 000. The estimated raw material and
labour cost for each unit produced is R5.50. What will the total cost be to
produce 120 items?
C(x) = 50 000 + 5.50x
C(x = 120) = 50 000 + 5.50(120)
= R50 660

The y intercept tells us that the cost of producing zero units is R50 000. This
is the fixed cost. The slope tells us that for each unit that the line moves to the
right, the cost increases by R5.50. Therefore, the cost of producing one extra
unit each time is R5.50 and this is then the marginal cost of the product.
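
As a quick check of Example 14.1, the cost function can be written as a small Python function. The name `total_cost` and its parameter names are assumptions made for this sketch.

```python
def total_cost(fixed, variable_per_unit, units):
    """Linear cost function C = F + V*x."""
    return fixed + variable_per_unit * units


# Figures from Example 14.1: F = R50 000, V = R5.50, x = 120.
cost_120 = total_cost(50_000, 5.50, 120)

# Marginal cost: the extra cost of producing one more unit.
marginal = total_cost(50_000, 5.50, 121) - cost_120

print(cost_120, marginal)
```

The first value reproduces the total cost of R50 660 and the second reproduces the slope, R5.50 per extra unit.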

Activity 14.1

1. Peter is setting up a small home business to manufacture an item he has
developed. He has invested R10 000 in equipment and can produce each
item for R0.65. Determine the cost function for Peter’s product.
2. A car rental agency leases cars at a rate of R100 per day plus R2.00 per
kilometre driven. Determine the function which expresses the daily cost of
renting a car as a function of the number of kilometres driven in one day.
What will the total cost be for 315 km?
3. The police department is contemplating the purchase of an additional patrol
car. Police analysts estimate the purchase cost of a fully equipped car to be
R180 000. They have also estimated an average operating cost of R4.00 per
kilometre. Determine the function that represents the total cost of owning
and operating the car in terms of the number of kilometres it is driven. What
are the projected costs if the car is driven 50 000 km during its lifetime?


14.2.3 Linear revenue function


The money that flows into a business from either selling products or providing
services is referred to as revenue. If we assume that the selling price is the same
for all units sold, then
total revenue (R) = price (p) × quantity (x)

Example 14.2

A local car rental agent is trying to compete with some of the larger companies
and bought good second-hand cars for his fleet. He also simplified the rental rate
structure by charging a flat R125 per day for the use of the car. The total linear
revenue function is R = 125x.

If a car was rented out for 20 days last month, what was the total revenue for
the car?
R(x = 20) = 125(20) = 2 500 rand

14.2.4 Linear profit function


Profit is the difference between total revenue and total cost.

P(x) = R(x) - C(x)

When total revenue exceeds total costs, profit is positive and is referred to as net
gain. When total costs exceed total revenue, profit is negative and it is called net
loss or deficit.

Example 14.3

The price of a single product is R65. Variable costs per unit are R20 for materials
and R27.50 for labour. Annual fixed costs are R100 000. Construct the profit
function and determine the profit if annual sales are 20 000 units.
C(x) = 100 000 + 47.50x
R(x) = 65x
P(x) = 65x - (100 000 + 47.50x)
= 17.50x - 100 000

If 20 000 units are sold, the profit will be:
P(20 000) = 17.50(20 000) - 100 000
= 250 000 rand
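
The profit calculation in Example 14.3 can be reproduced with a short sketch. The function names below are illustrative assumptions; only the figures come from the example. Expanding P(x) = 65x - (100 000 + 47.50x) gives 17.50x - 100 000, which the code confirms numerically.

```python
def revenue(price, units):
    """Linear revenue function R = p*x."""
    return price * units


def total_cost(fixed, variable_per_unit, units):
    """Linear cost function C = F + V*x."""
    return fixed + variable_per_unit * units


def profit(price, fixed, variable_per_unit, units):
    """Profit P = R - C."""
    return revenue(price, units) - total_cost(fixed, variable_per_unit, units)


# Example 14.3: price R65, variable cost R20 + R27.50 = R47.50, fixed cost R100 000.
profit_20000 = profit(65, 100_000, 20 + 27.50, 20_000)
profit_zero = profit(65, 100_000, 47.50, 0)

print(profit_20000, profit_zero)
```

With no sales the result is a net loss equal to the fixed cost, and each unit sold adds R17.50 (price minus variable cost) to profit.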

14.2.5 Break-even analysis


Break-even analysis is used to determine the level of sales (expressed either in
rands or in units of output) at which the business breaks even, that is, at which it
neither earns profits nor incurs losses. The break-even point will be achieved when
sales produce just enough revenue above variable costs to cover fixed costs.

Steps
1. Construct the total cost function C(x), where x represents the level of output.
2. Construct the total revenue function R(x).
3. Set C(x) = R(x) and solve for x.

Example 14.4

A product is priced at R10 and the variable cost is R6 per unit. If total fixed costs
are R1 000, the break-even point in units of output sold is:
C(x) = 1 000 + 6x
R(x) = 10x
10x = 1 000 + 6x
10x - 6x = 1 000
4x = 1 000
x = 250 units
250 units at R10 each will give a break-even income of R2 500.
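
Setting C(x) = R(x) with C(x) = F + Vx and R(x) = px gives x = F / (p - V). The sketch below applies this to Example 14.4; the function name `break_even_units` is a hypothetical helper introduced for illustration.

```python
def break_even_units(fixed, price, variable_per_unit):
    """Units x at which revenue p*x equals cost F + V*x."""
    margin = price - variable_per_unit
    if margin <= 0:
        # Each unit sold must contribute something towards fixed costs.
        raise ValueError("price must exceed variable cost per unit")
    return fixed / margin


# Example 14.4: F = R1 000, p = R10, V = R6.
units = break_even_units(1_000, 10, 6)
break_even_value = units * 10

print(units, break_even_value)
```

The same helper can be reused for the Test Yourself questions by substituting their fixed cost, price and variable cost.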

TEST YOURSELF 14

1. An engineer is interested in forming a company to produce smoke detectors.
The estimated variable costs per unit, including materials and labour,
are R22.50. Fixed costs associated with the formation, operation and
management of the company, as well as the purchase of equipment and
machinery, total R250  000. A market-related selling price would be R30
per detector.


a) Determine the number of smoke detectors that must be sold in order for
the company to break even.
b) Determine the break-even value.
c) If marketing research indicated that the firm can expect to sell
approximately 30  000 smoke detectors over the life of the project,
determine expected profits at this level of output.
2. A company produces a product which sells at a price of R25 per unit. Variable
costs are estimated to be R18.75 per unit and fixed costs are R50 000.
a) Determine the break-even level of output.
b) Compute the total cost and total revenue at the break-even point.
c) What will profit be if demand is 7 500 units?
3. A local Gauteng charity organisation is planning a one-week holiday in
Cape Town. The venture is a fund-raising effort. A package deal has been
worked out with a commercial airline whereby the charity will be charged
a fixed cost of R10 000 plus R300 per person. The R300 covers the flight
cost, airport tax, hotel and meals. The organisation is planning to price the
package at R450 per person.
a) Determine the number of persons necessary to break even on the
venture.
b) The goal of the organisation is to net a profit of R10 000. How many
people must participate for the goal to be realised?


UNIT 15 Interest calculations

In this unit we use calculation skills to analyse financial information. The
principles of such an analysis are established through the concepts of interest
and the value of money.

After completion of this unit you will be able to:


• understand simple interest, its calculation and application
• understand compound interest, its calculation and application
• understand the calculation and application of annuities.

Interest is the cost of money. When money is borrowed, the cost involved in
using the money is that the borrower will be required to pay back more than was
borrowed. When capital is invested, the cost of money will be the interest the
investor receives in return.
The fact is ‘money earns money’.

15.1 Basic concepts
Interest (I) is the money paid for the use of borrowed money or money earned
when capital is invested.
The capital on which the interest is calculated at the beginning of the
transaction is called the principal (P) or present value.
The rate of interest (r) is that percentage of the principal that is to be paid for
each unit of time and is expressed as a percentage per year.
The time period (t) is the period for which the money is borrowed and is
expressed in years or a fraction of a year.
The amount to be repaid at the end of the term, that is, the principal plus the
interest, is referred to as the amount (A) or future value.
Interest can be calculated on the principal sum as:
• simple interest
• compound interest.


15.2  Simple interest


In simple interest calculations, interest is calculated on the principal sum only
at the end of a specified period such as at the end of a year. The interest is not
available before the end of the term and the interest is not added to the principal
to earn interest on interest.
The standard formulae for calculating simple interest are:
I = Prt
A = P(1 + rt)
A = P + I

Where:
I = amount of interest
P = principal
A = amount
r = interest rate per annum expressed as a decimal
t = time in years or a portion of a year

Note: Exact interest is calculated on a basis of 365 days per year or 366 in a leap
year. Ordinary interest is calculated on a basis of 360 days per year or 30 days
per month.

Example 15.1

1. Thandi borrows R5 000 from Simon. Thandi must repay the R5 000 before
the end of 12 months and the interest is 15% per year.
How much must Thandi pay Simon after 12 months?
I = Prt
= 5 000(0.15)(1)
= 750
A = P + I
= 5 000 + 750
= R5 750

2. Determine the present value at 15% simple rate of interest on an amount of
R12 500 due in one year and nine months.
A = P(1 + rt)
12 500 = P[1 + (0.15 × 1.75)]
P = R9 900.99


3. B borrows R500 from A and at the end of eight months pays A an amount
of R525.
What is the simple rate of interest earned?
I = Prt
$$25 = 500 \times r \times \frac{8}{12}$$
r = 0.075 = 7.5%

4. How long will it take R5 000 to earn R50 interest at 10%?
I = Prt
50 = 5 000(0.10)t
t = 0.10 of a year, which is one month and six days
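
The four parts of Example 15.1 all rearrange the same relationship, I = Prt. A minimal sketch in Python follows; the function names are assumptions introduced here, and rates are entered as decimals.

```python
def simple_interest(principal, rate, years):
    """I = P*r*t."""
    return principal * rate * years


def simple_amount(principal, rate, years):
    """A = P(1 + r*t)."""
    return principal * (1 + rate * years)


def simple_present_value(amount, rate, years):
    """P = A / (1 + r*t)."""
    return amount / (1 + rate * years)


def simple_rate(interest, principal, years):
    """r = I / (P*t)."""
    return interest / (principal * years)


def simple_time(interest, principal, rate):
    """t = I / (P*r), in years."""
    return interest / (principal * rate)


interest_1 = simple_interest(5_000, 0.15, 1)          # part 1
amount_1 = simple_amount(5_000, 0.15, 1)              # part 1
pv_2 = simple_present_value(12_500, 0.15, 1.75)       # part 2
rate_3 = simple_rate(25, 500, 8 / 12)                 # part 3
time_4 = simple_time(50, 5_000, 0.10)                 # part 4

print(interest_1, amount_1, round(pv_2, 2), rate_3, time_4)
```

The time in part 4 comes out as a fraction of a year; converting 0.10 of a year to one month and six days assumes 30-day months.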

Activity 15.1

1. The interest earned by a savings account that earns 8% simple interest
amounted to R6.30 in 90 days. Calculate the amount that was invested.
2. What is the principal of a 12% loan that requires a payment of R1  200
quarterly?
3. An investment of R6 000 at 12% generates interest of R96. Calculate the
time of the investment.
4. A principal of R6 400 earned R288 interest over a 90-day period. What was
the rate?

15.3 Compound interest
When interest is not paid out at the end of each period but is added to
the principal, the principal keeps increasing and we say the interest is
compounded. This means that interest calculated in period one on the principal
amount is added to the principal amount so that the interest calculated in period
two is calculated on the increased balance.
Interest can be compounded annually (once a year), semi-annually (twice a
year), quarterly (four times a year), monthly (12 times a year), or even daily
(365 times per year). If interest is compounded, the interest rate, quoted as
a yearly rate, should be adjusted to a period rate. The time period, which is
normally quoted in years, should be adjusted to the number of interest periods
per transaction.
• For example, if interest is compounded quarterly, and the time period is five

years, then the number of interest-compounding periods (n) is 5 × 4 = 20.
n = number of years × number of compounding periods per year
• To obtain the period rate (i) from the yearly rate (r) the average rate per
period method is followed: for example, if the annual rate is 6% compounded
quarterly, the period rate is taken to be 6% ÷ 4 = 1.5%
$$i = \frac{\text{annual rate}}{\text{number of compounding periods per year}}$$

The standard formulae for calculating compound interest are:

$$A = P(1 + i)^n$$

$$i = \left(\frac{A}{P}\right)^{1/n} - 1$$

$$t = \frac{\log\frac{A}{P}}{\log(1 + i)}$$

Where:
A = amount or future value
P = principal or present value
i = interest rate per period expressed as a decimal
n = total number of interest periods (number of years × number of compounding periods per year)
t = time expressed as a number of interest periods

Example 15.2

1. Simon lends R1  000 to Thandi at a rate of 15% per annum calculated
monthly. What is the amount she must repay at the end of two years?
The interest rate of 15% is the interest that is charged for the year.
However, if the interest is to be calculated monthly, then the annual interest
rate (15%) must be converted to a monthly interest rate by dividing by 12:
$$i = \frac{15\%}{12} = 1.25\%$$
The two-year time period should change to n = 12 × 2 = 24
$$A = 1\,000(1 + i)^{24} = 1\,000(1.0125)^{24} = \text{R}1\,347.35$$
Amount of interest paid:
R1 347.35 - R1 000 = R347.35

2. A young man inherited R200 000. He wants to invest a portion of his
inheritance to accumulate R300 000 in 15 years. What portion of the


money should be invested if the money will earn 8% per year compounded
semi-annually and how much interest will be earned over the period?
i = 4%, n = 30
$$A = P(1 + i)^n$$
$$300\,000 = P\left(1 + \frac{4}{100}\right)^{30}$$
$$P = \frac{300\,000}{3.2434} = \text{R}92\,495.53$$
I = 300 000 - 92 495.53 = R207 504.47

3. Determine the interest rate on a study loan which would increase its value
from R36 000 to R50 000 in five years if the interest is compounded
monthly.
$$i = \left(\frac{A}{P}\right)^{1/n} - 1$$
$$= \left(\frac{50\,000}{36\,000}\right)^{1/60} - 1$$
= 0.0055
The monthly rate is 0.0055, so the annual rate is 0.0055 × 12 × 100 = 6.6%

4. How long will it take for R20 to amount to R30 at 5% compounded quarterly?
$$t = \frac{\log\frac{A}{P}}{\log(1 + i)}$$
$$= \frac{\log\frac{30}{20}}{\log\left(1 + \frac{1.25}{100}\right)}$$
= 32.64 quarters ≈ 8 years, 1 month, 28 days
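
The compound interest formulas can be checked against parts 1, 3 and 4 of Example 15.2 with the sketch below. The function names are assumptions made for illustration; `i` always denotes the rate per compounding period as a decimal and `n` the total number of periods.

```python
import math


def future_value(principal, i, n):
    """A = P(1 + i)^n."""
    return principal * (1 + i) ** n


def rate_per_period(amount, principal, n):
    """i = (A/P)^(1/n) - 1."""
    return (amount / principal) ** (1 / n) - 1


def periods_needed(amount, principal, i):
    """t = log(A/P) / log(1 + i), measured in compounding periods."""
    return math.log(amount / principal) / math.log(1 + i)


fv_1 = future_value(1_000, 0.15 / 12, 12 * 2)      # part 1: monthly, two years
i_3 = rate_per_period(50_000, 36_000, 12 * 5)      # part 3: monthly, five years
t_4 = periods_needed(30, 20, 0.05 / 4)             # part 4: quarterly

print(round(fv_1, 2), round(i_3, 4), round(t_4, 2))
```

Because the result of `periods_needed` is a count of periods, it has to be divided by the number of periods per year to obtain years; in part 4 that means dividing the number of quarters by four.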

Activity 15.2

1. Find the present value of R2 000 due in 18 months if money is worth 11%
compounded semi-annually.
2. R800 is invested in an account which earns 10% compounded quarterly.
Calculate the amount in the account at the end of five years and how much
interest will be earned.
3. Find the time in which R1 000 will amount to R1 500 at 4% compounded
monthly.


4. Find the rate of interest, compounded quarterly, at which R4  400 will
amount to R8 500 in 16 years.

15.4 Nominal and effective rates of interest


When interest is compounded more often than once a year the annual interest
rate is referred to as the nominal rate. The interest actually earned per year,
expressed as a rate of the principal, is known as the effective rate (r) of interest.
$$r = (1 + i)^m - 1$$
$$i = (r + 1)^{1/m} - 1$$
i = the interest rate per compounding period (the nominal rate divided by m)
m = the number of compounding periods within one year

Example 15.3

1. What is the effective rate of interest equivalent to 10% compounded
monthly?
$$r = \left(1 + \frac{10}{1\,200}\right)^{12} - 1$$
= 0.1047 ≈ 10.47%

2. What is the nominal rate of interest compounded quarterly, equivalent to
12% effective?
$$i = (0.12 + 1)^{1/4} - 1$$
= 0.0287, which is 0.0287 × 4 = 0.1148 ≈ 11.48%
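
The conversion between nominal and effective rates can be sketched as two Python functions that are inverses of each other. The function names are illustrative assumptions. Here the nominal rate is the quoted annual rate, and dividing it by m gives the rate per compounding period.

```python
def effective_rate(nominal, m):
    """Effective annual rate for a nominal rate compounded m times a year."""
    return (1 + nominal / m) ** m - 1


def nominal_rate(effective, m):
    """Nominal annual rate, compounded m times a year, for a given effective rate."""
    per_period = (1 + effective) ** (1 / m) - 1
    return per_period * m


eff_1 = effective_rate(0.10, 12)   # Example 15.3, part 1
nom_2 = nominal_rate(0.12, 4)      # Example 15.3, part 2

print(round(eff_1, 4), round(nom_2, 4))
```

Carrying full precision through part 2 gives a nominal rate of roughly 11.49%, slightly above the 11.48% obtained in the worked example, which rounds the quarterly rate to 0.0287 before multiplying by four.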

Activity 15.3

1. What is the effective rate of interest equivalent to 7% converted
semi-annually, quarterly and monthly?
2. What is the nominal rate of interest, compounded semi-annually and
monthly, equivalent to 20% effective?

15.5 Annuities
An annuity is a sequence of equal payments made at equal time intervals, such
as instalment payments, pensions, insurance premiums, home loan payments,
rent, etc. The time between successive payments (R) is called the payment
interval and the time between the first payment and the last payment is called


the term of the annuity. The payment interval and the interest period always
coincide, which means that if the interest is compounded monthly, the payments
will be monthly.
Annuities are classified into two main classes:
• Ordinary annuities certain refer to annuities where the regular payments
are made at the end of each payment interval.
• Ordinary annuities due refer to annuities where the periodic payment (R)
falls at the beginning of each payment interval.

15.5.1 Ordinary annuities certain


The regular payments are made at the end of each payment period.
To calculate the future value or amount (A) of an ordinary annuity certain we
apply the following formula:
$$A = R\,\frac{(1 + i)^n - 1}{i}$$

To calculate the present value or principal (P) of an ordinary annuity certain we
apply the following formula:
$$P = R\left(\frac{1 - (1 + i)^{-n}}{i}\right)$$

Example 15.4

1. Determine the amount of an annuity certain of R150 per quarter for three
years if money is worth 12% compounded quarterly.
$$A = R\,\frac{(1 + i)^n - 1}{i}$$
$$= 150 \times \frac{\left(1 + \frac{3}{100}\right)^{12} - 1}{\frac{3}{100}} = 2\,128.80 \text{ rand}$$

2. A student needs R3 000 a year for books for four years with the first R3 000
available one year from now. If the student can get 8% p.a. return on
investment, how much money should he invest now?
$$P = R\left(\frac{1 - (1 + i)^{-n}}{i}\right)$$
$$= 3\,000\left(\frac{1 - \left(1 + \frac{8}{100}\right)^{-4}}{\frac{8}{100}}\right) = 9\,936.38 \text{ rand}$$


3. Arthur wants to have R6 000 in the bank in five years’ time. He plans to deposit
the correct amount at the end of each month to achieve this. What should the
value of each monthly payment be if interest is 15% compounded monthly?
$$A = R\,\frac{(1 + i)^n - 1}{i}$$
$$6\,000 = R\left(\frac{\left(1 + \frac{1.25}{100}\right)^{60} - 1}{\frac{1.25}{100}}\right)$$
$$R = \frac{6\,000}{88.5745} = 67.74 \text{ rand}$$

4. A family buys a refrigerator that sells for R350. They pay R50 deposit
and the balance in 24 equal monthly payments. If the seller charges 18%
compounded monthly, how much will the monthly payments be?
$$P = R\left(\frac{1 - (1 + i)^{-n}}{i}\right)$$
$$300 = R\left(\frac{1 - \left(1 + \frac{1.5}{100}\right)^{-24}}{\frac{1.5}{100}}\right)$$
$$R = \frac{300}{20.03} = 14.98 \text{ rand}$$
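
The two ordinary annuity formulas, and their rearrangements for the payment R, can be sketched as follows. The function names are assumptions introduced for this illustration; `i` is the rate per payment period as a decimal and `n` the number of payments, each made at the end of its period.

```python
def annuity_future_value(payment, i, n):
    """A = R[(1 + i)^n - 1] / i."""
    return payment * ((1 + i) ** n - 1) / i


def annuity_present_value(payment, i, n):
    """P = R[1 - (1 + i)^(-n)] / i."""
    return payment * (1 - (1 + i) ** -n) / i


def payment_for_future_value(amount, i, n):
    """Payment R that accumulates to the amount A."""
    return amount * i / ((1 + i) ** n - 1)


def payment_for_present_value(principal, i, n):
    """Payment R that repays the principal P."""
    return principal * i / (1 - (1 + i) ** -n)


fv_1 = annuity_future_value(150, 0.03, 12)            # Example 15.4, part 1
pv_2 = annuity_present_value(3_000, 0.08, 4)          # part 2
pay_3 = payment_for_future_value(6_000, 0.0125, 60)   # part 3
pay_4 = payment_for_present_value(300, 0.015, 24)     # part 4

print(round(fv_1, 2), round(pv_2, 2), round(pay_3, 2), round(pay_4, 2))
```

In part 4 the principal passed in is R300, the selling price of R350 less the R50 deposit.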

Activity 15.4

1. At the end of every month, Julie deposits R500 in a bank account for her son.
How much will be in the account at the end of four years if it accumulates
interest at a rate of 9%?
2. John wants to accumulate R20 000 to purchase a business upon retiring from
his present job. How much must be put aside at the end of every six months for
10 years if the interest he receives is compounded semi-annually at 6%?
3. A television set is bought for R100 deposit and R580 payable at the end of
the next four quarters. What is the equivalent cash price if the rate is 27%
quarterly?
4. Mr Bones wants to buy a house costing R260  000. If he pays R26  000
deposit, how much will his monthly payments be if he gets a 20-year bond
at 14% converted monthly?


15.5.2 Ordinary annuities due


If the periodic payments fall at the beginning of each payment period (pay in
advance), the following formulae apply:
To calculate the amount or future value:
$$A = R\left(\frac{[(1 + i)^n - 1][1 + i]}{i}\right)$$

To calculate the present value or principal:
$$P = R\left(\frac{[1 - (1 + i)^{-n}][1 + i]}{i}\right)$$

Example 15.5

1. An investment of R200 is made at the beginning of each year for 10 years. If
interest is 12%, how much will the investment be worth at the end of 10 years?
$$A = R\left(\frac{[(1 + i)^n - 1][1 + i]}{i}\right)$$
$$= 200\left(\frac{\left[\left(1 + \frac{12}{100}\right)^{10} - 1\right]\left[1 + \frac{12}{100}\right]}{\frac{12}{100}}\right) = 3\,930.92 \text{ rand}$$

2. The premium on a life insurance policy is R60 per quarter, payable in
advance. Determine the cash equivalent of a year’s premiums if the
insurance company charges 10% compounded quarterly for the privilege of
paying this way instead of all at once for the year.
$$P = R\left(\frac{[1 - (1 + i)^{-n}][1 + i]}{i}\right)$$
$$= 60\left(\frac{\left[1 - \left(1 + \frac{2.5}{100}\right)^{-4}\right]\left[1 + \frac{2.5}{100}\right]}{\frac{2.5}{100}}\right) = 231.36$$

3. The beneficiary of a life insurance policy may take R10 000 in cash or
10 equal payments, the first to be made immediately. What is the annual
payment if money is worth 12%?
$$P = R\left(\frac{[1 - (1 + i)^{-n}][1 + i]}{i}\right)$$
$$10\,000 = R\left(\frac{\left[1 - \left(1 + \frac{12}{100}\right)^{-10}\right]\left[1 + \frac{12}{100}\right]}{\frac{12}{100}}\right)$$
$$R = \frac{10\,000}{6.3282} = 1\,580.23$$


4. The Bell Company plans to open a new retail outlet in its chain of telephone
equipment stores three years from today. How much must Bell invest at the
beginning of each half-year to have enough for the estimated costs of
R100 000 if the interest rate is 9% compounded semi-annually?
Because R100 000 is the amount required at the end of the term, the future value formula applies:
$$A = R\left(\frac{[(1 + i)^n - 1][1 + i]}{i}\right)$$
$$100\,000 = R\left(\frac{\left[\left(1 + \frac{4.5}{100}\right)^{6} - 1\right]\left[1 + \frac{4.5}{100}\right]}{\frac{4.5}{100}}\right)$$
$$R = \frac{100\,000}{7.0192} = 14\,246.64$$
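
An annuity due differs from an ordinary annuity only by the extra factor (1 + i), because every payment is made one period earlier. The sketch below applies this to Example 15.5; the function names are assumptions introduced for illustration.

```python
def due_future_value(payment, i, n):
    """A = R[(1 + i)^n - 1][1 + i] / i, payments at the start of each period."""
    return payment * ((1 + i) ** n - 1) * (1 + i) / i


def due_present_value(payment, i, n):
    """P = R[1 - (1 + i)^(-n)][1 + i] / i, payments at the start of each period."""
    return payment * (1 - (1 + i) ** -n) * (1 + i) / i


fv_1 = due_future_value(200, 0.12, 10)      # part 1
pv_2 = due_present_value(60, 0.025, 4)      # part 2

# Parts 3 and 4 solve for the payment R by dividing by the factor for R = 1.
pay_3 = 10_000 / due_present_value(1, 0.12, 10)
pay_4 = 100_000 / due_future_value(1, 0.045, 6)

print(round(fv_1, 2), round(pv_2, 2), round(pay_3, 2), round(pay_4, 2))
```

The payments for parts 3 and 4 may differ from the worked answers by a few cents, because the worked examples round the annuity factor to four decimals (6.3282 and 7.0192) before dividing.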

Activity 15.5

1. The rent of a building is R15 000 per year payable in advance. If the interest
rate is 6% compounded monthly, what will the equivalent monthly rental,
payable in advance, be?
2. Mr Cute bought a car paying R2 000 deposit and R200 at the beginning of
each week for two years. If the interest rate is 9% compounded weekly, what
was the cash price of the car?
3. A school sets aside R10 000 at the beginning of each year to create a fund in
case of further expansion. If the fund earns 5%, how much does it amount
to at the end of the seventh year?
4. A debt of R5 000, inclusive of 5% interest compounded quarterly, is to be
settled within three years in equal quarterly payments. If the first payment
is due today, what will be the size of each payment?

TEST YOURSELF 15

1. Using the simple interest approach, how much interest will you pay on a
loan of R15 000 at 12.5% interest per annum? How much must you repay
after three years and six months?
2. If the amount plus interest that George had to repay at the end of a one-
year loan was R11  000 and the interest rate was 10% per annum, what
was the amount of the principal sum? Use the simple interest formula to
calculate.


3. A waitress who was temporarily pressed for funds pawned her watch and
diamond ring for R55. At the end of one month she redeemed them by
paying R59.40. What was the annual rate of interest?
4. A 26% interest charge on an overdue account of R800 came to R21. How
late was the account?
5. A mechanic borrowed R125 from a cash loan company and at the end of
one month paid off the loan with R128.75. What annual rate of interest
was paid?
6. At what rate will simple interest on R1 127 amount to R318 in 135 days?
7. John invests R200 at 7.75% simple interest per annum and receives R295
after a certain time. For how long was the money invested?
8. Philemon has an option of financing the purchase of a new music centre,
with a price of R800, through a loan for one year. The interest rate on the
loan is 12% per annum. He has an option of taking a loan with interest
calculated quarterly or a loan where the interest is calculated semi-annually.
Which option will you recommend to Philemon? The lender will apply the
compound interest formula.
9. The outstanding amount on your account is R2 650. If the store charges
24% interest compounded monthly, how much will you owe after three
months if no payment was made during that period?
10. A cell phone company will need R500 000 to replace a piece of equipment in
eight years. How much must be invested now at 6% compounded quarterly
to accumulate this amount?
11. If R500 amounts to R700 in five years with interest compounded quarterly,
what is the rate of interest?
12. A cash loan company charges 36% compounded monthly on small loans.
How long will the loan company take to triple its money at this rate?
13. How long will it take R4  000 to amount to R5  000 at 9% compounded
quarterly?
14. What is the effective rate of interest equivalent to 15% converted semi-
annually, quarterly and monthly?
15. What is the nominal rate of interest, compounded semi-annually and
monthly, equivalent to 24% effective?
16. Which gives the better annual return on investment: 4% compounded
quarterly, 4% converted semi-annually or 4% converted monthly?
17. A refrigerator can be bought for R50 deposit and R28 per month for 24
months, payable at the end of each month. What is the equivalent cash
price if the rate is 26%?


18. A company bought a machine costing R8 000 and estimates that its useful
life will be five years, after which it will be sold as scrap for R300. The
company decides to set up a reserve fund to cover the cost of a replacement
machine in five years’ time. Equal amounts are to be invested at the end of
each year in an account that earns 10% compound interest. Due to inflation,
it is estimated that the cost of this machine will be R15  000. How much
must be invested each year to cover the cost of a replacement machine,
allowing for the scrap value of the present one?
19. If money is worth 15% compounded quarterly, what single payment today
is equivalent to 15 quarterly payments of R100 each, the first due three
months from today?
20. A cash loan company charges 36% converted monthly for small loans. What
would be the payment at the end of every month if a loan of R250 is to be
repaid within one year?
21. Mr Smith invests R20 at the end of every week at 18% compounded weekly.
What amount will be in his savings account after six months?
22. A student wants to save R15 000 for a trip after graduation, four years from
now. How much must she save at the end of every six months if she gets
15% compounded semi-annually?
23. Mr T. Bone took out a R100 000 loan on a steakhouse over a 10-year period
at an interest rate of 12% compounded monthly. After 3.5 years, interest
rates climbed to 15% compounded monthly. If his repayments were made
at the end of each month, how much did Mr Bone owe at the end of the first
3.5 years? What was his monthly repayment for the remaining 6.5 years?
24. Instead of taking R5  000 from an inheritance, Peter decides to take
monthly payments for a period of five years, with the first to be made
immediately. If interest is 6% compounded monthly, what will be the size
of each payment?
25. Instead of paying R1 250 rent at the beginning of each month for the next
eight years, Mary decides to buy a flat. Considering interest of 15% to be
compounded monthly, what is the cash equivalent of the eight years’ rent?
26. At the beginning of each semester, Abdul invests R900 at an interest rate
of 7% compounded semi-annually to guarantee a sum sufficient to start a
practice for his daughter, who is entering medical school. If his daughter
finishes within eight years, how much will Abdul have for the practice?
27. Dr Kaye wants to spend five years researching a new book on the motor
industry. He calculated that he needs R12 000 a month to live on over the
five years. How much must Dr Kaye deposit today in an account earning


12% interest compounded monthly in order to withdraw R12 000 at the
beginning of each month?
28. James wants to accumulate R500 within the next three months by depositing
money in a savings account at the beginning of each week. The bank pays
4% compounded weekly. How much must he deposit every week to reach his
target?
29. An investment of R20 is made at the beginning of each day in the money
market for one year at 6% compounded daily. How much will the investment
be worth at the end of the year?


Appendix 1 and 2


Appendix 3 and 4
