
Statistics

Term: Explanation
data: Facts or information collected from people or objects. ‘Data’ is the plural of ‘datum’.
population: The entire group of people or objects that data is being collected from.
sample: A smaller part of the population, used when the population is too large to collect data from every member.
random: A way of choosing a sample from the population so that the selection is not biased.
questionnaire: A set of printed questions with a choice of answers used in the data collection process.
survey: The collecting of data from a group of people.
discrete data: Data that can only take certain values. For example, the number of learners in a class (there can’t be half a learner).
continuous data: Data that can take on any value within a certain range. For example, the heights of a group of learners (heights could be measured in decimals).
tally: A way of keeping count by drawing marks. Every fifth mark is drawn across the previous four (to form a gate-like diagram) so you can easily see groups of five.
frequency tables: A table that lists a set of scores and their frequency. Often used with tallies. It summarises the totals and shows how often something has occurred.
ogive: A cumulative frequency graph. It can be used to determine how many data values lie above or below a certain value in a data set.
measures of central tendency: A measure of central tendency is a single value that describes the way in which a group of data clusters around a central value. There are three measures of central tendency: the mean, the median, and the mode.
mean: The average of a set of numbers. Calculated by adding all the values, then dividing by how many values there are.
median: The middle number in a sorted list of numbers. To find the median, place all numbers in order from smallest to biggest and find the middle number.
mode: The number that appears the most often in a set of data. There can be more than one mode. There could also be no mode in a set of data.
modal class: The modal class is the class with the highest frequency from a set of grouped data. In other words, the interval with the most “members”.
measures of dispersion: Measures of dispersion like the range, percentiles and quartiles tell you about the spread of scores in a data set. Like central tendency, they help you summarise a set of data with one or just a few numbers.
range: The difference between the highest and lowest value in a set of data.
percentiles: Each of the 100 equal groups into which a population can be divided according to the distribution of values of a variable. A percentile is the value below which a given percentage of the data falls.
quartiles: Each of four equal groups into which a population can be divided according to the distribution of values of a variable. The quartiles are the values that divide a list of numbers into quarters.
interquartile range: The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that separate the parts are called the first, second, and third quartiles, and they are denoted by Q1, Q2, and Q3, respectively. IQR = Q3 − Q1.
histogram: A graph representing data that is grouped into ranges, where each bar represents data that follows on from the previous bar. For example, one bar could represent how many learners got a mark from 40-49 and the bar immediately next to it would represent 50-59.
outliers: These are values that are significantly higher or lower than all the other values in the data set. They are also called extremes. Outliers can affect the mean of the data and are sometimes excluded when calculations are done.
skewed data: Data whose distribution is not symmetrical. Skewness is a measure of this asymmetry. If the data is represented visually, the curve will be distorted, with a longer tail on one side.
estimated mean: An estimate of the mean can be determined for grouped data. Unlike listed data, the individual values for grouped data are not available, and you are not able to calculate their sum. To calculate the mean of grouped data, the first step is to determine the midpoint of each interval, or class. These midpoints must then be multiplied by the frequencies of the corresponding classes. The sum of the products divided by the total number of values will be the value of the estimated mean.
variance: Variance measures how far a data set is spread out. It is the average of the squared differences from the mean.
standard deviation: A quantity expressing by how much the members of a group differ from the mean value for the group. Standard deviation is the square root of the variance.
ungrouped data: Ungrouped data has not been classified or subdivided into groups. It is the raw data, in the form of a list of numbers.
grouped data: Data that has been ordered and sorted into groups called classes. Histograms and frequency tables can be used to show this type of data.
five number summary: Lowest value, lower quartile, median, upper quartile and highest value from a set of data. The five numbers are used to draw a box and whisker plot.
box and whisker plot: A simple way of representing statistical data on a plot in which a rectangle is drawn from the lower quartile to the upper quartile, usually with a vertical line inside to indicate the median value. The lower and upper quartiles are the vertical lines at either side of the rectangle. The lowest value and highest value in the data set are represented at each end of the whiskers.
scatter plots: A graph in which the values of two variables are plotted along two axes. The pattern of the resulting points reveals whether there is any correlation between the two sets of values.
line of best fit: A line of best fit is a straight line drawn through the centre of a group of data points plotted on a scatter plot. It is drawn intuitively.
least squares regression line: ‘Least squares’ is a statistical method used to determine an equation for a line of best fit by minimising the sum of the squares. A “square” is determined by squaring the vertical distance between a data point and the regression line.
correlation coefficient: A statistical measure of the linear relationship (correlation) between a dependent and an independent variable. It is represented as 𝑟 and its value varies from −1 to 1.

Grade 11 concepts

STANDARD DEVIATION

Deviation means ‘how far from the normal’. The standard deviation is a measure of how
spread out the data is. The symbol used is 𝜎, the Greek letter sigma.
Variance is required to find standard deviation – so what does that mean? Variance is the
average of the squared differences from the mean.

The following example, using the height of 5 dogs, will be used to explain both variance and
standard deviation further. Note what we are finding out about the data provided when we
find standard deviation – we are finding what the norm is and which data lies within the norm
and which data lies outside the norm.
The heights of 5 dogs are found and recorded:

Steps to follow:
 Find the mean of the heights
 Find the difference between each dog’s height and the mean (some answers will
be negative)
 Square the differences
 Find the average of the squared differences
 Square root the answer.

Height      Difference from mean      Difference squared
600 mm      206                       42 436
470 mm      76                        5 776
170 mm      −224                      50 176
430 mm      36                        1 296
300 mm      −94                       8 836

Mean = 1970 ÷ 5 = 394 mm
Mean of squares = 108 520 ÷ 5 = 21 704
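The five steps can also be checked with a short Python sketch. The variable names are illustrative only and are not part of the original notes; the data is the five dog heights.

```python
import math

heights_mm = [600, 470, 170, 430, 300]  # the five dog heights

# Step 1: find the mean of the heights
mean = sum(heights_mm) / len(heights_mm)

# Step 2: difference between each height and the mean (some are negative)
differences = [h - mean for h in heights_mm]

# Step 3: square the differences
squared = [d ** 2 for d in differences]

# Step 4: the variance is the mean of the squared differences
variance = sum(squared) / len(squared)

# Step 5: the standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev))
```

Running it should reproduce the mean of 394, the variance of 21 704 and a standard deviation of about 147.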

Draw a horizontal line onto the diagram representing the mean measurement.
Why do you think we need to square these numbers before we can find the mean of them?
(If we found the mean of a set of positive and negative integers, it would not represent the
data, as the answer could even be quite close to zero.)

Note again that 21 704 is the variance. It is the mean of the squared differences. Clearly, this
very large number could not tell us anything about how far each dog’s height might be from
the mean.
Why do you think we square root this number to find standard deviation?
(to ‘undo/reverse’ the squaring that was done to alleviate the problem of the negative
integers)
√21704 = 147,322 …
The standard deviation is 147mm (rounded to the nearest whole).
Add 147mm to the mean (394 + 147 = 541) and subtract 147mm from the mean
(394 − 147 = 247)
Draw a horizontal line at these two measurements. Shade the ‘bar’ created.

The shaded bar represents the heights within one standard deviation from the mean.
This tells us, that after taking all the data into account, we can see which dogs fall within one
standard deviation of the mean and which dogs are considered ‘outside the norm’ and are
either very big or very small.
We could make a wider bar if we added the standard deviation again to the top of the bar
(541+147) and subtracted it from the bottom of the bar (247-147) to get 688 and 100
respectively. If we drew in the horizontal lines, we would now be seeing which dogs lay
within TWO standard deviations from the mean.
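As a rough check of the shaded bars, the sketch below lists which dog heights lie within one and within two standard deviations of the mean. The function name `within` is hypothetical; the mean (394) and rounded standard deviation (147) are the values from the worked example.

```python
heights_mm = [600, 470, 170, 430, 300]
mean = 394     # mean height in mm, from the worked example
std_dev = 147  # standard deviation in mm, rounded as in the text

def within(k):
    """Return the heights lying within k standard deviations of the mean."""
    low = mean - k * std_dev
    high = mean + k * std_dev
    return [h for h in heights_mm if low <= h <= high]

one_sd = within(1)  # band from 247 mm to 541 mm
two_sd = within(2)  # band from 100 mm to 688 mm
print(one_sd, two_sd)
```

With this data, three of the five heights fall inside the first bar and all five fall inside the wider bar.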
But what is considered to be the norm in a set of data? For data that is normally distributed:
 about 68% should lie within one standard deviation from the mean
 about 95% should lie within two standard deviations from the mean
 about 99,7% should lie within three standard deviations from the mean
However, this is only likely to be true if the set of data is large enough.
The example we used above would not be considered sufficient data to make any realistic
conclusions.

Calculator work (Casio is used for the steps and diagrams)

Steps to follow (dog heights used):
1 MODE, choose STAT (2)
2 Choose 1-VAR (1)
3 Enter data, pressing = after each number
4 Press: AC; SHIFT, STAT (1)
5 Choose Var (4)
6 Choose 𝜎𝑥 (3)
7 Press =

You should only use the long method (as in the table about the dogs) if specifically asked.
If this does occur, which is very rare, there is usually a table given for you to complete.
If variance is required in a question, you need to find the standard deviation on your
calculator and then square it.
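The same idea (find the standard deviation, then square it to get the variance) can be sketched with Python's standard `statistics` module. This assumes that 𝜎𝑥 on the calculator is the population standard deviation, which divides by the number of values as in the dog table; `pstdev` and `pvariance` are the matching library functions.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]  # dog heights from the example

# Population standard deviation (assumed to match the calculator's sigma-x)
sigma = statistics.pstdev(heights_mm)

# Variance found by squaring the standard deviation
variance_from_sigma = sigma ** 2

# Variance calculated directly, for comparison
variance_direct = statistics.pvariance(heights_mm)

print(round(sigma), round(variance_from_sigma), variance_direct)
```

Squaring the unrounded standard deviation matters: squaring the rounded value 147 would give 21 609 rather than 21 704.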

SKEWED DATA

In general, data is skewed if there are outliers – data that is not part of the norm according to
the rest of the data. Outliers are values that are significantly higher or significantly lower than
the rest of the data.
If a histogram of a set of data is symmetrical and bell-shaped, the data represented is said
to be normally distributed. The mean and median will be equal (if the data is perfectly
symmetrical) or very close to each other.

The mean is susceptible to the influence of outliers and is not always a good representation
of the data. Both the mean and median are good representations of the data if the sample is
normally distributed.
If the data is skewed, the mean tends to be ‘dragged’ in the direction of the skewness – in
this case the median would be a better measure of central tendency. The more skewed the
data, the greater the difference between the mean and the median.

The summary below sets out the information.
Skewed data

Using the mean and median
Negatively skewed: mean subtract median is negative; mean < median < mode; the mean will be to the left of the median.
Positively skewed: mean subtract median is positive; mode < median < mean; the mean will be to the right of the median.
Notes: If the mean and median are known and there is no visual representation of the data, this method can be used to find in which direction the data is skewed.

Using a histogram or distribution curve
Negatively skewed: longer tail on the left = skewed to the left.
Positively skewed: longer tail on the right = skewed to the right.
Notes: If a histogram or distribution curve is given (remind learners that the curve is a representation of the histogram), the ‘tail’ will show in which direction the data is skewed.

Using a box and whisker plot
Negatively skewed: skewed to the left – the data is more spread out on the left.
Positively skewed: skewed to the right – the data is more spread out on the right.
Notes: If a box and whisker plot is available, the longer box will show in which direction the data is skewed. Tell learners to shade the longer part of the box and write ‘skewed left’ or ‘skewed right’ in the appropriate one.
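The ‘mean subtract median’ rule can be tried out with a small Python sketch. Both data sets are hypothetical and chosen only for illustration; the function name `describe` is also an assumption.

```python
import statistics

marks = [50, 55, 58, 60, 62, 65, 70]  # hypothetical, symmetrical data
marks_with_outlier = marks + [100]    # the same data with one high outlier

def describe(data):
    """Return the mean, the median and the direction of skewness."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    difference = mean - median  # mean subtract median
    if difference > 0:
        direction = "positively skewed"
    elif difference < 0:
        direction = "negatively skewed"
    else:
        direction = "symmetrical"
    return mean, median, direction

print(describe(marks))
print(describe(marks_with_outlier))
```

The single outlier drags the mean from 60 up to 65 while the median only moves from 60 to 61, which is why the median is the better measure for skewed data.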

GROUPED AND UNGROUPED DATA

Ungrouped data: raw data that has not been classified. A list of numbers.
Grouped data: data that has been sorted into groups (classes).
To find the mean of ungrouped data - Add all the values and divide by the number of values.
Why can’t we use the same method to find the mean of grouped data?
We don’t know each value – only how many there are in a certain class interval.

Finding the mean of grouped data through the following example:

We need to represent the information in the histogram in a table to assist us. We need these
headings:
Class interval (percentage)      Frequency      Midpoint of class interval      Midpoint × frequency

 Class interval: The size of each class into which a range of a variable is divided.
In this case, 61 – 65, 66 – 70, etc. They are also often written as inequalities. In
this case, 61 ≤ 𝑥 ≤ 65 could have been used.
 Frequency: how many data values lie in this interval
 Midpoint of class interval: the middle value of the class interval
 Midpoint × frequency: the estimated total for the class, assuming that every value in
the class interval is equal to the midpoint

Note how the table has been completed and confirm you know where each part comes
from.
Class interval (percentage)      Frequency      Midpoint of class interval      Midpoint × frequency
61 – 65                          1              63                              63
66 – 70                          0              68                              0
71 – 75                          2              73                              146
76 – 80                          3              78                              234
81 – 85                          6              83                              498
86 – 90                          1              88                              88
91 – 95                          6              93                              558
96 – 100                         3              98                              294

So why do we find the midpoint of each interval?


We don’t know the actual data, but we have a reasonable idea of the results, so we use the
midpoint of each interval. Essentially, we are assuming that each learner achieved a result
of the midpoint of the class intervals.
Finally, we need to add the totals and divide by 22 (the number of learners) to find the
estimated mean.
1881 ÷ 22 = 85,5
Always look at your answer and ask yourself if it looks reasonable. In other words, does it lie
in the range of the data (61 – 100)?
None of the measures of central tendency should ever lie outside the data range.
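The estimated mean calculation can be sketched in Python as follows. The list layout and variable names are illustrative; the class boundaries and frequencies are taken from the table.

```python
# (lower boundary, upper boundary, frequency) for each class interval
classes = [
    (61, 65, 1), (66, 70, 0), (71, 75, 2), (76, 80, 3),
    (81, 85, 6), (86, 90, 1), (91, 95, 6), (96, 100, 3),
]

# Total number of learners
total_frequency = sum(f for _, _, f in classes)

# Midpoint x frequency for each class, added together
sum_of_products = sum((low + high) / 2 * f for low, high, f in classes)

# Estimated mean = sum of the products divided by the number of learners
estimated_mean = sum_of_products / total_frequency

print(total_frequency, sum_of_products, estimated_mean)
```

The result lies inside the data range of 61 to 100, which is the reasonableness check suggested above.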
Let’s have a look at how the other measures of central tendency could be asked with
grouped data.
Which range of marks (class interval) was the most common?
(81 – 85 and 91 – 95 both had six learners in them)
What measure of central tendency is being discussed? (the mode)
However, as we don’t know the exact values, we can’t give a mode so instead we talk about
a modal class – the class with the most data in it.
This data has 2 modal classes: 81 – 85 and 91 – 95.
How can we find where the median value will lie?
We would need to know the total number of learners represented and find where the middle
value would be.
There are 22 learners in total represented here. The median would therefore lie between the
11th and 12th values. Which class interval would the median be in? (81 – 85).
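A possible way to find the modal class (or classes) and the class containing the median, using the same grouped data, is sketched below. The helper `class_of` is a hypothetical name; it walks through the running total until the required position is reached.

```python
# (class interval, frequency) pairs from the grouped data table
classes = [
    ("61-65", 1), ("66-70", 0), ("71-75", 2), ("76-80", 3),
    ("81-85", 6), ("86-90", 1), ("91-95", 6), ("96-100", 3),
]

# Modal class: every class that shares the highest frequency
highest = max(f for _, f in classes)
modal_classes = [label for label, f in classes if f == highest]

# Median position(s): two middle positions if the total is even
n = sum(f for _, f in classes)
positions = [n // 2, n // 2 + 1] if n % 2 == 0 else [(n + 1) // 2]

def class_of(position):
    """Return the class interval that contains the value at this position."""
    running_total = 0
    for label, f in classes:
        running_total += f
        if position <= running_total:
            return label

median_classes = {class_of(p) for p in positions}
print(modal_classes, positions, median_classes)
```

Both middle positions (11 and 12) fall in the same interval here, so the median class is unambiguous.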

CUMULATIVE FREQUENCY CURVE/OGIVE

An ogive is a cumulative frequency curve. It is used to represent the accumulated amount of data.

Due to the totals being accumulated, an ogive CAN NEVER GO DOWN. If zero is added to a
previous total it will remain the same and the graph would be flat for that interval.
Remember that an ogive usually forms an S-like shape.

An example to work through to understand further:

The following table represents the mathematics results of 155 learners.


Marks Frequency
15 – 20 2
20 - 25 5
25 - 30 8
30 - 35 10
35 - 40 13
40 - 45 17
45 - 50 20
50 - 55 16
55 - 60 12
60 - 65 15
65 - 70 17
70 - 75 20

Now we are going to add a column to the right and accumulate the totals as we work our
way down the table.

What are we expecting the total number of learners to be when we get to the end? (155)

Note how the first total (2) is added to the frequency of the next interval (5) to get the new
cumulative frequency (7).
Marks Frequency Cumulative frequency
15 – 20 2 2
20 - 25 5 7
25 - 30 8 15
30 - 35 10 25
35 - 40 13 38
40 - 45 17 55
45 - 50 20 75
50 - 55 16 91
55 - 60 12 103
60 - 65 15 118
65 - 70 17 135
70 - 75 20 155

Now we are going to add another column to the right for the coordinates.

Important to note:
 An important feature of an ogive is that it shows the total number of observations
in a data set that are less than or equal to the UPPER boundary (think of ‘up to’)
 The lower boundary of the first interval is used for the first coordinate.
Thereafter, only upper boundaries are used.
 Coordinates are made up of: (upper boundary; cumulative frequency)

Note how the table has been completed below.
A recommendation: write the first coordinate above the others. It will be (15; 0). This
represents that zero learners got less than 15% - this will be the starting point of the ogive on
the horizontal axis. The ogive is ‘grounded’ to show that there are no values lower than the
lower boundary of the first class interval.

Marks Frequency Cumulative frequency Co-ordinates
(15; 0)
15 – 20 2 2 (20; 2)
20 - 25 5 7 (25; 7)
25 - 30 8 15 (30; 15)
30 - 35 10 25 (35; 25)
35 - 40 13 38 (40; 38)
40 - 45 17 55 (45; 55)
45 - 50 20 75 (50; 75)
50 - 55 16 91 (55; 91)
55 - 60 12 103 (60; 103)
60 - 65 15 118 (65; 118)
65 - 70 17 135 (70; 135)
70 - 75 20 155 (75; 155)
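The cumulative frequency column and the coordinates can be generated with a short Python sketch. The variable names are illustrative; `itertools.accumulate` from the standard library produces the running totals.

```python
from itertools import accumulate

boundaries = list(range(15, 80, 5))  # 15, 20, ..., 75
frequencies = [2, 5, 8, 10, 13, 17, 20, 16, 12, 15, 17, 20]

# Running totals: each entry is the previous total plus the next frequency
cumulative = list(accumulate(frequencies))

# 'Ground' the ogive at the lower boundary of the first interval,
# then pair each upper boundary with its cumulative frequency
coordinates = [(boundaries[0], 0)] + list(zip(boundaries[1:], cumulative))

for point in coordinates:
    print(point)
```

The last cumulative frequency should equal the total number of learners (155), and the list never decreases, which is why the ogive can never go down.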

Let’s look at what some points represent:


(50; 75) means that 75 learners got less than 50 or exactly 50
(65; 118) means that 118 learners got less than 65 or exactly 65

When drawing the ogive, note the following:


 always label the graph so it is clear what data is represented
 the vertical axis will always represent the cumulative frequency and should be
marked as such
 draw the ogive freehand – remember it is a curved graph

Now we can use the ogive to estimate the median and the quartiles.
To find the estimated median:
 Find half of the total number of values (155 ÷ 2 = 77,5 – the decimal will not
be important – remember that it is an estimate)
 Mark this number on the vertical axis and draw a horizontal line to the ogive
 From this point, draw a vertical line down to the horizontal axis
 The reading will give the estimated median.

The estimated median is 51% or 52%

To find the lower quartile:
 Find a quarter of the total number of values (¼ × 155 = 38,75 – the decimal will not be
important – remember that it is an estimate)
 Mark this number on the vertical axis and draw a horizontal line to the ogive
 From this point, draw a vertical line down to the horizontal axis
 The reading will give the estimated lower quartile
The estimated lower quartile is 40%
You could also be asked to read information from the graph given information on the
horizontal axis. For example, estimate how many learners got 60% or more.
 Find 60 on the horizontal axis and draw a vertical line to the ogive
 Draw a horizontal line to the cumulative frequency (vertical axis)
 However, keep in mind that the question said 60 or more (the reading would be
the answer if the question had been 60 or less)
 Use the reading to subtract from the total number of learners (155)
The number of learners who achieved 60% or more is 52 (155 – 103)
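Reading values off the ogive can be imitated in Python. This is only a sketch: it joins the coordinates with straight lines (linear interpolation) instead of a freehand curve, so the readings may differ slightly from those taken off a drawn graph. The function names are hypothetical.

```python
# Ogive coordinates: (upper boundary, cumulative frequency)
points = [(15, 0), (20, 2), (25, 7), (30, 15), (35, 25), (40, 38), (45, 55),
          (50, 75), (55, 91), (60, 103), (65, 118), (70, 135), (75, 155)]

def mark_at(cumulative_frequency):
    """Go across from the vertical axis to the ogive, then down to the mark."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= cumulative_frequency <= y1:
            return x0 + (x1 - x0) * (cumulative_frequency - y0) / (y1 - y0)

def learners_up_to(mark):
    """Go up from the horizontal axis to the ogive, then across to the total."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= mark <= x1:
            return y0 + (y1 - y0) * (mark - x0) / (x1 - x0)

total = points[-1][1]                   # 155 learners
median = mark_at(total / 2)             # read off at 77.5
lower_quartile = mark_at(total / 4)     # read off at 38.75
sixty_or_more = total - learners_up_to(60)

print(round(median, 1), round(lower_quartile, 1), sixty_or_more)
```

The straight-line readings (about 50,8 and 40,2) are close to the freehand estimates of 51% and 40%, and the ‘60% or more’ answer is again 52.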

Three fully worked examples from previous Grade 12 exams:

Example
The amount of money, in rands, that learners spent while visiting a tuckshop at school on a
specific day was recorded. The data is represented in the ogive below.

An incomplete frequency table is also given for the data:

a) How many learners visited the tuckshop on that day?


b) Write down the modal class of this data.
c) Determine the values of 𝑎 and 𝑏 in the frequency table.
d) Use the ogive to estimate the number of learners that spent at least R45 on the day the
data was recorded at the tuckshop.
MAR 2017
Solutions:
a) 65 learners
b) 30 ≤ 𝑥 < 40
c) 𝑎 = 12
𝑏 = 61 − 45
= 16
d) 10 or 11
Notes:
(a) This is the total represented in the last coordinate.
(b) This can usually be seen by the part of the curve (from one coordinate to the next) that
increases the most quickly. It is safer, however, to look at each of the 𝑦-coordinates and
calculate which interval has the most data.
(c) The value of 𝑎 can be easily read from the coordinate (20; 12) as no values have been
accumulated yet. The value of 𝑏 requires a subtraction calculation: the accumulated amount
at the end of that interval subtract the accumulated amount at the end of the previous interval.

Example
The box and whisker diagram below shows the marks (out of 80) obtained in a History test by
a class of 9 learners.

a) Comment on the skewness of the data.


b) Write down the range of the marks obtained.
c) If the learners had to obtain a mark of 32 to pass the test, estimate the percentage of
the class that failed the test.
d) In ascending order, the second mark is 28, the third mark 36 and the sixth mark 69.
The seventh and eighth marks are the same. The average for the test is 54.
___ 28 36 ___ ___ 69 ___ ___ ___

Fill in the marks of the remaining learners in ascending order.


MAR 2016
Solutions:
a) The data is skewed to the left
b) 80 − 20 = 60
c) 25% of the learners failed
d) 54 = (445 + 𝑥) ÷ 9
486 = 445 + 𝑥
41 = 𝑥
The marks in ascending order are 20; 28; 36; 41; 62; 69; 75; 75; 80.
Notes:
(a) The data is clearly more spread out on the left-hand side looking at the box.
(b) Largest value – smallest value
(c) 32 is marked as the lower quartile. Therefore, the quarter of learners in the first quartile
must have all failed.
(d) First fill in the lowest and highest values – these are easily read from the box and whisker
plot.
75 is also a mark clearly shown on the plot and as it lies between 69 and 80 it must be both
the 7th and 8th value, as they were given as equal.
62 is also a mark clearly shown on the plot but until the final value is found we can’t know
exactly which of the final two places it will go.
If the 8 values we know are totalled, we know that total plus the one unknown will give an
average of 54.

Example
A company recorded the number of messages sent by e-mail over a period of 60 working
days. The data is shown in the table below:

a) Estimate the mean number of messages sent per day, rounded to two decimal places.
b) Draw a cumulative frequency graph (ogive) of the data on the grid.
c) Hence, estimate the number of days on which 65 or more messages were sent.
MAR 2016
Solutions:
a) Estimated mean = 51,33
b) See the steps in note (b).
c) 60 – 48 = 12 days
Notes:
(a) Find the midpoint of each interval, multiply it by the frequency, total the products and
divide by 60.
(b)
 First make a cumulative frequency column
 ‘Ground’ the ogive. The first coordinate is always the lower boundary of the first interval
and 0. In this case (10; 0)
 All the other coordinates are made up of the upper boundary of each interval with the
corresponding cumulative frequency
 Join the points freehand – it should resemble a curve
(c) Find 65 on the horizontal axis representing the number of messages. Read off the
corresponding number on the vertical axis (cumulative frequency). Subtract this reading from
the total as the question said ‘or more’.

Further reading, listening or viewing activities related to this topic are available on the
following web links:
https://www.youtube.com/watch?v=sBK_oE8KDx8
(drawing an ogive)
https://www.youtube.com/watch?v=n5rhuZDbYCM
(determining skewness in ogive curves)
https://www.youtube.com/watch?v=XSSRrVMOqlQ
(what is skewness)
https://www.youtube.com/watch?v=mwT3ykS8r08
(skewed data and outliers)
https://www.youtube.com/watch?v=WVx3MYd-Q9w
(calculating standard deviation)
https://www.youtube.com/watch?v=qqOyy_NjflU
(how to calculate variance and standard deviation)
https://www.youtube.com/watch?v=KwpcKCX51ro
https://www.youtube.com/watch?v=kJrhyb6aG3A
(estimated mean)

Scatter plots, least squares regression line, correlation coefficient

SCATTERPLOTS
A scatterplot is a graph in which the values of two variables are plotted along two axes.
The main reason for representing the information on a scatterplot is to find if there is a
connection (correlation) between the two sets of data. The pattern of the resulting points
reveals this.

Using the scatterplot, note the following:
 This scatterplot shows the marks of a class of learners for mathematics and science
 Each point represents one learner. His/her mark is read off as a coordinate
 The 2 lowest marks represented here are 44% for mathematics and 52% for Science
and 44% for mathematics and 56% for Science
 The 2 highest marks represented here are 82% for mathematics and 86% for
Science and 86% for mathematics and 84% for Science
 As the points are all packed closely together, it shows that there is a strong
correlation (the degree of correlation). In other words, there is a strong relationship
between the mathematics marks and the science marks.
 Due to the cluster of points all forming an increasing pattern, there is a positive
correlation (the type of correlation).
 One could say that the higher your mathematics mark, the higher your science mark
is likely to be or the lower your mathematics mark, the lower your science mark is
likely to be.
 Note here that if the pattern had been clustered together but decreasing, the
correlation would still have been strong, but it would be a negative correlation. In
other words, one could say the higher the mathematics mark, the lower the science
mark or the lower the mathematics mark, the higher the science mark.

A summary to describe correlation:
Degree of correlation: no correlation, weak correlation, strong correlation, perfect correlation

Type of correlation: positive linear, negative linear, exponential, curved (parabolic)

You can draw a line of best fit on a scatterplot. In Grade 9 you did this by eye: the line was
not calculated, it was simply drawn to take all the plotted points into account. Each person
draws such a line intuitively, without using mathematical calculations.
In Grade 12, the actual line of best fit is calculated and drawn.

THE LEAST SQUARES REGRESSION LINE

As we are finding the equation of a straight line, what do you expect it to look like?
𝑦 = 𝑚𝑥 + 𝑐 or 𝑦 = 𝑎𝑥 + 𝑞
As you know, different variables can be used to represent the gradient and the 𝑦 −intercept.
On the calculator, in this section, 𝐴 is used for the 𝑦 −intercept and 𝐵 is used for the
gradient.
Also, don’t be put off by the calculator writing the equation as 𝑦 = 𝐴 + 𝐵𝑥, with the constant
first, which is not the order you are familiar with.
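To see what the calculator is doing, the sketch below computes 𝐴 and 𝐵 with the standard least squares formulas: 𝐵 is the sum of (𝑥 − 𝑥̅)(𝑦 − 𝑦̅) divided by the sum of (𝑥 − 𝑥̅)², and 𝐴 = 𝑦̅ − 𝐵𝑥̅. These formulas are not stated in these notes, and the function name and data are hypothetical, so treat it as an illustration only.

```python
def least_squares(xs, ys):
    """Return (A, B) for the least squares regression line y = A + Bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of the products of the deviations from the two means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of the squared deviations of the x-values
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # gradient
    a = y_bar - b * x_bar    # y-intercept
    return a, b

# Hypothetical data that lies exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = least_squares(xs, ys)
print(a, b)
```

Because the hypothetical points lie exactly on a straight line, the function should return the 𝑦-intercept 1 and the gradient 2.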

Calculator work:
Steps to follow What you should see
1 MODE, choose STAT (2)

2 Choose 𝐴 + 𝐵𝑋 (2)

3 Enter data, pressing = after each number in each column


4 Press: AC; SHIFT, STAT (1)
5 Choose Reg (5)

Note:
𝐴, 𝐵 and 𝑟 are all important. For finding the equation of the least squares regression
line, 𝐴 and 𝐵 are relevant now.
6 Choose 𝐴 (1) and press =

7 Repeat steps 4 and 5


8 Choose 𝐵 (2) and press =

Use this table of three sets of points to practice your calculator work. The solutions are
below.

Set 1: 𝑦 = 1,571 + 0,179𝑥 Set 2: 𝑦 = 160,521 − 0,963𝑥 Set 3: 𝑦 = 6,474 + 0,067𝑥

When drawing the line, we need to ensure it is accurate. The y-intercept is usually the
easiest point to plot, but the other point will be the average of each set of values. That can
also be found on the calculator. The steps are as follows and can be used from the screen
with both sets of data entered:
1 Press: AC; SHIFT, STAT (1)
2 Choose Var (4)
3 Choose 𝑥̅ (2). Read as 𝑥 bar – this is the mean of all the 𝑥-values.
4 Press =
5 Press: AC; SHIFT, STAT (1)
6 Choose Var (4)
7 Choose 𝑦̅ (5). Read as 𝑦 bar – this is the mean of all the 𝑦-values.
8 Press =

This point needs to be plotted and used to draw the line. If the 𝑦 −intercept is not a value on
the axes, you can substitute any 𝑥 −value into the equation found and find a corresponding
𝑦 −value.
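The reason the (𝑥̅ ; 𝑦̅) point can be used to draw the line is that the least squares regression line always passes through it. The sketch below checks this on hypothetical data, again using the standard formulas (not stated in these notes) for 𝐴 and 𝐵; all names are illustrative.

```python
# Hypothetical data, for illustration only
xs = [2, 4, 6, 8, 10]
ys = [5, 4, 8, 9, 14]

n = len(xs)
x_bar = sum(xs) / n  # mean of the x-values
y_bar = sum(ys) / n  # mean of the y-values

# Standard least squares formulas for the gradient B and y-intercept A
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Substituting x-bar into y = A + Bx should give y-bar
y_on_line_at_x_bar = a + b * x_bar

# Any other x-value can be substituted to get a second point to plot
second_point = (4, a + b * 4)

print((x_bar, y_bar), y_on_line_at_x_bar, second_point)
```

Substitution gives a second point in the same way whenever the 𝑦-intercept does not fit on the axes.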
Fully worked examples:
Notes for both examples:
Remember to:
 draw axes and label them clearly
 consider the scale and what numbers are needed on each axis
 find the coordinates from the table
 if the 𝑦 − intercept does not lie within the range available, any point can be found
using substitution (as in the first example where 4 was used)

Example
The table below shows the shoe size and mass of 10 boys.
a) Find the equation of the least squares regression line
b) Draw a scatter plot
c) Draw the least squares regression line onto the scatter plot.
Notes
First think about what we may be looking for with this set of data - a correlation between the
size shoe a boy wears and his mass.
Would you expect a correlation? Yes – in general, the bigger a boy is, the higher his mass is
likely to be.
When the equation of the least squares regression line has been found, check: is the
gradient positive or negative? (positive)
Was this expected? Yes – we noted that the correlation was likely to be positive.
Solution:
a) 𝑦 = 44,464 + 4,086𝑥
b) and c)

Example
The table below shows the number of people who visited a museum over a 10-day period
during the summer holidays and the hours of sunshine on each day.
a) Find the equation of the least squares regression line
b) Draw a scatter plot
c) Draw the least squares regression line onto the scatter plot.
Notes
First think about what we may be looking for with this set of data - a correlation between the
number of visitors to the museum and how sunny the day was.
Would you expect a correlation? Yes – but a negative one – the less sunshine there is the
more people are likely to visit a museum.
When the equation of the least squares regression line has been found, check: is the
gradient positive or negative? (negative)
Was this expected? Yes – we noted that the correlation was likely to be negative.
Solution:
a) 𝑦 = 470,326 − 40,443𝑥
b) and c)

DESCRIBING CORRELATION

Look at the first scatter plot concerning shoe size and mass. What degree of correlation
would you say there is between the two sets of data? (Strong)
Why? The points are clustered quite close together forming a straight line.
And note the gradient - It is positive, the line slopes in an upwards direction.
This tells us that there is a strong positive correlation between the two sets of data.

Look at the second scatter plot concerning sunshine and a museum visit. What degree of
correlation would you say there is between the two sets of data? (Strong)
Why? The points are clustered quite close together forming a straight line.
And note the gradient - It is negative, the line slopes in a downwards direction.
This tells us that there is a strong negative correlation between the two sets of data.

Calculators are used to find an actual value that can tell us how strong or weak the
correlation between the data is. It will also tell us whether the correlation is positive or
negative.
Remember: If the correlation is positive, it means as one value gets bigger the 2nd value is
likely to also get bigger.
If the correlation is negative, it means as one value gets bigger the 2nd value is likely to get
smaller.

Calculator work.
Use the instructions for finding 𝐴 and 𝐵 in the least squares regression line.
On the screen where 𝐴 and 𝐵 are listed, note the 𝑟 value as mentioned earlier.
This is the correlation coefficient. Press 3 and =.
Use the data from the previous two examples. You should get 0,8954 and −0,93447
respectively.
The correlation coefficient will ALWAYS lie from −1 to 1. The closer it is to −1 or 1, the
stronger the correlation is. The closer it is to zero, the weaker the correlation is. A correlation
coefficient of −1 or 1 would show a perfect correlation between the two sets of data. A
correlation coefficient of 0 would show no linear correlation at all.
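For reference, the value of 𝑟 that the calculator reports can be reproduced with the standard formula: the sum of (𝑥 − 𝑥̅)(𝑦 − 𝑦̅) divided by the square root of the sum of (𝑥 − 𝑥̅)² times the sum of (𝑦 − 𝑦̅)². The formula is not given in these notes, and the function name and data below are hypothetical.

```python
import math

def correlation_coefficient(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [2, 4, 6, 8, 10]             # hypothetical x-values
increasing = [5, 4, 8, 9, 14]     # scattered but increasing
perfect_negative = [10, 8, 6, 4, 2]  # exactly on a decreasing line

r_increasing = correlation_coefficient(xs, increasing)
r_negative = correlation_coefficient(xs, perfect_negative)
print(round(r_increasing, 4), round(r_negative, 4))
```

The first result is positive and close to 1 (a strong positive correlation), while the second is −1, a perfect negative correlation.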

Remember that the negative correlation does not mean there isn’t a correlation. It
means that as one set of data gets bigger, the other set is getting smaller.

The following summary will assist you in describing the correlation between two sets of data.
The correlation coefficient (𝑟) ranges from −1 to 1. This can be written in the form

−1 ≤ 𝑟 ≤ 1

Regarding the least squares regression line:


Using the shoe and mass data and the equation y = 44,464 + 4,086x that was found,
two types of questions could be asked.
Using the equation of the least squares regression line to estimate the mass of a boy with a:
(a) size 8,5 shoe
(b) size 4 shoe
Both values are those represented by the 𝑥 −values.
However, size 8,5 falls within the range of data given but size 4 lies outside the range.
We can substitute the values and find the answers:
a) 𝑦 = 44,464 + 4,086𝑥
𝑦 = 44,464 + 4,086(8,5)
𝑦 = 79,195
b) 𝑦 = 44,464 + 4,086𝑥
𝑦 = 44,464 + 4,086(4)
𝑦 = 60,808
Because the first value was within the range of data given, it is said that we are
interpolating, and the result is considered to be reliable.
Because the second value was NOT in the range of data given, it is said that we are
extrapolating, and the result is considered to be unreliable.
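The two substitutions, together with the interpolation or extrapolation label, can be sketched in Python. The equation is the one found for the shoe size and mass data. The smallest and largest shoe sizes (5 and 12) are assumptions made purely for illustration, because the original data table is not reproduced here; the function name is also hypothetical.

```python
A = 44.464  # y-intercept of the regression line found earlier
B = 4.086   # gradient of the regression line found earlier

def estimate_mass(shoe_size, smallest_size, largest_size):
    """Estimate the mass and say whether the x-value is inside the data range."""
    mass = A + B * shoe_size
    if smallest_size <= shoe_size <= largest_size:
        kind = "interpolation (reliable)"
    else:
        kind = "extrapolation (unreliable)"
    return round(mass, 3), kind

# Assumed data range: shoe sizes 5 to 12 (hypothetical values)
print(estimate_mass(8.5, 5, 12))
print(estimate_mass(4, 5, 12))
```

With this assumed range, size 8,5 is labelled as interpolation and size 4 as extrapolation, matching the reasoning above.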

Fully worked examples and notes

Example
A survey was conducted at a local supermarket relating the distance that shoppers lived from the
store to the average number of times they shopped at the store in a week. The results are shown
in the table below.

a) Use the scatter plot to comment on the relationship between the distance a shopper lived
from the store and the average number of times she/he shopped at the store in a week.
b) Calculate the correlation coefficient of the data.
c) Calculate the equation of the least squares regression line of the data.
d) Use the answer in (c) to estimate the average number of times that a shopper living 6 km
from the supermarket will visit the store in a week.
e) Sketch the least squares regression line on the scatter plot.
NOV 2016
Notes
The questions are all very similar to those already covered. Remember to find the (𝑥̅ ; 𝑦̅)
point to ensure that the least squares regression line is accurate.
Solutions:
a) Strong negative correlation
b) −0,9462 …
c) 𝑦 = 11,71 − 1,12𝑥
d) 𝑦 = 11,71 − 1,12(6)
𝑦 = 4,99
∴ 5 times
e) Draw the line through the 𝑦-intercept and the (𝑥̅ ; 𝑦̅) point.

Example
An ice cream shop recorded the sales of ice cream, in rand, and the maximum
temperature, in °C, for 12 days in a certain month. The data that they collected is
represented in the table and scatter plot below.

a) Describe the influence of the temperature on the sales of ice cream in the scatter
plot.
b) Give a reason why this trend cannot continue indefinitely.
c) Calculate an equation for the least squares regression line.
d) Calculate the correlation coefficient.
e) Comment on the strength of the relationship between the variables.
MAR 2015
Notes
The questions are all very similar to those already covered.
In order to answer (b), you need to think of what is and isn’t possible in real life.
Solutions:
a) As the temperature increases, the ice cream sales also increase.
b) The temperature cannot increase indefinitely.
c) 𝑦 = −460,35 + 30,09𝑥
d) 𝑟 = 0,96
e) Very strong, positive correlation

Further reading, listening or viewing activities related to this topic are available on the
following web links:

https://www.youtube.com/watch?v=jEEJNz0RK4Q
(why a least squares regression line is called this)

https://www.youtube.com/watch?v=40EU5HMrDOw
(how to find the equation of the least squares regression line on the Casio calculator)

https://www.youtube.com/watch?v=ugd4k3dC_8Y
(correlation coefficient explained)

https://www.youtube.com/watch?v=jf-SIOFUuEo
(interpreting the correlation coefficient)
