Understanding Statistics in Education
NATURE OF STATISTICS
Statistics...the most important science in the whole world: for upon it depends the practical
application of every other science and of every art: the one science essential to all political
and social administration, all education, and all organization based on experience, for it only
gives the results of our experience. We need to understand statistics so that we can make the
proper judgments when a person or a company presents us with an argument backed by data.
Data are numbers with a context. To properly perform statistics we must always keep the
meaning of our data in mind.
What is Statistics?
Statistics is concerned with the process of finding out about real phenomena by
collecting and making sense of data. Its focus is on extracting meaningful patterns
from the variation which is always present in the data. An important feature is the
quantification of uncertainty so that we can make firm decisions and yet know how
likely we are to be right.
Statistics (singular) is the discipline that concerns the collection, organization,
displaying, analysis, interpretation and presentation of data. It is also defined as a
branch of mathematics dealing with the collection, analysis, interpretation, and
presentation of masses of numerical data.
Descriptive Statistics
Descriptive statistics is concerned with describing or summarizing the numerical
properties of data. Some of the methodologies of descriptive statistics include classification,
tabulation, graphical representation and calculation of certain indicators like mean, median,
range etc., which summarise certain important features of data.
With descriptive methods, we cannot draw conclusions beyond the data at hand, but we can
provide useful information regarding the nature of a specific group of individuals.
Inferential Statistics
Inferential Statistics is also known as statistical inference. It is concerned with the derivation
of scientific inference about the generalization of results from the study of a few particular
cases. The methods of statistical inference help in generalizing the results of a sample to the
entire population from which the sample is drawn.
Examples of inferential methods include the chi-square test, the t-test and ANOVA.
Why Statistics?
In order to put the topic into context, let me start by listing some of the uses of statistics
before zeroing in on the specific uses in the teaching profession:
(a) Statistics helps in providing a better understanding of a phenomenon
(b) Statistics helps in systematically inquiring into an issue.
(c) Statistics helps in collecting appropriate data
(d) Statistics helps in presenting complex data in a suitable tabular, visual form
through charts and diagrams for easy understanding of the data
(e) Statistics helps in understanding the nature and pattern of variability of a
phenomenon through observations
(f) Statistics helps in drawing valid conclusions (inferences).
Statistics is important to the teaching profession because it helps a teacher know when
teaching has been done effectively. Teachers can use statistics to determine whether the class
understands the material or whether they need to cover more of it, by administering
assignments/homework, tests and examinations.
Statistics are important to teachers for several reasons, and not just for the obvious one of
checking on students and their progress in school.
These reasons could include ensuring that the quality of education is kept high, monitoring
students’ progress, monitoring the teacher’s progress or success, and checking the
effectiveness of a subject.
Statistics are produced on the size of a school or college; the number of pupils or students
enrolled by gender; the composition of teachers by gender, age or qualification; workload;
the number of classes or periods taught per week; and trends in enrolment, pass rates etc.
It is necessary that those involved in the provision of education at various levels have some of
the statistical skills and reasoning necessary to interpret and use that information about
institutions of learning (in this case schools and colleges), teachers/lecturers, and
pupils/students to improve the education system.
Statistics such as achievement trends over time, or comparison data for provinces and
comparable systems, can help educators develop ways to improve student learning. There is a
need for educators to have a sufficient understanding of statistics to make use of them in the
prevention of errors in decision-making.
The quality of education is heavily dependent on the performance of teachers. Teachers can
play a very important role in monitoring progress made towards achieving educational goals,
specifically with reference to a range of indicators.
The first and possibly the most important reason why teachers use statistics is so that they are
able to monitor pupils’/students’ progress throughout the term, semester or year. By giving
pupils/students homework/assignments, tests and end of term/semester/year examinations,
teachers are able to keep track of pupils’/students’ performance.
A teacher plays so many roles and one of these is guidance and counselling. This is a skill
that every teacher must have as they are always involved in guiding and counselling
pupils/students. There is a positive correlation between guidance and counselling and career
decision-making. Effective guidance and counselling have a positive influence on students’
career decision-making. In order for this to happen, the use of statistics is necessary.
3. School Attendance
Encouraging regular school attendance is one of the most powerful ways of preparing
children for success both in school and in life. When you make school attendance a priority,
you help your children get better grades, develop good life habits, and have a better chance of
graduating from school/college. When students are absent for some days, their grades and
numeracy and reading skills may be affected. Every teacher keeps a record of the class
attendance of their pupils/students. The attendance records are then analysed using statistics.
School attendance ratios can be calculated. Specifically, the Net Attendance Ratio (NAR) and
the Gross Attendance Ratio (GAR) are often calculated. The NAR indicates participation in
primary schooling for the population aged 7-13 and in secondary schooling for the population
aged 14-18. The GAR measures participation at each level of schooling among
those of any age from 5 to 24 years. All of these begin in the classroom.
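The text does not give formulas for these ratios, so the following is only a minimal sketch. It assumes the definitions commonly used in household surveys: the NAR divides pupils of the official age range attending a level by the whole population of that age range, while the GAR divides pupils of any age (5 to 24) attending that level by the same population. The function name and all counts are hypothetical.

```python
def attendance_ratios(attending_of_official_age, attending_any_age, official_age_population):
    """Return (NAR, GAR) as percentages for one level of schooling.

    Assumed definitions (not stated in the text):
      NAR = pupils of the official age attending the level / population of that age
      GAR = pupils of any age attending the level / population of the official age
    """
    if official_age_population <= 0:
        raise ValueError("official_age_population must be positive")
    nar = 100 * attending_of_official_age / official_age_population
    gar = 100 * attending_any_age / official_age_population
    return nar, gar


# Hypothetical primary-level counts for one community.
nar, gar = attendance_ratios(
    attending_of_official_age=800,   # children aged 7-13 attending primary school
    attending_any_age=950,           # primary pupils of any age from 5 to 24
    official_age_population=1000,    # all children aged 7-13
)
print(f"NAR = {nar:.1f}%, GAR = {gar:.1f}%")
```

Under these assumed definitions the GAR can exceed 100 per cent when many over-age or under-age pupils attend a level, whereas the NAR cannot.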
Analysis of examination results requires the use of statistics. For example, early this year it
was announced that “54 schools scored a hundred per cent pass rate. 2016 recorded an
increased proportion of pupils obtaining BECE with a 4.9 per cent shoot up from 2015. The
per cent of boys who obtained full certificates was higher than that of girls pegged at 63.95
per cent and 59.57 per cent respectively.”
The results indicate that grant-aided schools topped the performance list, followed by private
and public schools. It is clear that in this announcement of examination results, statistics
were used to compare the performance of pupils between 2015 and 2016, between boys and girls,
between provinces, and by ownership of schools. This implies that the statistical
concepts of proportion and ranking were used. Proportions were used to standardise the
results in order to enable comparison, since the number of pupils differed from one province
to the other. This analysis can also be done by teachers at the class level.
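As a rough illustration of how proportions standardise results before ranking, the sketch below converts raw pass counts into pass rates and then ranks the groups. The class names and counts are hypothetical and are not taken from the announcement quoted above.

```python
def pass_rate(passed, sat):
    """Percentage of candidates who passed, out of those who sat the examination."""
    if sat <= 0:
        raise ValueError("sat must be positive")
    return 100 * passed / sat


# Hypothetical (passed, sat) counts for two classes of different sizes.
results = {"Class A": (45, 60), "Class B": (120, 200)}

rates = {name: pass_rate(passed, sat) for name, (passed, sat) in results.items()}

# Rank from the highest pass rate to the lowest.
ranking = sorted(rates, key=rates.get, reverse=True)

for name in ranking:
    print(f"{name}: {rates[name]:.1f}%")
```

Class B has more passes in absolute terms, yet Class A ranks first once the counts are expressed as proportions, which is why the comparison is made on rates and not on raw numbers.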
Smaller classes are often seen as beneficial because they allow teachers/lecturers to focus
more on the needs of individual pupils/students.
6. Teachers’/lecturers’ Records
The statistics gathered by teachers and lecturers in classrooms can have
great effects on education institutions and can bring about many improvements that would
otherwise probably have been overlooked.
If these statistics are looked at and analysed properly then people will have the power to
improve in the weak areas. If this goes on every year, the quality of education will continue
to improve every year.
The most important reason why teachers/lecturers use statistics is that they are able to
monitor students’ progress throughout the school term, semester, or year. Statistics can also
be used by education institutions, in general, to assess how well the students are doing in a
particular subject or course of study.
Statistics can also show where there is possible room for improvement; by analysing the data,
these improvements can be implemented as quickly as possible.
Statistics are used in the calculation of educational attainment and school attendance ratios.
In a Demographic Health Survey, educational attainment is one of the variables considered in the
background characteristics since it is believed that it is one of the most influential factors
affecting people’s knowledge, attitudes, and behaviours in various aspects of life.
Educational attainment is split into female and male for comparison.
Student attrition is the number of students who leave a programme of study before it is
finished. Teacher attrition is the number of teachers/lecturers who do not continue with their
work.
Teachers/lecturers are lost for a number of reasons, such as being assigned to non-teaching
jobs, expiry of contract, resignation, dismissal, retirement and death. Statistics are
very important to decision-makers in education for policy formulation.
Variables
A variable is a characteristic that varies from person to person, text to text, or
object to object. Simply put, variables are features or qualities that change (Mackey & Gass,
2005). A value is an assigned number or label representing the attribute of a given individual
or object. For example, marital status as a variable can be broken down into categories and
given values as never married - 1, married - 2, divorced - 3 and widowed - 4. The number of
children in a family as a variable can be given the values 0, 1, 2, 3, 4 etc. Height can take on
values such as 1.2 metres, 1.7 metres, 2.0 metres and 2.2 metres. Religious affiliation can be
broken down to categories and given values as Christian – 1, Moslem – 2, Traditionalist – 3,
Buddhist - 4.
In contrast with categorical variables, continuous variables are variables that can take on
values along a continuum, for example age, income, weight and height. Therefore, the
type of data produced differs from one category of variable to another.
Discrete variables are countable in a finite amount of time. For example, you can count the
change in your pocket. Variables that can only take on a finite number of values are called
"discrete variables." All qualitative variables are discrete. Some quantitative variables are
discrete, such as performance rated as 1,2,3,4, or 5, or temperature rounded to the nearest
degree.
2. Qualitative versus Quantitative variables
Qualitative variables are those that vary in kind. Rating something as ‘attractive’ or not,
‘helpful’ or not, or ‘consistent’ or not are examples of qualitative variables.
By contrast, reporting the number of times something happened or the number of times someone
engages in a particular behaviour are examples of quantitative variables, because they provide
information regarding the amount of something (Marczyk, DeMatteo, & Festinger, 2005).
Scales of Measurement
Depending upon the traits/attributes/characteristics and the way they are measured, different
kinds of data result representing different scales of measurement.
Level of measurement refers to the particular way that a variable is measured within
scientific research, and scale of measurement refers to the particular tool that a researcher
uses to sort the data in an organized way, depending on the level of measurement that they
have selected. Choosing the level and scale of measurement is an important part of the
research design process because it is necessary for systematized measuring and
categorizing of data, and thus for analysing the data and drawing valid conclusions from them.
Within science, there are four commonly used levels and scales of measurement: nominal,
ordinal, interval, and ratio. Each level of measurement and its corresponding scale is able
to measure one or more of the four properties of measurement, which include identity,
magnitude, equal intervals, and a minimum value of zero.
There is a hierarchy of these different levels of measurement. With the lower levels of
measurement (nominal, ordinal), assumptions are typically less restrictive and data analyses
are less sensitive. At each level of the hierarchy, the current level includes all the qualities of
the one below it in addition to something new. In general, it is desirable to have higher levels
of measurement (interval or ratio) rather than a lower one. Let’s examine each level of
measurement and its corresponding scale in order from lowest to highest in the hierarchy.
A nominal scale is used to name the categories within the variables you use in your research.
This kind of scale provides no ranking or ordering of values; it simply provides a name for
each category within a variable so that you can track them among your data. Which is to say,
it satisfies the measurement of identity, and identity alone.
Common examples within sociology include the nominal tracking of sex (male or
female), race (White, Black, Hispanic, Asian, American Indian, etc.), and class (poor,
working-class, middle class, upper class). Of course, there are many other variables one can
measure on a nominal scale.
The nominal level of measurement is also known as a categorical measure and is considered
qualitative in nature. When doing statistical research and using this level of measurement,
one would use the mode, or the most commonly occurring value, as a measure of central
tendency.
Ordinal scales are used when a researcher wants to measure something that is not easily
quantified, like feelings or opinions. Within such a scale the different values for a variable are
progressively ordered, which is what makes the scale useful and informative. It satisfies both
the properties of identity and of magnitude. However, it is important to note that such a
scale is not quantifiable: the precise differences between the various categories are
unknowable.
Within sociology, ordinal scales are commonly used to measure people's views and opinions
on social issues, like racism and sexism, or how important certain issues are to them in the
context of a political election. For example, if a researcher wants to measure the extent to
which a population believes that racism is a problem, they could ask a question like "How big
a problem is racism in our society today?" and provide the following response options: "it's
a big problem," "it is somewhat a problem," "it is a small problem," and "racism is not a
problem."
When using this level and scale of measurement, it is the median which denotes the central
tendency.
Unlike nominal and ordinal scales, an interval scale is a numeric one that allows for ordering
of variables and provides a precise, quantifiable understanding of the differences between
them (the intervals between them). This means that it satisfies the three properties of identity,
magnitude, and equal intervals.
Age is a common variable that sociologists track using an interval scale, like 1, 2, 3, 4, etc.
One can also turn non-interval, ordered variable categories into an interval scale to
aid statistical analysis. For example, it is common to measure income as a range, like ¢0-¢9,999;
¢10,000-¢19,999; ¢20,000-¢29,999, and so on. These ranges can be turned into
intervals that reflect the increasing level of income, by using 1 to signal the lowest category,
2 the next, then 3, etc.
Interval scales are especially useful because they not only allow for measuring the frequency
and percentage of variable categories within our data, they also allow us to calculate the mean,
in addition to the median and mode. Importantly, with the interval level of measurement, one can
also calculate the standard deviation.
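The pairing of scale and summary measure described above (mode for nominal data, median for ordinal data, mean and standard deviation from the interval level upward) can be sketched with Python's standard `statistics` module. All of the data values below are hypothetical, and the numeric codes for the ordinal responses are an assumed coding.

```python
import statistics

# Nominal: categories only, so the mode is the appropriate measure.
sex = ["female", "male", "female", "female", "male"]
nominal_mode = statistics.mode(sex)

# Ordinal: ordered response codes (assumed: 1 = "not a problem" ... 4 = "big problem"),
# so the median is meaningful but distances between codes are not.
responses = [1, 2, 2, 3, 4]
ordinal_median = statistics.median(responses)

# Interval: equal intervals, so the mean and standard deviation can be computed.
ages = [10, 12, 14, 16, 18]
interval_mean = statistics.mean(ages)
interval_sd = statistics.pstdev(ages)   # population standard deviation

print(nominal_mode, ordinal_median, interval_mean, round(interval_sd, 2))
```

`statistics.pstdev` treats the values as a whole population; `statistics.stdev` would be the choice if the ages were a sample drawn from a larger group.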
The ratio scale of measurement is nearly the same as the interval scale; however, it differs in
that it has an absolute value of zero, and so it is the only scale that satisfies all four properties
of measurement.
A sociologist would use a ratio scale to measure actual earned income in a given year, not
divided into categorical ranges, but ranging from $0 upward. Anything that can be measured
from absolute zero can be measured with a ratio scale, like for example the number of
children a person has, the number of elections a person has voted in, or the number of friends
who are of a race different from the respondent.
One can run all the statistical operations as can be done with the interval scale, and even more
with the ratio scale. In fact, it is so-called because one can create ratios and fractions from the
data when one uses a ratio level of measurement and scale.
Comparison
UNIT TWO
Data are a set of facts and provide a partial picture of reality. Whether data are being
collected with a certain purpose or collected data are being utilized, questions regarding what
information the data are conveying, how the data can be used, and what must be done to
include more useful information must constantly be kept in mind.
Since most data are available to researchers in a raw format, they must be summarized,
organized, and analyzed to usefully derive information from them. Furthermore, each data set
needs to be presented in a certain way depending on what it is used for. Planning how the
data will be presented is essential before appropriately processing raw data.
First, a question for which an answer is desired must be clearly defined. The more detailed
the question is, the more detailed and clearer the results are.
The type of data collected and how the associated sampling takes place depend on the
statistical question asked.
There are a number of aspects of data collection that need to be considered when carrying out
a statistical investigation.
2. Draw the frequency distribution table and use it to analyse data.
3. Explain and use appropriate text, graphs and tables to summarise your data.
Choose the questions (e.g. "Do you like …?", "How tall is …?") that will determine
the type of data collected (e.g. categorical, numerical).
Choose a sample size that will allow confidence in the conclusion.
Use repeated sampling to demonstrate how much variation occurs in samples of
different sizes.
Avoid bias in data collection.
Explore sources of bias in various contexts where data are collected.
Understand the importance of random selection.
Explore various methods of random sampling.
Know the relationship between samples and populations.
Learn to define populations and samples in various statistical contexts.
- chemical tests, e.g. quality of water
- health testing tools, e.g. blood pressure
- citizen report cards
Data trimming is the process of removing or excluding extreme values, or outliers, from
a data set. Data trimming is used for a number of reasons and can be accomplished using
various approaches. As social scientists, communication researchers often work with data
sets that may require the removal of outliers to strengthen a statistic and accomplish a
number of research goals. It is important to understand the impact outliers can have on
data and the approaches available to eliminate or censor these extreme values without
compromising the data set.
1. Analyze your data to make sure the outlier isn’t a result of measurement error or
some other fixable error.
2. Decide how much Winsorization you want. This is specified as a total percentage of
untouched data. For example, if you want to Winsorize the top 5% and bottom 5% of
data points, this is equal to 100% – 5% – 5% = 90% Winsorization. An 80%
Winsorization means that 10% is modified from each tail area.
3. Replace the extreme values by the maximum and/or minimum values at the threshold.
Note that winsorizing is not equivalent to simply excluding data, which is a simpler
procedure, called trimming or truncation, but is a method of censoring data. In a trimmed
estimator, the extreme values are discarded; in a winsorized estimator, the extreme values
are instead replaced by certain percentiles (the trimmed minimum and maximum).
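The difference between trimming and winsorizing can be sketched as follows. This is a minimal illustration, not a standard library routine: the function names, the parameter `k` (the number of values affected in each tail) and the data set are all assumptions made for the example. With ten values and `k = 1`, 10% is modified in each tail, which corresponds to an 80% Winsorization in the terms used above.

```python
from statistics import mean


def trim(values, k):
    """Trimming: discard the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]


def winsorize(values, k):
    """Winsorizing: replace the k smallest and k largest values with the
    nearest values that are kept, so the sample size is unchanged."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    ordered = sorted(values)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    return [min(max(v, low), high) for v in ordered]


# Hypothetical data with one extreme value in each tail.
data = [1, 5, 6, 7, 8, 9, 10, 11, 12, 100]

trimmed = trim(data, 1)          # 8 values remain
winsorized = winsorize(data, 1)  # still 10 values; 1 becomes 5 and 100 becomes 12

print(mean(data), mean(trimmed), mean(winsorized))
```

Trimming shortens the data set, while winsorizing keeps every observation but pulls the extremes in to the threshold values.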
Bootstrapping is any test or metric that relies on random sampling with replacement.
Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance,
confidence intervals, prediction error or some other such measure) to sample estimates. This
technique allows estimation of the sampling distribution of almost any statistic using random
sampling methods. Generally, it falls in the broader class of resampling methods.
For example, let’s say your sample was made up of ten numbers: 49, 34, 21, 18, 10, 8, 6, 5, 2,
1. You randomly draw three numbers 5, 1, and 49. You then replace those numbers into the
sample and draw three numbers again. Repeat the process of drawing x numbers B times.
Usually, original samples are much larger than this simple example, and B can reach into the
thousands. After a large number of iterations, the bootstrap statistics are compiled into a
bootstrap distribution. You’re replacing your numbers back into the pot, so your resamples
can have the same item repeated several times (e.g. 49 could appear a dozen times in a dozen
resamples).
Bootstrapping is loosely based on the law of large numbers, which states that if you sample
over and over again, your data should approximate the true population data. This works,
perhaps surprisingly, even when you’re using a single sample to generate the data.
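The resampling procedure described above can be sketched in a few lines. This is an illustrative sketch only: the function name, the choice of the mean as the statistic, the fixed seed and the rough percentile interval are assumptions, not part of the text. The sample is the ten numbers from the example; here each resample has the same size as the original sample, which is the usual choice, although `draw_size` can be set to 3 to mirror the example.

```python
import random
from statistics import mean


def bootstrap_means(sample, draw_size, n_resamples, seed=0):
    """Draw n_resamples resamples of draw_size values WITH replacement
    and return the mean of each resample (the bootstrap distribution)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [mean(rng.choices(sample, k=draw_size)) for _ in range(n_resamples)]


sample = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]

boot = sorted(bootstrap_means(sample, draw_size=len(sample), n_resamples=1000))

# Rough 95% percentile interval: cut off about 2.5% of the resample means in each tail.
lower, upper = boot[25], boot[974]

print(f"sample mean = {mean(sample)}, rough 95% interval = ({lower}, {upper})")
```

Because `random.choices` samples with replacement, the same value (for example 49) can appear several times within one resample, exactly as described above.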
PRESENTATION OF DATA
Data can be classified as grouped or ungrouped. Ungrouped data are data that are not
organized, or if arranged, could only be from highest to lowest or lowest to highest.
Grouped data are data that are organized and arranged into different classes or categories.
Presentation of data refers to the organization of data into tables, graphs or charts, so that
logical and statistical conclusions can be derived from the collected measurements.
Textual Presentation
Data can be presented using paragraphs or sentences. It involves enumerating important
characteristics, emphasizing significant figures and identifying important features of data.
The data gathered are presented in paragraph form.
Data are written and read.
It is a combination of texts and figures.
Example:
Of the 150 people sampled and interviewed, the following complaints were noted: 27 for lack of
books in the library, 25 for a dirty playground, 20 for lack of laboratory equipment, and 17 for
a poorly maintained university building.
Example.
You are asked to present the performance of your section in the Statistics test. The following
are the test scores of your class:
34 42 20 50 17 9 34 43
50 18 35 43 50 23 23 35
37 38 38 39 39 38 38 39
24 29 25 26 28 27 44 44
49 48 46 45 45 46 45 46
Solution: First, arrange the data in order so that you can identify the important
characteristics. This can be done in two ways: rearranging from lowest to highest or using the
stem-and-leaf plot.
Below is the rearrangement of data from lowest to highest (read across each row):
9 17 18 20 23 23 24 25
26 27 28 29 34 34 35 35
37 38 38 38 38 39 39 39
42 43 43 44 44 45 45 45
46 46 46 48 49 50 50 50
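OR, as a stem-and-leaf plot (stem = tens digit, leaf = units digit):
0 | 9
1 | 7 8
2 | 0 3 3 4 5 6 7 8 9
3 | 4 4 5 5 7 8 8 8 8 9 9 9
4 | 2 3 3 4 4 5 5 5 6 6 6 8 9
5 | 0 0 0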
In the Statistics class of 40 students, 3 obtained the perfect score of 50. Sixteen students got a
score of 40 and above, while only 3 got 19 and below. Generally, the students performed
well in the test, with 23 (57.5%) getting a passing score of 38 and above.
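The figures quoted in a textual presentation can be checked directly from the raw scores. The sketch below uses the 40 test scores listed in the example; the helper function `stem_and_leaf` and the variable names are illustrative choices, and the passing score of 38 is taken from the text.

```python
from collections import defaultdict

scores = [
    34, 42, 20, 50, 17, 9, 34, 43,
    50, 18, 35, 43, 50, 23, 23, 35,
    37, 38, 38, 39, 39, 38, 38, 39,
    24, 29, 25, 26, 28, 27, 44, 44,
    49, 48, 46, 45, 45, 46, 45, 46,
]


def stem_and_leaf(values):
    """Group sorted values by tens digit (stem) with units digits as leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


n = len(scores)
perfect = sum(s == 50 for s in scores)        # number with the perfect score
forty_up = sum(s >= 40 for s in scores)       # 40 and above
nineteen_down = sum(s <= 19 for s in scores)  # 19 and below
passed = sum(s >= 38 for s in scores)         # passing score of 38 and above
pass_pct = 100 * passed / n

for stem, leaves in stem_and_leaf(scores).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
print(n, perfect, forty_up, nineteen_down, passed, pass_pct)
```

Computing the counts from the data, instead of by eye, guards against quoting a percentage that does not match the count it accompanies.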
Bar Chart/Graph
Data from nominal scales (categorical data) are represented in graphic form
with the use of bar graphs. Bar graphs give a pictorial description of the data and emphasize
how groups compare with one another. They are used to compare the sizes of the various
parts. The height of the bars is the basis for the comparisons and not the area of the bars.
Data are presented in the form of rectangular bars of equal breadth. Each bar represents one
variant/attribute. A suitable scale should be indicated, and the scale should start from zero.
The width of the bars and the gaps between the bars should be equal throughout. The length of
each bar is proportional to the magnitude/frequency of the variable. Bar graphs are either
column (vertical) or horizontal. Column graphs are more popular in education. Column bar graphs
are simple, compound (multiple) or component. Examples are shown below.
Figure 2 is a compound column bar graph showing school enrolment at Ayeduase Basic
School by gender.
3. Construct equally wide and equally spaced bars for each category with the height of the
bar being the value/score for the category on the horizontal axis, which has the names of
the categories as the label.
4. Where computer software such as Microsoft Excel and SPSS are not available, it is
recommended that graph sheets be used.
5. Shade/colour the bars to differentiate bars and components.
Uses
Teachers can use bar graphs in several ways. Enrolment by classes, courses and subjects
and inter-house competitions can be represented by bar graphs.
Pie Chart
Pie charts use nominal or categorical data. Pie charts are represented in the form of a circle
of 360° sliced into sectors in the shape of ‘pies’. Each sector is cut at an angle at the centre
of the circle. The angle corresponds to the data for each category or group. Pie charts give a
pictorial view of the contributions of the parts that make up a whole. An example is shown below.
Figure 3: Performance in Jackson SRC games in Ashanti Region
Constructing pie charts
1. Calculate the degree equivalents for the value of each category/group by dividing the
total points for each group by the overall total points and multiplying the result by 360°.
For example, for Louis centre above we have (120/600) × 360° = 72°, and for KASS centre
we have (100/600) × 360° = 60°.
2. Use a pair of compasses and a protractor to draw the circle and the sectors based on the
degrees calculated.
3. Shade/Colour the sectors to differentiate one from the other.
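Step 1 above can be sketched as a small function. The points for Louis centre (120) and KASS centre (100) and the overall total of 600 come from the worked example; the remaining 380 points are grouped under a hypothetical "Other centres" label because the original figure is not reproduced here.

```python
def pie_angles(points):
    """Convert each category's points into the angle (in degrees) of its sector."""
    total = sum(points.values())
    if total <= 0:
        raise ValueError("total points must be positive")
    return {name: 360 * value / total for name, value in points.items()}


points = {
    "Louis": 120,
    "KASS": 100,
    "Other centres": 380,   # hypothetical remainder so that the total is 600
}

angles = pie_angles(points)
for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

A quick check that the angles add up to 360° is a useful safeguard before drawing the sectors with a protractor.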
Uses
Pie charts can be used by teachers and educational practitioners for examination results
by the number of passes in various subjects, school enrolment by class, form or subjects.
Line graphs
Line graphs are best suited to data that are related to time. Time could be days, weeks,
months and years. Line graphs show changes in the data over a period of time. Data
from interval and ratio scales are most appropriate. Line graphs could be simple or
compound. Simple line graphs give a pictorial description of the data. Compound line
graphs compare group data over a period of time.
Sep 40 50
Oct 30 35
Nov 50 50
Dec 90 60
Uses
Teachers and educational practitioners can use line graphs in several ways. Examination
results over a period of years in a subject, total school enrolment as well as enrolment by
subjects and courses for a period of time can be represented by line graphs.
Tabulation
Tables are devices that are used to present data in a simple form. Tabulation is probably the
first step before the data are used for analysis or interpretation.
Types of tables
1) Simple tables: Measurements of a single set are presented
2) Complex tables: Measurements of multiple sets are presented
Simple Table
When characteristics with values are presented in the form of a table, it is known as a
simple table e.g. Table Infant mortality rate of selected countries in 2004.
In the frequency distribution table, the data is first split up into convenient groups (class
interval) and the number of items (frequency) which occur in each group is shown in adjacent
columns. Hence it is a table showing the frequency with which the values are distributed in
different groups or classes with some defined characteristics.
5) The base or source of data should be mentioned with the pattern of analysis in the footnote
at the end of the table
Example of grouped, relative, and cumulative frequency distributions of serum cholesterol
levels in 200 men.
Features
1. Class. A group of scores.
2. Class interval. The range within which a group of scores lie. It has a number at the
beginning and at the end. E.g. 90- 95.
3. Unequal Class Interval. These result where there are differences in the range of the
intervals. E.g. 91 – 95, 91 - 100
4. Open-ended classes. These are classes with a value at only the beginning or only the end,
e.g. 90 and above, 45 and below, below 46, above 90.
5. Class limits. The endpoints of a class interval. The smaller number is the lower limit
and the bigger number is the upper limit.
6. Class boundaries. The exact or real limits of a class interval. The lower-class
boundaries are obtained by subtracting 0.5 from the lower-class limit. The upper-class
boundaries are obtained by adding 0.5 to the upper-class limits. A class interval with
limits of 91 – 95 produces class boundaries of 90.5 - 95.5
7. Class size/class width. The number of distinct/discrete scores within a class interval.
They are obtained by finding the difference between successive lower-class limits or
upper-class limits in cases of equal class intervals. They can also be obtained by finding
the difference between successive class marks in cases of equal class intervals or between
class boundaries for each interval.
8. Class mark: The midpoint for each class interval.
9. Frequency: The number of distinct scores from the given data that can be found in a
class interval.
10. Cumulative frequency. The successive sum of the frequencies starting from the
frequency of the bottom class.
11. Cumulative percentage frequency. The successive sum of the percentage frequencies
starting from the percentage frequency of the bottom class. It is also obtained by
expressing each cumulative frequency as a percentage.
12. Relative frequency. It is obtained by dividing each frequency by the total frequency.
13. Cumulative relative frequency. The successive sum of the relative frequencies starting
from the relative frequency of the bottom class.
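Several of the features listed above can be computed together. The sketch below builds a grouped frequency table from the 40 Statistics test scores used earlier in this unit; the class limits 1-10, 11-20, ..., 41-50 are an assumed grouping chosen for illustration, and the dictionary keys are hypothetical names.

```python
scores = [
    34, 42, 20, 50, 17, 9, 34, 43,
    50, 18, 35, 43, 50, 23, 23, 35,
    37, 38, 38, 39, 39, 38, 38, 39,
    24, 29, 25, 26, 28, 27, 44, 44,
    49, 48, 46, 45, 45, 46, 45, 46,
]

# Assumed class limits (lower limit, upper limit), each of width 10.
class_limits = [(1, 10), (11, 20), (21, 30), (31, 40), (41, 50)]

n = len(scores)
table = []
cumulative = 0
for lower, upper in class_limits:
    freq = sum(lower <= s <= upper for s in scores)
    cumulative += freq                               # running total from the bottom class
    table.append({
        "limits": (lower, upper),
        "boundaries": (lower - 0.5, upper + 0.5),    # real limits of the class
        "mark": (lower + upper) / 2,                 # class midpoint
        "freq": freq,
        "rel_freq": freq / n,
        "cum_freq": cumulative,
        "cum_pct": 100 * cumulative / n,
    })

for row in table:
    print(row)
```

The class width can be read off as the difference between successive lower limits (11 - 1 = 10) or between the boundaries of one class (10.5 - 0.5 = 10), as described in feature 7.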
5. Aim at classes with equal sizes or width. This facilitates the interpretation of the
information from the frequency distribution.
6. The number of classes should not be too small (i.e. not less than 5) and not too large (i.e.
not more than 20). Where the number of classes is less than 5, class size should be
reduced but when the number of classes is more than 20, the class size should be
increased.
Histogram
Histograms use data from the ratio or interval scale and depend on frequency distributions. A
histogram uses the classes and the frequencies from the frequency distribution table. It is used
for quantitative, continuous variables, that is, variables which have no gaps, e.g. age,
weight, height, blood pressure, blood sugar etc. It consists of a series of blocks. The class
intervals are given along the horizontal axis and the frequency along the vertical axis. An
example is shown below.
To construct a histogram
1. Draw two axes, a vertical and horizontal. Label the vertical axis by frequency and the
horizontal axis scores/classes.
2. Select an appropriate scale on the vertical axis considering the highest/largest value.
When using a graph sheet, the scale should be such that the bars are not too tall nor too
short.
3. Use class midpoints/marks or class boundaries or class limits to label the points on the
horizontal axis.
4. Draw bars of equal width representing the classes from a frequency distribution table, with
corresponding heights as the frequencies.
Importance
1. It gives a pictorial description of the raw data, providing information about the nature of
the data.
2. It gives the direction of performance in terms of academic performance (i.e. skewness).
[Figure: two histograms, each plotting frequency (Freq, 0 to 40) against Classes (0 to 30)]
3. It provides an estimate of the most typical score. This is the intersection of the two
diagonals of the tallest bar.
Frequency Polygon
Frequency polygon uses data from ratio or interval scales and depends on frequency
distributions. It uses the classes and the frequencies from the frequency distribution table.
An example is shown below.
[Figure: frequency polygon, frequency against classes]
To construct a frequency polygon
1. Draw two axes, a vertical and horizontal. Label the vertical axis by frequency and the
horizontal axis scores/classes.
2. Select an appropriate scale on the vertical axis considering the highest/largest value.
When using a graph sheet, the scale should be such that the polygon is not too pointed or
too short.
3. Use class midpoints/marks or class boundaries or class limits to label the points on the
horizontal axis.
4. Above the midpoint of each class (the midpoint of the top of each histogram bar), plot a point
at a height equal to the class frequency. Join the points with straight lines.
5. Where the line has not touched the horizontal axis, extend the line one class in that
direction so that the polygon touches the horizontal axis.
Importance
1. It gives a pictorial description of the raw data, providing information about the nature of
the data.
2. It provides an estimate of the most typical score. This is the point on the horizontal axis
where the highest point of the polygon is located.
3. It is used to compare the performance of groups. E.g. Performance in a class test for
Forms 1 and 2 can be shown as follows.
[Figure: frequency polygons for Form 1 and Form 2 drawn on the same axes]
The diagram shows that Form 2 class, which is more to the right, performs better. The most
typical scores, where the highest point of the polygon is located can be used to confirm the
comparisons. Where the total frequencies are not the same, use relative frequencies in place
of the actual frequencies to draw the polygon.
Cumulative Percentage Frequency Polygon (Ogive)
Ogives are drawn from frequency distribution tables. Data from ratio or interval
scales are most appropriate.
Plot the graph using the upper-class boundaries of each class against the cumulative
percentage frequencies.
[Figure: ogive, cumulative percentage frequency (0 to 100) against classes (0 to 80)]
To construct an ogive,
1. Obtain cumulative percentage frequencies.
2. Mark the cumulative percentage frequencies on the vertical scale. Choose
appropriate scales, on a graph sheet, such that the ogive is not distorted.
3. Label the horizontal axis as scores or classes.
4. Plot at the upper-class boundary of each class the relevant value of the cumulative
percentage frequency. Join the points with straight lines.
5. Extend the line one class to the left so that the ogive touches the horizontal axis.
Importance
1. It is used for comparisons of distributions of performance especially for distributions
where the class/group sizes are not the same. Generally, the graph that moves more to the
right has better performance. The median score, read off at a cumulative percentage frequency of
50, is also used.
Given the following performances in a test, draw two ogives. Which school performed better?

              School A                    School B
Classes       Frequency   Cum. % Freq     Frequency   Cum. % Freq
91 – 100          1          100              7          100
81 – 90           2           99             17           95.3
71 – 80          11           97             30           84
61 – 70          24           86             25           64
51 – 60          20           62             15           47.3
41 – 50          16           42             11           37.3
31 – 40          12           26             19           30
21 – 30           8           14             14           17.3
11 – 20           4            6              6            8
 1 – 10           2            2              6            4
Total           100                         150
2. It is used to determine percentiles and percentile ranks. Later in the course, you will learn
how to obtain the percentiles and percentile ranks
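The cumulative percentage columns in the School A and School B table above can be checked with a short script. This is a minimal sketch, not part of the original notes: the function name `cumulative_percentages` is made up for illustration, and the upper class boundaries (10.5, 20.5, ..., 100.5) are an assumption based on the class limits 1 – 10 up to 91 – 100.

```python
# Frequencies for School A and School B, listed from the lowest class
# (1 - 10) up to the highest class (91 - 100), as in the table above.
school_a = [2, 4, 8, 12, 16, 20, 24, 11, 2, 1]
school_b = [6, 6, 14, 19, 11, 15, 25, 30, 17, 7]

# Assumed upper class boundaries of the ten classes: 10.5, 20.5, ..., 100.5.
upper_boundaries = [10.5 + 10 * k for k in range(10)]


def cumulative_percentages(frequencies):
    """Return the cumulative percentage frequency for each class."""
    total = sum(frequencies)
    running = 0
    result = []
    for f in frequencies:
        running += f
        result.append(round(100 * running / total, 1))
    return result


cum_a = cumulative_percentages(school_a)
cum_b = cumulative_percentages(school_b)

# Points to plot for each ogive: (upper class boundary, cumulative %).
for boundary, a, b in zip(upper_boundaries, cum_a, cum_b):
    print(boundary, a, b)
```

Because both schools are expressed on the same 0 to 100 scale, the two ogives can be compared directly even though the schools have different numbers of candidates (100 and 150).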
A box and whisker plot is drawn below. Later in the course, you will learn how to obtain the
percentiles and quartiles.
[Figure: box and whisker plot marking P10, Q1, Q2, Q3 and P90]
An example.
Assume that the following values were obtained for two classes, Form 2A and Form 2B, in a
class test in Mathematics.
            P10    Q2    P90
Form 2A     15     42    73
Form 2B     29     56    93
The information is presented below by two box and whisker plots.
[Figure: box and whisker plots for Form 2A and Form 2B on a 0 to 100 score scale, each marking P10, Q1, Q2, Q3 and P90]
It can be observed that P10, Q1, Q2, Q3, and P90 values are greater in Form 2B than in Form
2A. This means that performance is better in Form 2B than in Form 2A.
Also, note that the graph for Form 2B has moved more to the right towards higher values than
that of Form 2A.
UNIT THREE
APPLICATION OF THE CENTRE OF A DISTRIBUTION
We have discussed some interesting features of a quantitative data set and learned how to
look for them in pictures (graphs).
The description of statistical data may be quite elaborate or quite brief depending on two
factors: the nature of data and the purpose for which the same data have been collected.
While describing data statistically or verbally, one must ensure that the description is neither
too brief nor too lengthy. The measures of central tendency enable us to compare two or more
distributions pertaining to the same time period or within the same distribution over time.
What is “central tendency,” and why do we want to know the central tendency of a group of
scores? Let us first try to answer these questions intuitively. Then we will proceed to a more
formal discussion.
ACTIVITY
Imagine this situation: You are in a class with just four other students, and the five of you
took a 5-point quiz. Today your tutor is walking around the room, handing back the quizzes.
He/she stops at your desk and hands you your paper. Written in bold black ink on the front is
“3/5.”
Are you happy with your score of 3 or disappointed? How do you decide? You might
calculate your percentage correct, realize it is 60%, and be appalled. But it is more likely that
when deciding how to react to your performance, you will want additional information. What
additional information would you like? If you are like most students, you will immediately
ask your classmates, “What did you get?” and then ask the tutor, “How did the class do?” In
other words, the additional information you want is how your quiz score compares to other
students' scores.
You, therefore, understand the importance of comparing your score to the class distribution
of scores. Should your score of 3 turn out to be among the higher scores, then you'll be
pleased after all. On the other hand, if 3 is among the lower scores in the class, you won't be
quite so happy. This idea of comparing individual scores to a distribution of scores is
fundamental to statistics. So let's explore it further by reading about the three different ways
of defining the centre of a distribution. All three are called measures of central tendency.
These measures are also called Averages. They provide single values which are used to
summarise a set of observations/data. The three main measures are the Mean, Median
and Mode.
1. They are used as single scores to describe data.
2. They help to know the level of performance by comparing with a given standard of
performance. Performance may be above average or below average where the average is
a standard such as the mean or median.
3. They give the direction of student performance.
Where Mean >Median, the distribution is skewed to the right (positive skewness) showing
that performance tends to be low.
Where Mean < Median, the distribution is skewed to the left (negative skewness) showing
that performance tends to be high.
Illustration
THE MEAN ($\bar{X}$)
There are three types. These are Arithmetic, Geometric and Harmonic. In Education, the
Arithmetic mean is the most useful.
The Arithmetic Mean.
Methods
The Arithmetic Mean ($\bar{X}$) can be obtained from both the ungrouped and grouped data. It
can also be easily obtained from Microsoft Excel.
1. Ungrouped data
Given the following scores, 15, 12, 10, 10, 9, 20, 14, 11, 13, 16, to obtain the mean, all the
scores are added and divided by the total number of observations. The mean is represented
by the symbol $\bar{X}$.
$$\bar{X} = \frac{15 + 12 + 10 + 10 + 9 + 20 + 14 + 11 + 13 + 16}{10} = \frac{130}{10} = 13$$
Generally, the letter, X, is used with a subscript to differentiate the numbers as follows.
E.g. 15, 12, 10, 10, 9, 20, 14, 11, 13, 16
X1, X2, X3, X4, X5, X6, X7, X8, X9, and X10
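The ungrouped calculation above can be reproduced in a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of the notes.

```python
from statistics import mean

# The ten scores X1 ... X10 from the example above.
scores = [15, 12, 10, 10, 9, 20, 14, 11, 13, 16]

total = sum(scores)      # sum of all the scores
n = len(scores)          # total number of observations
x_bar = total / n        # arithmetic mean

print(total, n, x_bar)
print(mean(scores))      # the standard library gives the same value
```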
2. Grouped data
Two methods can be used. These are the long method and the coding method. The methods
are used with frequency distributions.
Long method: $\bar{X} = \frac{\sum fx}{n}$ OR $\bar{X} = \frac{\sum fx}{N}$, where f is the frequency and x, the class marks.
Long method: $\bar{X} = \frac{\sum fx}{n} = \frac{1665}{50} = 33.3$
Coding method: $\bar{X} = AM + \left(\frac{\sum fd}{n}\right) i$, which is used for distributions with equal class intervals.
AM is the assumed mean, f is the frequency, d is the code for each class, n is the total
frequency and i, the class size.
To use the coding method, class intervals must have the same size. The class in the
middle or the class with the highest frequency is chosen for the code of 0. Classes above the
zero coded class are given positive codes and those below are given negative codes in steps
of 1.
Coding method: $\bar{X} = AM + \left(\frac{\sum fd}{n}\right) i = 33 + \left(\frac{3}{50}\right) 5 = 33 + \frac{15}{50} = 33.3$
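Both grouped-data methods can be checked numerically. The sketch below assumes the frequency distribution of 50 scores tabulated later in these notes (classes 16 – 20 up to 46 – 50, class marks 18 to 48), which reproduces the totals 1665 and 50 used in the worked example; the variable names are illustrative only.

```python
# Class marks (midpoints) and frequencies, lowest class (16 - 20) first.
class_marks = [18, 23, 28, 33, 38, 43, 48]
frequencies = [3, 7, 8, 12, 10, 6, 4]
class_size = 5        # i, the common class width
assumed_mean = 33     # AM, the class mark of the class given the code 0

n = sum(frequencies)

# Long method: mean = sum(f * x) / n
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))
mean_long = sum_fx / n

# Coding method: each class gets a code d in steps of 1 around the class
# coded 0 (negative for lower classes, positive for higher classes).
codes = [(x - assumed_mean) // class_size for x in class_marks]
sum_fd = sum(f * d for f, d in zip(frequencies, codes))
mean_coded = assumed_mean + (sum_fd / n) * class_size

print(sum_fx, n, round(mean_long, 1))
print(codes, sum_fd, round(mean_coded, 1))
```

The two methods agree because the codes are just the class marks shifted by AM and divided by the class size, which is why the coding method needs equal class intervals.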
OPTIONAL
Using Microsoft Excel
1. Open Excel
2. Type in data to be used in one column, if data is not yet entered.
3. Click an empty cell where you want the result to be and type in Mean.
4. Click the empty cell directly below where you typed Mean.
5. Click white space to the right of the fx symbol.
6. Type in =AVERAGEA(cell number where data begins from: cell number where
data ends at). E.g. =AVERAGEA(B2:B32). This means that data begins at cell B2
and ends at cell B32.
7. Press Enter. The mean is given in the empty cell clicked.
Properties of the Mean
1. The mean is influenced by every score or value that makes it up. If a score is changed,
the value of the mean changes.
3, 4, 2, 4, 7 Mean = 4
3, 4, 7, 4, 7 Mean = 5. The change of the score 2 to 7 has changed the mean to 5.
3. The mean is a function of the sum (or aggregate or total) of the scores.
$\bar{X} = \frac{\sum X}{N}$
$N\bar{X} = \sum X$. This implies that the number of observations multiplied
by the mean gives the sum of the scores.
Of the three measures it is the only one that is a function of the sum of the scores.
It is also possible to calculate the mean for a combined group if only the means and number
of scores (N) are available.
4. If the mean is subtracted from each individual score and the differences are summed, the
result is 0.
4 – 4 = 0
2 – 4 = -2
3 – 4 = -1
6 – 4 = 2
5 – 4 = 1
Sum of the differences = 0 + (-2) + (-1) + 2 + 1 = 0
The distance of the score from the mean is known as the deviation.
5. If the same value is added to or subtracted from every number in a set of scores, the mean
goes up or goes down by the value of the number.
For example, given 8, 2, 10, 4, $\bar{X} = 6$.
Now add 2 to each score: 10, 4, 12, 6, $\bar{X} = 8$, i.e. 6 + 2.
6. If each score is multiplied or divided by the same value, the mean is multiplied or divided
by that value.
For example, given 8, 2, 10, 4, $\bar{X} = 6$.
Now multiply each score by 3: 24, 6, 30, 12, $\bar{X} = 18$, i.e. 6 × 3.
THE MEDIAN
For an odd set of numbers, the median occupies the $\frac{n+1}{2}$th position.
For an even set of numbers, find the mean of the two middle numbers or the number at the
$\frac{n+1}{2}$th position.
The median can be obtained from both ungrouped and grouped data and also from Microsoft
Excel.
2. If the number of observations, n, is odd, the median is the number at the centre or the
number at the $\frac{n+1}{2}$th position.
3. If the number of observations, n, is even, the median is the mean of the two centre
observations.
Examples
3. The score at the 5th position is 50 and at the 6th position is 54. Half-way between 50
and 54 is $\frac{50 + 54}{2} = \frac{104}{2} = 52$. The median is therefore 52.
Step 1. Identify the median class. It is the class that will contain the middle score. Find the
value of $\frac{N}{2}$, where N is the total frequency. This is the position of the middle score. Checking
from the cumulative frequency column, find the number equal to the position or the smallest
number that is greater than the position. From the table above, $\frac{N}{2} = \frac{50}{2} = 25$, therefore the
number is 30. The class that this number belongs to is the median class. From the table
above, the median class is 31 – 35.
Step 2. Use the formula below to obtain the Median.
$$\text{Mdn} = L_1 + \left(\frac{\frac{N}{2} - cf}{f_{mdn}}\right) i$$ where
$L_1$ is the lower-class boundary of the median class
N is the total frequency
cf is the cumulative frequency of the class just below the median class
i is the class size/width
$f_{mdn}$ is the frequency of the median class
$$\text{Mdn} = 30.5 + \left(\frac{\frac{50}{2} - 18}{12}\right) 5 = 30.5 + \left(\frac{25 - 18}{12}\right) 5 = 30.5 + \left(\frac{7}{12}\right) 5 = 30.5 + (0.58)5 = 30.5 + 2.9 = 33.4$$
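The two steps for the grouped median can be sketched as a small function. This is an illustration under stated assumptions, not the notes' own procedure in code: the function name `grouped_median` is hypothetical, the lower class boundary is taken as the lower class limit minus 0.5, and the data are the 50-score frequency distribution tabulated in the quartile deviation section (median class 31 – 35, frequency 12, cumulative frequency below it 18).

```python
def grouped_median(lower_limits, frequencies, class_size):
    """Median of a frequency distribution with equal class widths.

    lower_limits and frequencies are listed from the lowest class upward.
    The lower class boundary is assumed to be the lower limit minus 0.5,
    which fits whole-number classes such as 31 - 35.
    """
    n = sum(frequencies)
    position = n / 2                    # position of the middle score
    cumulative = 0                      # cf of the classes below the current one
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= position:  # first class whose cum. freq. reaches N/2
            boundary = lower - 0.5
            return boundary + (position - cumulative) / f * class_size
        cumulative += f
    raise ValueError("the distribution has no observations")


lower_limits = [16, 21, 26, 31, 36, 41, 46]
frequencies = [3, 7, 8, 12, 10, 6, 4]
median = grouped_median(lower_limits, frequencies, class_size=5)
print(round(median, 1))
```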
OPTIONAL
Features of the median
1. It is not influenced by extreme scores. For example, the median for the following
numbers, 2, 3, 4, 5, 6 is 4. If 6 changes to 23 as an extreme score, the median remains 4.
2. It does not use all the scores in a distribution but uses only one value.
3. It has limited use for further statistical work.
4. It can be used when there is incomplete data at the beginning or end of the distribution.
5. It is mostly appropriate for data from interval and ratio scales.
6. Where there are very few observations, the median is not representative of the data.
7. Where the data set is large, it is tedious to arrange the data in an array for ungrouped data
computation of the median.
THE MODE
It is the number that occurs most frequently in a distribution.
Given the following scores, 1, 2, 4, 6, 4, 6, 7, 2, 4 the number that occurs most frequently is 4.
This is the Mode. This number appears 3 times.
Given the following scores, 11, 22, 14, 26, 34, 6, 27, 12, 40, every number occurs only once,
so no number occurs most frequently. There is, therefore, no mode.
1. The main advantage is that it is the only measure that is useful for a nominal scale.
2. It is used when there is a need for a rough estimate of the measure of location.
3. It is used when there is the need to know the most frequently occurring value e.g. dress
styles.
4. It is not useful for further statistical work because the distribution can be bi-modal or
tri-modal or have no mode at all.
UNIT FOUR
MEASURES OF VARIABILITY
In the previous chapter, we have explained the measures of central tendency. It may be noted
that these measures do not indicate the extent of dispersion or variability in a distribution.
The dispersion or variability provides us one more step in increasing our understanding of the
pattern of the data. Further, a high degree of uniformity (i.e. low degree of dispersion) is a
desirable quality. If in education there is a high degree of variability in the examination scores, then
it can be assumed that performance is not uniform.
WHAT IS DISPERSION?
It is clear from above that dispersion (also known as scatter, spread or variation) measures the
extent to which the items vary from some central value. Since measures of dispersion give an
average of the differences of various items from an average, they are also called averages of
the second order. An average is more meaningful when it is examined in the light of
dispersion.
1. The range
2. The Variance
3. The Standard Deviation
4. The Quartile Deviation (Semi-interquartile range)
They are used as single scores to describe individual differences in terms of achievement.
For example: 48, 51, 47, 50 Total = 196 Mean = 49 …..(i)
30, 72, 90, 4 Total = 196 Mean = 49 …..(ii)
However, a closer look at the two sets of data shows that the distribution within each set is
not the same. Where the scores cluster around the mean, performance is said to be
homogeneous as in (i). Where the scores move away from the mean, performance is said to
be heterogeneous as in (ii).
THE RANGE
It is the difference between the highest and the lowest values in a set of data.
e.g.: 48, 51, 47, 50 Total = 196 Mean = 49 …..(i) Range: 51 – 47 = 4
30, 72, 90, 4 Total = 196 Mean = 49 …..(ii) Range: 90 – 4 = 86
Features
1. It is easy to compute.
2. It is easy to interpret.
3. It is a crude measure of dispersion and does not take into account all the data/scores.
4. It ignores the spread of all the scores.
5. It uses only two values and does not consider how the other scores relate to each other.
6. The range does not consider the typical observations in the distribution but concentrates
only on the extreme values.
7. It can give a distorted picture of the variation within a set of data.
8. Different distributions can have the same range which would give misleading conclusions.
Uses
1. When data is too scanty or too scattered to justify the computation of a more precise
measure.
2. When knowledge of extreme scores or total spread is all that is needed.
Ungrouped data
This is based on raw data. It is computed by using the following formulae.
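Variance ($S^2$)
1. $\text{Var}(S^2) = \frac{\sum (X - \bar{X})^2}{n}$
2. $\text{Var}(S^2) = \frac{\sum X^2}{n} - \bar{X}^2$
3. $\text{Var}(S^2) = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2$
Standard deviation: $\text{Std. Dev}(S) = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}$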
Given a set of data as 48 51 50 47 and the mean of the distribution as 49, the variance and the
standard deviation could be computed as follows:
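$\bar{X} = \frac{196}{4} = 49$

X        $X - \bar{X}$    $(X - \bar{X})^2$    $X^2$
48          -1                 1              2304
51           2                 4              2601
47          -2                 4              2209
50           1                 1              2500
Total                         10              9614

$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{10}{4}} = \sqrt{2.5} = 1.58$, Variance = $1.58^2$ = 2.5
OR
$SD = \sqrt{\frac{\sum X^2}{n} - \bar{X}^2} = \sqrt{\frac{9614}{4} - 49^2} = \sqrt{2403.5 - 2401.0} = \sqrt{2.5} = 1.58$, Var. = $1.58^2$ = 2.5
OR
$SD = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2} = \sqrt{\frac{9614}{4} - \left(\frac{196}{4}\right)^2} = \sqrt{2403.5 - 2401.0} = \sqrt{2.5} = 1.58$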
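The three variance formulas give the same answer for the scores 48, 51, 47 and 50, which can be confirmed with a short script. This is a sketch for checking the arithmetic; it divides by n (the population form used in these notes), and the variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev, pvariance

scores = [48, 51, 47, 50]
n = len(scores)
mean = sum(scores) / n

# Formula 1: mean of the squared deviations from the mean.
var_1 = sum((x - mean) ** 2 for x in scores) / n
# Formula 2: mean of the squared scores minus the square of the mean.
var_2 = sum(x ** 2 for x in scores) / n - mean ** 2
# Formula 3: the same, with the mean written as (sum of X) / n.
var_3 = sum(x ** 2 for x in scores) / n - (sum(scores) / n) ** 2

sd = sqrt(var_1)
print(var_1, var_2, var_3, round(sd, 2))

# The standard library's population variance and standard deviation
# also divide by n, so they match the hand calculation.
print(pvariance(scores), round(pstdev(scores), 2))
```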
Grouped data:
This is based on a frequency distribution of the scores.
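Long method: $SD = \sqrt{\frac{\sum f(X - \bar{X})^2}{n}}$, $\text{Var} = \frac{\sum f(X - \bar{X})^2}{n}$
Short method: $SD = \sqrt{\frac{\sum fX^2}{n} - \left(\frac{\sum fX}{n}\right)^2}$, $\text{Var} = \frac{\sum fX^2}{n} - \left(\frac{\sum fX}{n}\right)^2$
Coding method: $SD = i\sqrt{\frac{\sum fd^2}{n} - \left(\frac{\sum fd}{n}\right)^2}$. This is useful with equal class intervals.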
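Short method:
$SD = \sqrt{\frac{\sum fX^2}{n} - \left(\frac{\sum fX}{n}\right)^2} = \sqrt{\frac{58765}{50} - \left(\frac{1665}{50}\right)^2} = \sqrt{1175.3 - 1108.89} = 8.15$
Coding method:
$SD = i\sqrt{\frac{\sum fd^2}{n} - \left(\frac{\sum fd}{n}\right)^2} = 5\sqrt{\frac{133}{50} - \left(\frac{3}{50}\right)^2} = 5\sqrt{2.66 - 0.0036} = 5(1.63) = 8.15$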
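The short and coding methods for the grouped standard deviation can be compared in code. The sketch below assumes the 50-score frequency distribution tabulated in the quartile deviation section (class marks 18 to 48, class size 5) and an assumed mean of 33; these choices reproduce the totals 58765, 1665, 133 and 3 that appear in the worked example.

```python
from math import sqrt

# Class marks and frequencies, lowest class (16 - 20) first.
class_marks = [18, 23, 28, 33, 38, 43, 48]
frequencies = [3, 7, 8, 12, 10, 6, 4]
class_size = 5        # i
assumed_mean = 33     # class mark of the class coded 0
n = sum(frequencies)

# Short method: SD = sqrt(sum(f X^2)/n - (sum(f X)/n)^2)
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_marks))
sd_short = sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

# Coding method: SD = i * sqrt(sum(f d^2)/n - (sum(f d)/n)^2)
codes = [(x - assumed_mean) // class_size for x in class_marks]
sum_fd = sum(f * d for f, d in zip(frequencies, codes))
sum_fd2 = sum(f * d * d for f, d in zip(frequencies, codes))
sd_coded = class_size * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

print(sum_fx2, sum_fx, round(sd_short, 2))
print(sum_fd2, sum_fd, round(sd_coded, 2))
```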
Standard Deviation
1. Open Excel
2. Type in data to be used in one column, if data is not yet entered.
3. Click an empty cell where you want the result to be and type in Std. Dev.
4. Click the empty cell directly below where you typed Std. Dev.
5. Click white space to the right of the fx symbol.
6. Type in =STDEVPA(cell number where data begins from: cell number where data
ends at). E.g. =STDEVPA(B2:B32). This means that data begins at cell B2 and ends
at cell B32.
7. Press Enter. The standard deviation is given in the empty cell clicked.
Variance
1. Open Excel
2. Type in data to be used in one column, if data is not yet entered.
3. Click an empty cell where you want the result to be and type in Variance.
4. Click the empty cell directly below where you typed Variance.
5. Click white space to the right of the fx symbol.
6. Type in =VARPA(cell number where data begins from: cell number where data
ends at). E.g. =VARPA(B2:B32). This means that data begins at cell B2 and ends at
cell B32.
7. Press Enter. The variance is given in the empty cell clicked.
An example is below:
standard deviation of 2. If each score is multiplied by 10 points to obtain 10 20 30 40 50,
the standard deviation becomes 2 x 10 = 20 and the variance becomes 10² x 4 = 400.
5. It uses every value in the distribution.
6. It is difficult to calculate for open-ended distributions.
7. It is affected by extreme values. It gives more weight to extreme values.
USES
1. It is used as the most appropriate measure of variation/dispersion when there is reason to
believe that the distribution is normal.
2. It helps to find out the variation in achievement among a group of students. i.e. it
determines if a group is homogeneous or heterogeneous.
Where the standard deviation is relatively small, the group is believed to be homogeneous
i.e. performing at about the same level. On the other hand, where the standard
deviation is relatively large, the group is believed to be heterogeneous, i.e. performing at
different levels.
To be more precise, the coefficient of variation (CV) is computed.
CV = $\frac{s}{\bar{X}} \times 100$. If the value of CV is greater than 33, the group is heterogeneous,
otherwise it is homogeneous.
With this information, the teacher has to adopt a teaching method to suit each group.
3. It is helpful in computing other statistics e.g. standard scores, correlation coefficients.
4. It is useful in determining the reliability of test scores. The split-half correlation method
or internal consistency methods use the standard deviation of the scores.
In most score interpretations in education and for descriptive statistics, the standard
deviation is preferred to variance because
1. the standard deviation (S), is the natural measure of spread or variation for normal
distributions
2. the variance (S2) involves squaring the deviations and does not have the same unit of
measurement as the original observations.
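The coefficient of variation rule described above (CV greater than 33 means heterogeneous) can be tried on the two sets of scores used earlier in this unit, 48, 51, 47, 50 and 30, 72, 90, 4. This is a hedged sketch: the function name `coefficient_of_variation` is illustrative, and the standard deviation is the population form (dividing by n) used in these notes.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """CV = (standard deviation / mean) x 100, using the population SD."""
    return pstdev(scores) / mean(scores) * 100


group_i = [48, 51, 47, 50]
group_ii = [30, 72, 90, 4]

cv_i = coefficient_of_variation(group_i)
cv_ii = coefficient_of_variation(group_ii)

for name, cv in (("(i)", cv_i), ("(ii)", cv_ii)):
    label = "heterogeneous" if cv > 33 else "homogeneous"
    print(name, round(cv, 1), label)
```

Both groups have the same mean of 49, so the difference in CV comes entirely from the difference in spread.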
QUARTILE DEVIATION
Method: QD = $\frac{Q_3 - Q_1}{2}$
There are two methods – the median method and the formula method.
2. Find the median (i.e. the score at the $\frac{n+1}{2}$th position) for the data set. The median
divides the distribution into two equal parts.
3. Find the median for the first half/part. This median becomes Q1, the first quartile.
4. Find the median for the second half/part. This median becomes the Q3, third quartile.
Example.
Given the following scores, 8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9.
[Figure: the scores with Q1, the median and Q3 marked]
Given the following scores, 8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9, after arranging them in
ascending order as 4, 6, 7, 8, 9, 10, 12, 13, 18, 22, 25:
Q1 = $\frac{1}{4}(n+1)$th position → $\frac{1}{4}(12)$ = 3rd position, so Q1 = 7
Q3 = $\frac{3}{4}(n+1)$th position → $\frac{3}{4}(12)$ = 9th position, so Q3 = 18
$QD = \frac{Q_3 - Q_1}{2} = \frac{18 - 7}{2} = \frac{11}{2} = 5.5$
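The position rule used above, k(n + 1)/4 for the k-th quartile, is the same rule that Python's `statistics.quantiles` applies with `method="exclusive"`. The sketch below uses it to check the example; it is offered as a cross-check, not as the method prescribed by these notes.

```python
from statistics import quantiles

scores = [8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9]

# quantiles() sorts the data itself. With method="exclusive" the k-th
# quartile sits at position k(n + 1)/4 in the ordered data, and values
# are interpolated when that position is not a whole number.
q1, q2, q3 = quantiles(scores, n=4, method="exclusive")

qd = (q3 - q1) / 2      # quartile deviation (semi-interquartile range)
print(q1, q2, q3, qd)
```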
$$Q_1 = L_{Q1} + \left(\frac{\frac{N}{4} - cf}{f_{Q1}}\right) i$$ where
LQ1 is the lower-class boundary of the lower quartile class
N is the total frequency
cf is the cumulative frequency of the class just below the lower quartile class
i is the class size/width
fQ1 is the frequency of the lower quartile class
$$Q_3 = L_{Q3} + \left(\frac{\frac{3N}{4} - cf}{f_{Q3}}\right) i$$ where
LQ3 is the lower-class boundary of the upper quartile class
N is the total frequency
cf is the cumulative frequency of the class just below the upper quartile class
i is the class size/width
fQ3 is the frequency of the upper quartile class
Example
Classes     Midpoint    Freq    Cum Freq
               X          f        cf
46 – 50       48          4        50
41 – 45       43          6        46
36 – 40       38         10        40
31 – 35       33         12        30
26 – 30       28          8        18
21 – 25       23          7        10
16 – 20       18          3         3
Total                    50
Step 1. Identify the quartile class. It is the class that will contain the quartile of interest.
Find the value of $\frac{N}{4}$ for the lower quartile and $\frac{3N}{4}$ for the upper quartile (where N is the
total frequency) as positions. Checking from the cumulative frequency column, find the number
equal to the position or the smallest number that is greater than the position. From the table
above, $\frac{N}{4} = \frac{50}{4} = 12.5$, therefore the number is 18, and $\frac{3N}{4} = \frac{150}{4} = 37.5$, therefore the number
is 40. The classes that these numbers belong to are the quartile classes. From the table above,
the lower quartile class is 26 – 30 and the upper quartile class is 36 – 40.
Step 2. Use the formulae below to obtain the lower and upper quartiles.
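$$Q_1 = L_{Q1} + \left(\frac{\frac{N}{4} - cf}{f_{Q1}}\right) i = 25.5 + \left(\frac{\frac{50}{4} - 10}{8}\right) 5 = 25.5 + \left(\frac{12.5 - 10}{8}\right) 5 = 25.5 + \left(\frac{2.5}{8}\right) 5 = 25.5 + 1.5625 = 27.06$$
$$Q_3 = L_{Q3} + \left(\frac{\frac{3N}{4} - cf}{f_{Q3}}\right) i = 35.5 + \left(\frac{\frac{150}{4} - 30}{10}\right) 5 = 35.5 + \left(\frac{37.5 - 30}{10}\right) 5 = 35.5 + \left(\frac{7.5}{10}\right) 5 = 35.5 + 3.75 = 39.25$$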
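The grouped quartile formulas can be sketched as one function that takes the quartile number k (1 or 3). The function name `grouped_quartile` is hypothetical, and the lower class boundary is assumed to be the lower class limit minus 0.5; the data are the frequency table from the example above.

```python
def grouped_quartile(k, lower_limits, frequencies, class_size):
    """k-th quartile (k = 1, 2 or 3) of a grouped frequency distribution.

    Classes are listed from the lowest upward. The lower class boundary is
    assumed to be the lower class limit minus 0.5.
    """
    n = sum(frequencies)
    position = k * n / 4                # N/4 for Q1, 3N/4 for Q3
    cumulative = 0                      # cf of the classes below the current one
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= position:  # this is the quartile class
            return (lower - 0.5) + (position - cumulative) / f * class_size
        cumulative += f
    raise ValueError("the distribution has no observations")


lower_limits = [16, 21, 26, 31, 36, 41, 46]
frequencies = [3, 7, 8, 12, 10, 6, 4]

q1 = grouped_quartile(1, lower_limits, frequencies, class_size=5)
q3 = grouped_quartile(3, lower_limits, frequencies, class_size=5)
qd = (q3 - q1) / 2
print(round(q1, 2), round(q3, 2), round(qd, 2))
```

With k = 2 the same function gives the grouped median, since the median is the second quartile.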
OPTIONAL
Using Microsoft Excel
First/Lower Quartile
1. Open Excel
2. Type in data to be used in one column, if data is not yet entered.
3. Click an empty cell where you want the result to be and type in Q1 (Lower Quartile).
4. Click the empty cell directly below where you typed Q1.
5. Click white space to the right of the fx symbol.
6. Type in =QUARTILE(cell number where data begins from: cell number where data
ends at, 1). E.g. =QUARTILE(B2:B32, 1). This means that data begins at cell B2
and ends at cell B32, and 1 means first or lower quartile.
7. Press Enter. The Q1, first/lower quartile is given in the empty cell clicked.
Third/Upper Quartile
1. Open Excel
2. Type in data to be used in one column, if data is not yet entered.
3. Click an empty cell where you want the result to be and type in Q3 (Upper Quartile).
4. Click the empty cell directly below where you typed Q3.
5. Click white space to the right of the fx symbol.
6. Type in =QUARTILE(cell number where data begins from: cell number where data
ends at, 3). E.g. =QUARTILE(B2:B32, 3). This means that data begins at cell B2
and ends at cell B32, and 3 means third or upper quartile.
7. Press Enter. The Q3, third/upper quartile is given in the empty cell clicked.
An example is shown below.
CV = $\frac{QD}{Mdn} \times 100$. If the value of CV is greater than 33, the group is heterogeneous,
otherwise it is homogeneous.
With this information, the teacher has to adopt a teaching method to suit each group.
3. It does not make use of all the information provided by the scores.
UNIT FIVE
There are two main measures. These are Percentiles and Percentile Ranks, Z scores and T
scores. Z scores and T scores are often referred to as standard scores.
The main purpose of these measures is to describe an individual’s position in relation to a
known group or the norm group.
PERCENTILES
Definition: They are points in a distribution below which a given percent, P, of the cases lie.
There are 99 percentiles that divide a distribution into 100 equal parts.
Percentiles are individual scores.
Notation: P40 = 60. Sixty is the score below which 40% of the scores lie in a specific
group after the scores have been arranged sequentially. This means that a
student who obtains a score of 60 has done better than 40% of the members in
the specific group.
P75 = 50. Fifty is the score below which 75% of the scores lie in a specific
group after the scores have been arranged sequentially. This means that a
student who obtains a score of 50 has done better than 75% of the members in
the specific group.
A score in one group may be a different percentile in another group.
For example, in Statistics Quiz 1, a student with a score of 15 may be at P90 in the Social
Science group but the same score may put the student at P85 in the Home Economics
group.
P50 is the same as the median. P25 is the first quartile and P75 is the third quartile.
PERCENTILE RANKS
Definition: The percentage of cases falling below a given point on the measurement scale. It
is the position on a scale of 100 at which an individual score lies.
Notation: PR of 60 = 75. Seventy-five is the position for a score of 60 when the
distribution is divided into 100 parts. This means that a student who obtains a
score of 60 has 75% of the scores falling below him/her in the group.
The easiest way to obtain percentiles and percentile ranks is to use the ogive (cumulative
percentage graph).
[Figure: ogive, cumulative percentage frequency (0 to 100) against scores (0 to 45)]
From the ogive, P60 = 34. PR of a score of 26 is 40.
Formula: $Z = \frac{X - \bar{X}}{s}$, T = 50 + 10Z, where mean is 50 and standard deviation is 10.
Example. Given that a student obtained 15 in a quiz with a mean of 12 and a standard
deviation of 2. The Z and T scores become
$Z = \frac{15 - 12}{2} = 1.5$, T = 50 + 10(1.5) = 65
For z scores, 0 is the mean score. Positive scores are scores above the mean (average)
and negative scores are scores below the mean (average).
An individual’s performance can be described as far above average, above average,
just above average, just below average and far below average.
In case of T scores, 50 is the mean score. Scores greater than 50 are above average
and scores less than 50 are below average.
Z scores range between ─ 4 and + 4 while T scores are between 10 and 90.
Self-Practice
1. A student had a Z score of 2.5. The mean for the class was 60 with a standard deviation
of 4.0. What was the student’s observed score?
$Z = \frac{X - \bar{X}}{s}$ → $2.5 = \frac{X - 60}{4}$ → 10 = X − 60 → X = 10 + 60 = 70
s 4
2. A student obtained a raw score of 70 in an examination. If the raw score gives her a Z-
score of 3.5, what would be the class mean if it is known that the standard deviation is 5.0?
$Z = \frac{X - \bar{X}}{s}$ → $3.5 = \frac{70 - \bar{X}}{5}$ → $17.5 = 70 - \bar{X}$ → $\bar{X}$ = 70 − 17.5 = 52.5
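The Z and T score formulas, and the two rearrangements used in the self-practice questions, can be written as small helper functions. The function names below are illustrative and not part of the notes.

```python
def z_score(x, mean, sd):
    """Z = (X - mean) / s."""
    return (x - mean) / sd


def t_score(z):
    """T = 50 + 10Z."""
    return 50 + 10 * z


def raw_score_from_z(z, mean, sd):
    """Rearranged Z formula: X = mean + Z * s."""
    return mean + z * sd


def mean_from_z(x, z, sd):
    """Rearranged Z formula: mean = X - Z * s."""
    return x - z * sd


z = z_score(15, mean=12, sd=2)
print(z, t_score(z))                              # worked example
print(raw_score_from_z(2.5, mean=60, sd=4.0))     # self-practice 1
print(mean_from_z(70, z=3.5, sd=5.0))             # self-practice 2
```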
USES
1. It helps the teacher to know an individual’s position in relation to the rest of the class.
A student with a Z score of 3.2 is performing far above average.
2. It enables the teacher to compare student’s performances in different subjects to know
individual strengths and weaknesses.
Salome has done better in Mathematics than Social Studies, considering the class
performance.
3. It helps the teacher to guide and counsel the student to choose the correct course for a
future career and vocation
UNIT SIX
MEASURES OF RELATIONSHIPS
Are stock prices related to the price of gold? Is unemployment related to stealing? Is
the academic performance of students related to attendance? Correlation can answer these
questions, and there is no statistical technique more useful or more abused than correlation.
Concept
Natural relationships exist in the world. Parents and children as well as twins have things in
common. Males are normally attracted to females and rain results in good harvest.
The concept of correlation provides information about the extent of the relationship between
two variables. Two variables are correlated if they tend to ‘go together’. For example, if
high scores on one variable tend to be associated with high scores on a second variable, then
both variables are correlated.
Correlations aim at identifying relationships between variables and also to be able to predict
performances based on known results.
The statistical summary of the degree and direction of the linear relationship or association
between any two variables is given by the coefficient of correlation. Correlation coefficients
range between -1.0 and +1.0. Correlation coefficients are normally represented by the
symbols, r and ρ (rho).
Scatter plots
A scatter plot or scatter diagram shows the nature of the relationship between any 2 variables.
To obtain a scatter plot, marks are made on a graph representing the intersection of the two
variables. Scatter plots could either be linear or curvilinear.
Examples
Linear relationship
Assumptions
1. The variables are random. Neither the values of X nor Y are predetermined.
2. The relationship between the variables is linear.
3. The probability distribution of X’s, given a fixed Y, is normal, i.e. the sample is
drawn from a joint normal distribution.
4. The standard deviation of X’s, given each value of Y is assumed to be the same, just
as the standard deviation of Y’s given each value of X is the same.
(a) Direction: Positive, (+) High values go with high values and low values go
with low values.
Negative (─) High values go with low values and low values go with
high values.
Perfect linear positive correlation Perfect linear negative correlation
Low linear positive correlation
1. Pearson Product Moment correlation coefficient (r). This is applicable when both
variables are continuous in nature. It uses interval and ratio scale data. For example,
the relationship between test scores and age of students.
2. Spearman’s rank correlation coefficient (ρ). This is suitable for variables that are both
continuous and ranked. It uses ordinal scale data. For example, ranks in terms of
school attendance and age
3. Phi coefficient (φ). This is used when both variables are natural dichotomies. It is
also applicable for nominal data. For example, the relationship between gender and
political party affiliation.
4. Point biserial correlation coefficient (rpb). This is applicable when one variable is
continuous and the other is a natural dichotomy. It combines nominal scale data with
either interval or ratio scale data. For example, the relationship between gender and
test scores
Computational examples
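$$r = \frac{\text{Covariance}(X, Y)}{S_X \cdot S_Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n\, S_X \cdot S_Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}} \quad \ldots (1)$$
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} \quad \ldots (2)$$
Note: $\bar{X} = 6$ and $\bar{Y} = 7$
Using Formula 1:
$$r = \frac{33}{\sqrt{(44)(50)}} = \frac{33}{46.9} = 0.7$$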
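Formulas (1) and (2) for the Pearson coefficient are algebraically the same, which can be checked numerically. The paired scores below are hypothetical and chosen only to illustrate the arithmetic; they are not the data behind the worked example, and the function names are illustrative.

```python
from math import sqrt

# Hypothetical paired scores, used only to illustrate the two formulas.
x = [2, 4, 5, 7, 9, 9]
y = [3, 6, 5, 8, 9, 11]


def pearson_deviation(x, y):
    """Formula (1): sums of products and squares of deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pearson_raw(x, y):
    """Formula (2): raw-score form using sums of X, Y, XY, X^2 and Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r1 = pearson_deviation(x, y)
r2 = pearson_raw(x, y)
print(round(r1, 3), round(r2, 3))
```

Formula (2) avoids computing the means first, which is why it is convenient for hand calculation from a table of X, Y, XY, X² and Y² columns.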
1. The Spearman rank correlation coefficient (ρ): For ordinal scale variables
$$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)}$$
Given the following scores:
No.     X      Y      Rank of X    Rank of Y      d        d²
6       43     12       10           10           0        0.0
7       48     19        2.5          2.5         0        0.0
8       45     20        6.5          1           5.5     30.25
9       45     16        6.5          7.5        -1.0      1.0
10      44     15        8.5          9          -0.5      0.25
Total                                                     43.00
________________________________________________________
$$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6(43)}{10(100 - 1)} = 1 - \frac{258}{990} = 1 - 0.26 = 0.74$$
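The Spearman calculation can be sketched in code, including the convention of sharing the average rank among tied scores (as with the ranks 6.5 and 2.5 in the table above). The function names are hypothetical, and ranking the highest score as 1 is an assumption that matches the ranks shown in the table.

```python
def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; ties share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1        # first position held by this value
        count = ordered.count(v)            # number of tied scores
        ranks[v] = first + (count - 1) / 2  # mean of the tied positions
    return [ranks[v] for v in values]


def rho_from_sum_d2(sum_d2, n):
    """Spearman's rho from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_rho(x, y):
    """Rank both variables, then apply the rank-difference formula."""
    d2 = sum((rx - ry) ** 2
             for rx, ry in zip(average_ranks(x), average_ranks(y)))
    return rho_from_sum_d2(d2, len(x))


# The worked example: sum of d squared = 43 for N = 10 students.
print(round(rho_from_sum_d2(43, 10), 2))
```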
$$\Phi = \sqrt{\frac{\chi^2}{n}}$$
This is used when there are only two sub-categories for rows as well as columns i.e. 2x2
$$C = \sqrt{\frac{\chi^2}{n + \chi^2}}$$
This is used when there are more than two sub-categories for either row or column,
i.e. 2x3, 3x3, 2x4, 3x4, etc.
[Table: cross-tabulation of Result by Gender (Male, Female), with totals]
The figures in bold and in bracket are the expected counts in each cell.
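$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(150 - 125)^2}{125} + \frac{(100 - 125)^2}{125} + \frac{(50 - 75)^2}{75} + \frac{(100 - 75)^2}{75}$$
$$\Phi = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{26.66}{200}} = 0.365$$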
The result shows that there is a weak positive association between gender and passing a
driving test.
                     Hall of Residence
Region of Birth      Hall 1    Hall 2    Hall 3    Total
Region 1               40        30        30       100
                      (30)      (30)      (40)
Region 2               50        40        60       150
                      (45)      (45)      (60)
Region 3               30        50        70       150
                      (45)      (45)      (60)
Total                 120       120       160       400
The figures in bold and in bracket are the expected counts in each cell.
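$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
$$= \frac{(40 - 30)^2}{30} + \frac{(30 - 30)^2}{30} + \frac{(30 - 40)^2}{40} + \frac{(50 - 45)^2}{45} + \frac{(40 - 45)^2}{45} + \frac{(60 - 60)^2}{60} + \frac{(30 - 45)^2}{45} + \frac{(50 - 45)^2}{45} + \frac{(70 - 60)^2}{60}$$
= 3.3 + 0.0 + 2.5 + 0.56 + 0.56 + 0.0 + 5.0 + 0.56 + 1.67 = 14.15
$$C = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{14.15}{400 + 14.15}} = \sqrt{\frac{14.15}{414.15}} = \sqrt{0.0342} = 0.185$$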
The result shows that there is a very weak positive association between region of birth and
hall of residence.
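For tables larger than 2x2 the same steps apply cell by cell. The following Python sketch wraps them in two small functions and applies them to the Hall of Residence table. The function names are illustrative. Because the code carries every term at full precision, it gives a chi-square of about 14.17 rather than the 14.15 obtained from rounded terms, while C still rounds to 0.185.

```python
from math import sqrt

def chi_square_stat(observed):
    """Return (chi-square, n) for a table of observed counts given as rows."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count for cell (i, j)
            chi_square += (o - e) ** 2 / e
    return chi_square, n

def contingency_coefficient(observed):
    """C = sqrt(chi-square / (n + chi-square))."""
    chi_square, n = chi_square_stat(observed)
    return sqrt(chi_square / (n + chi_square))

# Region of birth (rows) by hall of residence (columns)
halls = [[40, 30, 30],
         [50, 40, 60],
         [30, 50, 70]]

chi_square, n = chi_square_stat(halls)   # about 14.17, n = 400
c = contingency_coefficient(halls)       # about 0.185
```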
1. It is useful for selection and placement. For example, if mathematics scores relate well
with scores in chemistry, then mathematics scores can be used for selection into a
chemistry class without conducting a chemistry selection examination.
2. It is used to determine the reliability of standardized and classroom tests. The
Spearman-Brown split-half method uses correlation coefficients.
3. It aids in the provision of evidence for the validity of assessment instruments.
Construct and criterion-related validity evidence is obtained through the computation
of the correlation between two variables.
4. It puts the teacher in a position to predict the future performance of a student. An
established relationship between two subjects is often used as the basis for predicting
performance, but not with 100% certainty. For example, if those with aggregate 6
from WASHSCE have been found in the University of Cape Coast to be obtaining First
Class degrees, then it can be predicted that anyone with WASHSCE aggregate 6
would do well in the University.
5. It is useful for research purposes. A study of the relationship between study habits and
the academic performance of students in the University of Cape Coast would use
correlations.
UNIT SEVEN
Conditions/Assumptions
1. The possible values of the independent variable, X, are fixed in advance.
2. The true relationship between the variables, X and Y, is linear and expressed by the
equation,
Y = a + bX + e_i, known as the regression equation. Here a and b are parameters of the
population and are estimated, while e_i is the random error. The equation is the line of
regression of Y on X; a is the Y intercept and b is the regression coefficient, or the slope
of the regression line.
3. The probability distribution of Y’s, given a fixed X, is normal.
STEPS
1. The first step is to present the variables on a scatter diagram to be sure that the relationship
between the variables is linear.
2. Normal equations are solved to obtain the equations for the parameter estimates in raw
score form.
Y = a + bX
∑Y = na + b∑X
∑XY = a∑X + b∑X²
Intercept, $a = \frac{\sum Y - b\sum X}{n}$ OR $a = \bar{Y} - b\bar{X}$
The intercept is the point on the Y axis where X, the independent variable, has a value of 0.
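For reference, solving the two normal equations simultaneously gives both estimates in closed form. The derivation below is the standard one, written with $\bar{X} = \sum X / n$ and $\bar{Y} = \sum Y / n$. It is added here as a supporting step and is not taken from the original notes.

```latex
\begin{align*}
\sum Y &= na + b\sum X
  \;\Rightarrow\; a = \frac{\sum Y - b\sum X}{n} = \bar{Y} - b\bar{X} \\
\sum XY &= a\sum X + b\sum X^2
  = (\bar{Y} - b\bar{X})\,n\bar{X} + b\sum X^2 \\
\Rightarrow\; b\left(\sum X^2 - n\bar{X}^2\right) &= \sum XY - n\bar{X}\bar{Y} \\
\Rightarrow\; b &= \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}
\end{align*}
```

The slope b is computed first, and the intercept a then follows from the first normal equation.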
Example
The following scores were obtained in Quiz 1 and Final Examination.
Quiz 1 (X)   Final Exam (Y)   XY       X²
18           75               1350     324
12           55               660      144
10           45               450      100
20           85               1700     400
15           65               975      225
15           65               975      225
14           60               840      196
10           60               600      100
12           50               600      144
11           50               550      121
18           70               1260     324
16           75               1200     256
9            45               405      81
13           60               780      169
17           70               1190     289
ΣX = 210     ΣY = 930         ΣXY = 13535   ΣX² = 3098
With n = 15, $\bar{X} = 210/15 = 14$ and $\bar{Y} = 930/15 = 62$:
$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{13535 - 15(14)(62)}{3098 - 15(14)^2} = \frac{13535 - 13020}{3098 - 2940} = \frac{515}{158} = 3.26$$
OR
$$b = r\,\frac{S_Y}{S_X} = 0.93 \times \frac{11.77}{3.36} = 3.26$$
$$a = \frac{\sum Y - b\sum X}{n} = \frac{930 - 3.26(210)}{15} = \frac{245.4}{15} = 16.4 \quad \text{OR} \quad a = \bar{Y} - b\bar{X} = 62 - 3.26(14) = 16.4$$
Use in prediction
After obtaining the estimates of a and b, the least squares regression line can be drawn using
two values for X (including X = 0 to obtain the intercept). Corresponding Y values are
obtained for the X values and these values are used to draw the estimated regression line.
Values can then be read from the regression line to obtain the predicted values.
Method 1
Given $\hat{Y} = 16.4 + 3.26X$,
select two values, say 0 and 10, for X and compute the corresponding Y values.
For example:
X = 0, Ŷ = 16.4 + 3.26(0) = 16.4
X = 10, Ŷ = 16.4 + 3.26(10) = 49.0
Plot the points (0, 16.4) and (10, 49.0) on a graph sheet and draw a
straight line through them. Then estimate any value of Y given an X value on the regression line.
[Figure: the regression line Ŷ = 16.4 + 3.26X, passing through (0, 16.4) and (10, 49.0)]
Method 2
The estimated regression equation can be used by substituting the given X values to obtain
the predicted values for Y.
i. What would be the exam score for a student who obtains 12.5 in Quiz 1?
Ŷ = 16.4 + 3.26(12.5) = 57.15 ≈ 57
ii. A student obtained 72 in her exam. However, she did not take part in Quiz 1.
What would be an estimate of her Quiz 1 score?
72 = 16.4 + 3.26X
3.26X = 72 − 16.4 = 55.6
$$X = \frac{55.6}{3.26} \approx 17$$
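The whole calculation can be reproduced with a short Python sketch using the fifteen pairs of Quiz 1 and Final Exam scores. The function names (`fit_line`, `predict_exam`, `estimate_quiz`) are illustrative. Carried at full precision, the estimates are b ≈ 3.26 and a ≈ 16.4.

```python
quiz = [18, 12, 10, 20, 15, 15, 14, 10, 12, 11, 18, 16, 9, 13, 17]
exam = [75, 55, 45, 85, 65, 65, 60, 60, 50, 50, 70, 75, 45, 60, 70]

def fit_line(x, y):
    """Least squares estimates (a, b) for the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line(quiz, exam)   # a is about 16.37, b is about 3.26

def predict_exam(quiz_score):
    """Method 2: substitute X into the estimated regression equation."""
    return a + b * quiz_score

def estimate_quiz(exam_score):
    """Rearrange Y = a + bX for X, as in part ii of the example."""
    return (exam_score - a) / b

exam_for_12_5 = predict_exam(12.5)   # about 57
quiz_for_72 = estimate_quiz(72)      # about 17
```

Strictly speaking, estimating X from Y would call for the regression of X on Y. The `estimate_quiz` function simply mirrors the rearrangement used in the example.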
UNIT EIGHT
The horizontal axis is measured in terms of standard deviation units. The values decrease to
the left and increase to the right from the centre.
Suppose the standard deviation is 4 with a mean of 21. The distribution takes the form below.
Standard deviation units:   -3    -2    -1    μ     +1    +2    +3
Scores:                      9    13    17    21    25    29    33
Symbol
A variable which is distributed normally has the symbol X ~ N(μ, σ²), where μ is the
mean and σ² is the variance. This is read as 'the variable, X, is distributed as normal with
a given mean and a given variance'.
Features
1. It is a bell-shaped curve.
2. It is unimodal.
3. It is symmetrical.
4. It is asymptotic.
5. The total area under the curve is 1.0.
6. The mean, mode and median are all equal.
7. When the values of a normal distribution have been converted to standard z-scores, a
standard normal curve is obtained. The standard normal curve has a mean of 0 and a
standard deviation (and variance) of 1, so that mean = mode = median = 0. A score
equal to the mean therefore has a z value of 0.
8. Areas under the normal curve. Note that these areas are obtained from the table on
normal distributions. Refer to Appendix to follow the areas.
4. The area between μ and μ + 1.645σ is 0.4500 (45.00%), so the area within μ ± 1.645σ is 0.90 (90%). The value 1.65 is also used in place of 1.645.
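The areas read from the normal distribution table can also be computed directly. The sketch below uses the error function from Python's standard `math` module. The function name `area_zero_to_z` is illustrative.

```python
from math import erf, sqrt

def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z, for z >= 0."""
    return 0.5 * erf(z / sqrt(2))

one_sided = area_zero_to_z(1.645)   # about 0.4500
central = 2 * one_sided             # about 0.90, the area within 1.645 SD of the mean
area_to_1 = area_zero_to_z(1.0)     # about 0.3413
area_to_2 = area_zero_to_z(2.0)     # about 0.4772
```

By symmetry, the area between a negative z and 0 equals the area between 0 and the corresponding positive z, which is why a table of positive values is sufficient.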
Basic applications
Finding Probabilities
1. The distribution for a Statistics examination is normal with a mean of 60 and variance
of 64, i.e. X ~ N(60, 64). A student is selected at random from the class. What is the
probability that the student selected obtains a score above 68? Above 76? Below 52?
The standard deviation is σ = √64 = 8.
$P(X > 68) = P\left(Z > \frac{68 - 60}{8}\right)$
$= P\left(Z > \frac{8}{8}\right)$
$= P(Z > 1)$
$= 0.5000 - 0.3413$
$= 0.1587$
2. The distribution for a Statistics examination is normal with a mean of 60 and variance
of 64, i.e. X ~ N(60, 64). A student is selected at random from the class. What is the
probability that the student selected obtains a score between 52 and 76? Between 68
and 76?
$P(52 < X < 76) = P\left(\frac{52 - 60}{8} < Z < \frac{76 - 60}{8}\right)$
$= P\left(\frac{-8}{8} < Z < \frac{16}{8}\right)$
$= P(-1 < Z < 2)$
$= 0.3413 + 0.4772$
$= 0.8185$
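These probabilities can be checked with Python's standard `statistics.NormalDist` class, which works with the standard deviation (8) rather than the variance (64). The sketch also evaluates the parts of the two questions that are not worked out in the text. Results may differ from table-based answers in the fourth decimal place because the table entries are rounded.

```python
from statistics import NormalDist

# X ~ N(60, 64): mean 60, variance 64, so the standard deviation is 8
scores = NormalDist(mu=60, sigma=8)

p_above_68 = 1 - scores.cdf(68)                  # P(Z > 1),  about 0.1587
p_above_76 = 1 - scores.cdf(76)                  # P(Z > 2),  about 0.0228
p_below_52 = scores.cdf(52)                      # P(Z < -1), about 0.1587
p_52_to_76 = scores.cdf(76) - scores.cdf(52)     # P(-1 < Z < 2), about 0.8186
p_68_to_76 = scores.cdf(76) - scores.cdf(68)     # P(1 < Z < 2),  about 0.1359
```

`cdf(x)` returns the area to the left of x, so an upper-tail probability is one minus the cdf and an interval probability is the difference of two cdf values.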
1. Given that a distribution of scores is normal, with a mean of 16 and a standard deviation of 2.
About what percent of students obtained scores less than 12? More than 14?
$P(X < 12) = P\left(Z < \frac{12 - 16}{2}\right)$
$= P\left(Z < \frac{-4}{2}\right)$
$= P(Z < -2)$
$= 0.5000 - 0.4772$
$= 0.0228$ (About 2%. Actual 2.28%)
2. Given that a distribution is normal, with a mean of 50 and a standard deviation of 10.
From a class of 2000 students, approximately how many students obtained scores above
70? Between 40 and 60?
$P(X > 70) = P\left(Z > \frac{70 - 50}{10}\right)$
$= P\left(Z > \frac{20}{10}\right)$
$= P(Z > 2)$
$= 0.5000 - 0.4772$
$= 0.0228$
Number of students: 0.0228 x 2000 = 45.6 ≈ 46
3. In a promotion examination, a pass mark was fixed at 40. Given that the distribution is
normal, with a mean of 50 and a standard deviation of 5.1, approximately how many
students failed from a class of 400?
$P(X < 40) = P\left(Z < \frac{40 - 50}{5.1}\right)$
$= P\left(Z < \frac{-10}{5.1}\right)$
$= P(Z < -1.96)$
$= 0.5000 - 0.4750$
$= 0.025$
Number of students: 0.025 x 400 = 10
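Converting a probability into an expected number of students is a single multiplication by the class size. A minimal Python sketch, assuming a hypothetical helper named `expected_count`, is given below. Leaving `lower` or `upper` as `None` means the interval is open on that side.

```python
from statistics import NormalDist

def expected_count(class_size, mean, sd, lower=None, upper=None):
    """Expected number of students scoring between lower and upper."""
    dist = NormalDist(mean, sd)
    p_low = dist.cdf(lower) if lower is not None else 0.0
    p_high = dist.cdf(upper) if upper is not None else 1.0
    return class_size * (p_high - p_low)

# Mean 50, SD 10, class of 2000
above_70 = expected_count(2000, 50, 10, lower=70)                  # about 45.5
between_40_and_60 = expected_count(2000, 50, 10, lower=40, upper=60)  # about 1365

# Pass mark 40, mean 50, SD 5.1, class of 400
failed = expected_count(400, 50, 5.1, upper=40)                    # about 10

# Percent below 12 when the mean is 16 and the SD is 2
percent_below_12 = expected_count(100, 16, 2, upper=12)            # about 2.28
```

The small differences from the hand-worked answers (for example 45.5 against 45.6) come from rounding the table value to four decimal places before multiplying.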
APPENDIX A
NORMAL DISTRIBUTION TABLE
This z-table (normal distribution table) shows the area under the standard normal curve between z = 0 and a
positive value of z. For the cumulative area to the left of z, use the second table that follows.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.5 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
3.6 0.4998 0.4998 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.7 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.8 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
Cumulative (Left-Tail) Z-Table
This table shows the area to the left of Z, that is, the cumulative area up to Z. To find the area
between z = 0 and a positive value, use the first z-table above instead.
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
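Any row of either table can be regenerated, which is a convenient way to check an entry that looks doubtful. The sketch below uses `statistics.NormalDist` from the Python standard library. The function name `table_row` and its `cumulative` flag are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def table_row(z_tenth, cumulative=False):
    """Ten entries for the row starting at z_tenth (columns 0.00 to 0.09).

    cumulative=False gives the area between 0 and z (first table);
    cumulative=True gives the area to the left of z (second table).
    """
    row = []
    for hundredth in range(10):
        z = z_tenth + hundredth / 100
        area = standard_normal.cdf(z)
        row.append(round(area if cumulative else area - 0.5, 4))
    return row

row_first_table = table_row(1.0)                     # starts 0.3413, 0.3438, ...
row_second_table = table_row(0.2, cumulative=True)   # starts 0.5793, 0.5832, ...
```

Each entry of the second table is the matching entry of the first table plus 0.5, since the area to the left of z = 0 is exactly one half.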