Basic Statistics in Psychology
DSC-06
DEPARTMENT OF DISTANCE AND CONTINUING EDUCATION
UNIVERSITY OF DELHI
Editorial Board
Prof. N.K. Chadha
Dr. Swati Jain
Content Writers
Dr. Deepesh Rathore, Dr. Poonam Phogat,
Dr. Shweta Chaudhary, Dr. Poonam Vats,
Dr. Halley, Dr. Swati Jain
Academic Coordinator
Mr. Deekshant Awasthi
Published by:
Department of Distance and Continuing Education
Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110 007
Printed by:
School of Open Learning, University of Delhi
Printed at: Vikas Publishing House Pvt. Ltd. Plot 20/4, Site-IV, Industrial Area Sahibabad, Ghaziabad - 201 010 (600 Copies)
SYLLABUS
Basic Statistics in Psychology
Syllabus and Mapping
Unit – I
Syllabus: Introduction to Descriptive Statistics: Level of Measurement, Measures of Central Tendency: Mean, Median and Mode (Characteristics and Computation); Measures of Variability: Range, Semi-Interquartile Range, Standard Deviation, Variance (Characteristics and Computation).
Mapping: Lesson 1: Relevance of Statistics; Lesson 2: Central Tendency
Unit – II
Syllabus: Score Transformations: Standard Scores and Percentile Ranks (Characteristics and Computation); Normal Probability Curve: Characteristics and Application of Normal Probability Curve.
Mapping: Lesson 3: Standard Scores
Unit – III
Syllabus: Analysis of Relationships: Meaning, Direction and Degree of Correlation; Factors Affecting Pearson’s Correlation; Computation of Correlation: Pearson’s Coefficient Correlation and Spearman’s Rank Order Correlation; Prediction and Simple Regression (Concept and Calculation).
Mapping: Lesson 4: Analysis of Relationships
CONTENTS
1.9 Summary
1.10 Answers to In-Text Questions
1.11 Glossary
1.12 Self-Assessment Questions
1.13 References
1.14 Suggested Readings
Lesson 2 Central Tendency
2.1 Learning Objectives
2.2 Introduction
2.3 Measures of Central Tendency: Definition, Properties and Comparison
2.3.1 Mean, Median, and Mode
2.3.2 Comparison of Mean, Median and Mode
2.4 Calculation of Mode, Median and Mean from Raw Scores
2.5 Effects of Linear Score Transformations on Measures of
Central Tendency
2.6 Measures of Variability Range; Semi-Interquartile Range; Variance;
Standard Deviation (Properties and Comparison)
2.6.1 Range and Semi - Interquartile
2.6.2 Variance
2.6.3 Standard Deviation
2.6.4 Quartile Deviation
2.7 Calculation of Variance and Standard Deviation
2.8 Effects of Linear Score Transformations on Measures of Variability
2.9 Summary
2.10 Answers to In-Text Questions
2.11 Glossary
3.11 Summary
3.12 Answers to In-Text Questions
3.13 Glossary
3.14 Self-Assessment Questions
3.15 References
3.16 Suggested Readings
LESSON 1
RELEVANCE OF STATISTICS
Structure
1.1 Learning Objectives
1.2 Introduction
1.2.1 Psychological Research
1.2.2 Why do Psychologists Carry Out Research?
1.2.3 What are the Different Types of Research?
1.3 Relevance of Statistics in Psychological Research
1.4 Descriptive and Inferential Statistics
1.5 Levels of Measurement
1.6 Grouped Frequency Distribution
1.6.1 Frequency Distribution
1.6.2 Grouped Frequency Distribution
1.6.3 Steps Involved in Creating a Grouped Frequency Distribution
1.6.4 Real Limits vs. Apparent Limits
1.6.5 Relative Frequency Distribution
1.6.6 The Cumulative Frequency
1.7 Graphical Representation of Data (Histogram, Frequency Polygon, Cumulative
Percentage Curve)
1.7.1 Histogram
1.7.2 Frequency Polygon
1.7.3 Cumulative Percentage Curve
1.8 Solved Illustrations
1.9 Summary
1.10 Answers to In-Text Questions
1.2 INTRODUCTION
Researchers also need to have some qualities for the research process to work well,
such as:
The ability to persist in the face of constant setbacks and challenges related to
limited resources, both in terms of time and money.
The ability to show tolerance in the face of ambiguity.
The ability and desire to carry out research in an ethical manner.
The ability to be logical and think rationally.
The ability to change one’s mind in the face of counter-evidence.
The ability to plan and organise research.
The ability to communicate the results of the study in such a way that they are easy for
students as well as professionals to comprehend.
There are five main reasons why a researcher carries out research:
Exploration: It is carried out when we know very little or nothing about a
phenomenon or observation. It tries to address the “what” question in research:
what is that phenomenon? This acts as the first stage in any research. It
usually does not lead to any specific answers but helps in developing a much
more extensive study later on, when sufficient inquiry and evidence gathering has
been done on the research question.
Description: It describes the details of the phenomenon in its current as well as
socio-cultural and historical context, and its relationships to the various variables
present in the environment. It tries to address the “who” and “how” questions,
who is involved? And how does this event happen? The research question is
very clear at this stage and the outcome is a detailed description.
Explanation: It refers to finding the reasons why a phenomenon occurs. This
type of research follows exploratory and descriptive research and
focuses on the “why” question. Past research can also be examined to find
the causes and reasons.
Prediction: It describes when a phenomenon or behaviour is likely to occur
again. This is possible once you have a sufficient understanding of the underlying
reasons and the cause and effect relationship is clearly established. This type of
research focuses on the “when” question.
Control: It refers to changing or influencing behaviour to improve the quality
of life of an individual or group by making constructive changes and modifying
the unhealthy thought patterns or lifestyle choices that people engage in.
In-Text Questions
1. _______ tries to address the “what” question in research.
2. _________ describes when a phenomenon is likely to occur again.
3. ______ provides valuable insights but does not mean that results are correct
all the time.
4. Basic research is also known as ________.
5. A manager using tests of intelligence and personality to select an appropriate
candidate for the organization is an example of the ________ type of research.
manager using tests of intelligence and personality to select the right candidate for their
organisation.
Laboratory vs. Field research
Laboratory research is carried out in a controlled environment, where the researcher
has control over all the variables involved in the research process. The researcher
carefully selects the participants and follows all the steps systematically. An example is
Albert Bandura’s Bobo doll experiment. Field research, on the other hand, is carried
out in natural settings where the researcher does not have much control over the
environment. Such studies therefore rely on observations and surveys
rather than experiments. For example, a researcher interested in analysing
pro-social (helping) behaviour may carry out field research. Since these studies are done in natural settings, their
results are more generalizable than those of
laboratory research.
Quantitative vs. Qualitative research
Quantitative research involves the use of numbers to gather, analyse, and present data.
Here the data is collected primarily through the use of a questionnaire and is reported
in the form of numbers such as mean, percentiles, standard deviation, correlation
coefficients, etc. The use of numbers makes it easier to summarize data collected from
large samples. Since the data is gathered from a large sample size, the result is more
generalizable. Qualitative research on the other hand involves the use of words and
images to gather, analyse, and present data. Here the data is collected using case
study, interview, or observation techniques and analysed using content analysis, thematic
analysis, etc. The data is gathered from a small sample size, as it is difficult to carry out
detailed interviews and observations because of time and cost constraints, and thus
the result is less generalizable.
Cross-sectional vs. Longitudinal research
Cross-sectional research is carried out at one point in time, to capture the level of
a variable at that time. For example, to study the vocabulary level of 5th grade students,
the data will be collected only once, which makes this type of research economical in terms of
the time and money required. Longitudinal research, on the other hand, is carried out
over an extended period of time, to capture the process of change. For example, to study the
vocabulary development of 5th grade students over a period of 6 months, the
data will be collected multiple times to track the level and pace of vocabulary
development. This type of research is therefore much more costly and time
consuming.
In-Text Questions
6. _________ research is carried out in a controlled environment.
7. ______ research applies data analysis techniques like content analysis,
thematic analysis, etc.
8. _______ type of research applies mean, percentiles, standard deviation, etc.
Descriptive statistics
It describes and summarizes the data, which helps researchers clearly identify the
nature of the information available to them after data collection. For example, if a researcher
wants to understand the performance of first year undergraduate psychology students,
then he/she can calculate the mean and standard deviation and plot the students’
performance using a histogram. There are a variety of statistics that are used to describe
data, such as mean, median, mode, standard deviation, range, percentile, percentile
ranks, correlation coefficients, etc.
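As a minimal sketch of how such descriptive statistics can be computed, the following Python example uses the standard `statistics` module. The marks are hypothetical and are used only for illustration; they do not come from this lesson.

```python
import statistics

# Hypothetical marks of ten first-year students (illustrative data only).
marks = [62, 71, 75, 68, 80, 77, 66, 73, 70, 78]

mean_mark = statistics.mean(marks)      # arithmetic mean of the marks
median_mark = statistics.median(marks)  # middle value of the ordered marks
sd_mark = statistics.pstdev(marks)      # standard deviation, dividing by N
mark_range = max(marks) - min(marks)    # highest mark minus lowest mark

print(f"Mean = {mean_mark}, Median = {median_mark}, Range = {mark_range}")
print(f"Standard deviation = {sd_mark:.2f}")
```

Each of these values summarizes the whole set of marks in a single number, which is the purpose of descriptive statistics.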
Inferential statistics
It is used to draw conclusions about a population on the basis of a statistic (such as
the mean) calculated from a sample. For example, suppose we want to understand whether the performance
of first year psychology students of University of Delhi (let’s assume a population of
size 1,000) is the same as the average performance of first year psychology students
of Ambedkar University (mean = 80). In order to achieve our objective, we can
collect sample data (n = 100) from first-year psychology students studying in
different colleges of University of Delhi (5 from each college) and then test our
hypothesis that the average performance of students from both the universities is the
same. To test this hypothesis, we will use a statistical technique
such as the t-test (which will be covered later on) to find out whether there is any difference
in performance. There are a variety of techniques that come under inferential
statistics, such as the z-test, t-test, ANOVA (Analysis of Variance), chi-square test, etc.
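The idea of comparing a sample mean with a known comparison mean can be sketched in Python as follows. The sample marks are hypothetical, and the formula used, t = (sample mean − comparison mean) / (sample SD / √n), is the usual one-sample t statistic; it is shown here only as an illustration, since the t-test itself is covered later.

```python
import math
import statistics

# Hypothetical marks of a small sample of students (illustrative data only).
sample = [78, 82, 75, 90, 85, 70, 88, 79, 81, 77]
comparison_mean = 80  # mean of the comparison group, as in the example above

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)       # sample SD, dividing by n - 1
standard_error = sample_sd / math.sqrt(n)  # expected spread of sample means

# One-sample t statistic: how far the sample mean lies from the
# comparison mean, measured in standard-error units.
t_statistic = (sample_mean - comparison_mean) / standard_error
degrees_of_freedom = n - 1

print(f"Sample mean = {sample_mean}, t = {t_statistic:.3f}, df = {degrees_of_freedom}")
```

The resulting t value would then be compared with a critical value for the given degrees of freedom to decide whether the difference is statistically significant.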
A researcher includes many different variables in the research, such as gender, height,
weight, IQ, motivation, personality, self-esteem, job satisfaction, stress level, etc. For
statistical analysis, the researcher assigns a number to each observation on a variable, for example, an IQ
score of 130, or forms different categories, such as 1 for male and 2 for female.
Psychologist S. S. Stevens (1946) identified four different ways of assigning numbers
to observations, known as measurement scales:
1. Nominal scale
2. Ordinal scale
3. Interval scale
4. Ratio scale
Nominal scale
Nominal (means “name”) scale is used for variables that are qualitative in nature rather
than quantitative, such as gender (male, female, and others). The requirement for
using a nominal scale is that the categories must be mutually exclusive (no observation
can fall into more than one category) and exhaustive (there must
be enough categories to accommodate all the observations). For example, if the results
of the students include two categories, pass and fail, a student who passes the exam
cannot come in the fail category, and vice versa.
Ordinal scale
Ordinal (means “order”) scale is also used in cases where the categories are mutually
exclusive and exhaustive. The ordinal scale is a higher level of measurement than the
nominal scale because, in addition to categorization, we also assign ranks to the categories
in such a way that it is easy to identify which category comes first and which
category comes last. For example, when countries are given ranks such as 1, 2, or 3,
it is clear that 1 is better than 2, which is better than 3.
Interval scale
Interval scale represents the next level of complexity after the nominal and ordinal
scales. This scale has all the properties of an ordinal scale, with the addition that the
difference (distance) between points on this scale is the same across the scale. For example,
when temperature is measured on the Celsius scale, the difference between
30°C and 40°C is the same as the difference between 5°C and 15°C, that is, 10°C.
But on this scale zero is just an arbitrary point; in our example, 0°C does not
mean there is no heat.
Ratio scale
Ratio scale includes all the characteristics of the interval scale and, in addition, has a true
zero point. For example, the Kelvin scale measures temperature
just like the Celsius scale, in that equal distances between points mean the same
thing everywhere on the scale, but in addition it has a true zero point, or absolute zero
temperature, which implies an absence of heat.
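The practical difference between an interval scale and a ratio scale can be illustrated with a short Python sketch. The two temperatures are arbitrary illustrative values; the conversion K = °C + 273.15 is the standard one.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0  # two illustrative temperatures in Celsius
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree with each other.
difference_c = warm_c - cool_c
difference_k = warm_k - cool_k

# Ratios are meaningful only on the ratio (Kelvin) scale, which has a true zero.
ratio_c = warm_c / cool_c  # 2.0, but 20 degrees Celsius is not "twice as hot"
ratio_k = warm_k / cool_k  # only slightly above 1

print(f"Difference: {difference_c:.1f} (Celsius), {difference_k:.1f} (Kelvin)")
print(f"Ratio: {ratio_c:.3f} (Celsius), {ratio_k:.3f} (Kelvin)")
```

The differences agree on both scales, but the ratio computed from Celsius values is misleading because the Celsius zero is arbitrary.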
In-Text Questions
9. _____ type of scale is used for variables that are qualitative in nature rather
than quantitative.
10. In the ________ scale, in addition to categorization, we also assign ranks to the categories.
11. When temperature is measured in degrees Celsius, the _______ scale is used.
1.6 GROUPED FREQUENCY DISTRIBUTION
The data collected comes in a variety of forms and we need to organise the data to
make accurate interpretations. This can be accomplished with the help of frequency
distribution, which illustrates the number of observations for a particular category.
For example, if we want to present the different disciplines selected by students
as their undergraduate majors, it can be presented as shown in Table 1.
Table 1: Frequency of Majors Selected by University of Delhi Students
Majors Frequency
Psychology 800
Mathematics 500
Economics 1200
English 950
Total = 3450
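The frequencies in Table 1 can also be handled programmatically. The sketch below, which uses only the figures from Table 1, computes the total and the proportion of students in each major; the variable names are illustrative.

```python
# Frequencies of majors, taken from Table 1.
major_frequencies = {
    "Psychology": 800,
    "Mathematics": 500,
    "Economics": 1200,
    "English": 950,
}

total_students = sum(major_frequencies.values())

# Proportion of all students who selected each major.
proportions = {
    major: frequency / total_students
    for major, frequency in major_frequencies.items()
}

# Major with the highest frequency.
most_popular_major = max(major_frequencies, key=major_frequencies.get)

for major, frequency in major_frequencies.items():
    print(f"{major:<12} {frequency:>5} {proportions[major]:.1%}")
print(f"Total = {total_students}")
```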
Grouped scores
When you have a wide range of scores, it is better to combine the scores to create
groups of scores. For example, suppose psychology students obtain marks in the statistics
paper as shown in Table 2. We can group these scores into class intervals
(ranges of values that are grouped together) such as 64-65, 66-67, etc.
65 70 73 76 75 74
66 71 73 76 77 73
66 71 75 79 78 76
70 72 75 80 74 86
70 72 76 73 73 83
Table 3 shows what the grouped frequency distribution would look like. As can
be seen, this type of distribution makes it easier to visualise and understand data.
We can see (Table 3) that students scored in the range of 65 (lowest) to 86 (highest),
and that most of the students scored between 71 and 77. But as you can see, there are
different ways of creating class intervals. Column B and column C have the same width
of 3 scores, but depending upon where you start your interval, the frequency of scores
varies. In column B, the interval starts with 65 and in column C it starts with 66, so the
question is: which of the two is the correct way to form intervals? The
answer to this question lies in some clearly defined guidelines for creating intervals. Please
keep in mind that these are simply guidelines, not hard and fast rules.
The guidelines are as follows:
a) Class intervals must be mutually exclusive: It means that the intervals should
not overlap with each other, i.e., scores should not come in more than one
interval.
b) Intervals must be continuous: It means that even if some intervals don’t have
scores in them, one must include those intervals nevertheless. For instance, in
column A, the frequency for the interval 68-69 is 0.
c) Interval containing the highest score should be at the top: Placing the interval containing
the highest value at the top and the interval containing the lowest value at the bottom makes
it easier to understand the frequency distribution.
d) Intervals should have the same width: All the intervals that you create should be
of equal width. For example, within each column in Table 3 you will find the same
interval width.
e) Interval width should be convenient: Interval width should be equal as well
as convenient, such as 2, 3, 4, 5, 10, 20, 50, etc. This makes the data easy to
represent and understand.
f) No. of class intervals: The more class intervals there are, the more accurate the
interpretation will be. If you create fewer class intervals, it will result in wider
intervals and thus a greater loss of accuracy. For instance, compare Column A
with Columns B and C in Table 3.
g) Lower score as a multiple of interval width: It is advisable to make the
lower score of the interval a multiple of the interval width, as this makes
interpretation easier. For instance, in column A of Table 3, 64 is the lowest
score of the class intervals as well as a multiple of 2. This is not the case with
columns B and C.
c) Divide the range by 10 and by 20 to find the largest and smallest interval widths,
then select a convenient width between these values.
d) Find the score at which the lowest (or highest) interval should begin; it
should be a multiple of the interval width.
e) List the class intervals with the highest value at the top, making continuous
intervals of equal width.
f) Use the tally system to count the number of scores within each interval, then convert
the tallies into frequencies.
Let us understand this with the help of the data given in Table 2. The lowest
score is 65 and the highest score is 86. The range is therefore the difference
between the two, which comes out to be 21. Next, we divide the range by 10 and
20, which gives us 2.1 and 1.05. Now, we need to select a convenient interval width
between these two values; let us select 2 as the width of the class interval (symbolized
by i), so i = 2.
Next, we need to find the starting point of the bottom class interval. The
lowest score is 65, so the starting value could be 64 or 65, which would make 64-65 or
65-66 the interval containing the value 65. Only one of these intervals starts with a value
that is a multiple of the width (i = 2), which is 64. Hence, our bottom-most interval will
be 64-65, as shown in Table 3. Now, we create intervals of equal width. Keep in
mind that the class intervals would have been the same if we had started with the topmost
interval (choosing between 85-86 and 86-87).
Create a table of frequency of scores and tally marks, as shown in table 4. The
calculated frequency then can be inserted as shown in table 3, column A. Similarly, this
can be done for column B and C.
Class Interval    Tally        Frequency (f)
86-87             |            1
84-85                          0
82-83             |            1
80-81             |            1
78-79             ||           2
76-77             ||||         5
74-75             ||||         5
72-73             |||| ||      7
70-71             ||||         5
68-69                          0
66-67             ||           2
64-65             |            1
N = 30
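The steps described above can be sketched in Python using the 30 scores from Table 2 and an interval width of i = 2. The variable names are illustrative, and the code assumes whole-number scores.

```python
# The 30 scores from Table 2.
scores = [
    65, 70, 73, 76, 75, 74,
    66, 71, 73, 76, 77, 73,
    66, 71, 75, 79, 78, 76,
    70, 72, 75, 80, 74, 86,
    70, 72, 76, 73, 73, 83,
]

width = 2                                # chosen interval width (i)
score_range = max(scores) - min(scores)  # highest score minus lowest score

# Start each interval at a multiple of the width.
lowest_start = (min(scores) // width) * width
highest_start = (max(scores) // width) * width

# Build continuous intervals of equal width, highest interval first.
frequencies = {}
for start in range(highest_start, lowest_start - 1, -width):
    end = start + width - 1
    frequencies[(start, end)] = sum(start <= score <= end for score in scores)

for (start, end), frequency in frequencies.items():
    print(f"{start}-{end}  {frequency}")
print(f"N = {sum(frequencies.values())}")
```

Intervals with no scores, such as 68-69, are still listed with a frequency of 0, in line with the guideline that intervals must be continuous.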
In real life situations it is possible that a student might score in decimals, for instance,
70.5. In such cases it becomes difficult to place the score within an interval containing discrete
values.
Real limits: They extend from half a unit below the smallest value in the interval to half a unit above the largest. For instance,
the apparent limits of 70-71 can be converted into real limits of 69.5 (the real lower
limit) and 71.5 (the real upper limit). It is important to keep in mind that scores cannot
fall on a real limit, as it is calculated by taking half of the smallest unit of measurement.
Apparent limits: They extend from the smallest value in the interval to
the largest, expressed in the unit of measurement.
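The conversion from apparent limits to real limits can be sketched as a small Python function. The function name and the `unit` parameter are illustrative; `unit` stands for the smallest unit of measurement, which is 1 for whole-number scores.

```python
def real_limits(apparent_lower, apparent_upper, unit=1):
    """Return the real lower and upper limits of a class interval.

    Half of the smallest unit of measurement is subtracted from the
    apparent lower limit and added to the apparent upper limit.
    """
    half_unit = unit / 2
    return apparent_lower - half_unit, apparent_upper + half_unit

lower, upper = real_limits(70, 71)
print(f"Apparent limits 70-71 -> real limits {lower}-{upper}")

# A decimal score such as 70.5 can now be placed within the interval.
score = 70.5
print(f"{score} falls in this interval: {lower <= score < upper}")
```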
Table 5: Real Upper and Lower Limits for the Distribution of Scores
In-Text Questions
12. ______ extends from half a unit below the smallest value to half a unit above the largest.
13. ________ shows the proportion of scores within an interval.
14. _______ helps us in answering questions such as how many students scored
lower than the upper real limit of each class interval.
1.7.1 Histogram
A histogram consists of a group of rectangles where the vertical sides are on the real
lower and upper limits of an interval and the width of each rectangle is the same as the
width of the interval it represents. The height of the rectangle, on the other hand, is the
frequency of the interval it represents. The frequencies can be raw as well
as relative frequencies.
65 70 73 85 78 74
66 71 73 86 79 83
66 71 75 80 82 82
70 72 75 80 80 86
70 72 76 73 82 83
Class Interval    Real Limits    Frequency (f)    Cumulative Frequency (cf)
84-86             83.5-86.5      3                30
81-83             80.5-83.5      5                27
78-80             77.5-80.5      5                22
75-77             74.5-77.5      3                17
72-74             71.5-74.5      6                14
69-71             68.5-71.5      5                8
66-68             65.5-68.5      2                3
63-65             62.5-65.5      1                1
N = 30
Step 3: Next, draw the adjoining rectangles, where the width is equal to the class
interval and the height is equal to the frequency. The scores represented on the x-axis can be the
mid-point values of the intervals, or the real lower and upper limits can be used as the
edges of the rectangles.
Step 4: Finally, assign labels to the x-axis and y-axis.
A frequency polygon is a graph made by connecting dots placed at the mid-points of the class intervals.
The x-axis represents the scores, and the y-axis represents the frequency of scores in each
interval. In order to construct a frequency polygon, the following steps can be taken:
Step 1: Create a frequency distribution from the raw data and identify the mid-points of
the intervals.
Step 2: Put the mid-points on the x-axis and decide on the scale of the x-axis and y-axis.
Step 3: Place the dots on the graph at the intersection of the frequency and mid-point
of the interval and join all the dots with straight lines.
Step 4: Label the axes.
curve is that it usually appears as an S-shaped figure, which is also referred to as an
ogive curve.
[Figure: cumulative percentage curve; y-axis: Cumulative Percentage; marked point: P50 at Score = 75.5]
The advantage of having this type of curve is that it can help us in determining
percentile points or percentile ranks. This can be done by drawing a horizontal line from the
desired percentage point to the curve and then drawing a vertical line from the point of
intersection to the x-axis. This method gives accurate results provided the scaling is done
properly. For example, P50 = 75.5 (see Section 1.8).
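The value read from the curve can be checked numerically. The sketch below uses the real limits and frequencies of the class intervals 63-65 to 84-86 given earlier in this section, and finds a percentile point by linear interpolation within the interval that contains it. The interpolation formula is the usual one for grouped data and is an assumption here, since the solved illustration is not reproduced in this text.

```python
# (real lower limit, real upper limit, frequency), lowest interval first.
intervals = [
    (62.5, 65.5, 1),
    (65.5, 68.5, 2),
    (68.5, 71.5, 5),
    (71.5, 74.5, 6),
    (74.5, 77.5, 3),
    (77.5, 80.5, 5),
    (80.5, 83.5, 5),
    (83.5, 86.5, 3),
]

def percentile_point(intervals, percent):
    """Score below which the given percentage of cases fall (grouped data)."""
    total = sum(frequency for _, _, frequency in intervals)
    target = percent / 100 * total  # number of cases that must fall below
    cases_below = 0                 # cumulative frequency below the interval
    for lower, upper, frequency in intervals:
        if frequency > 0 and cases_below + frequency >= target:
            # Interpolate linearly within the interval containing the point.
            return lower + (target - cases_below) / frequency * (upper - lower)
        cases_below += frequency
    return intervals[-1][1]

# Cumulative percentage of cases below each upper real limit.
total_cases = sum(frequency for _, _, frequency in intervals)
running = 0
for lower, upper, frequency in intervals:
    running += frequency
    print(f"Below {upper}: {running / total_cases:.1%}")

p50 = percentile_point(intervals, 50)
print(f"P50 = {p50}")
```

With these data the interpolated value agrees with the P50 of 75.5 read from the curve.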
In-Text Questions
15. ______ consists of a group of rectangles where the vertical sides are on the real lower
and upper limits.
16. ______ is made by connecting the mid-points of the class intervals.
17. _______ indicates the percentage of scores falling below the upper real limit.
1.8 SOLVED ILLUSTRATIONS
1.9 SUMMARY
A percentile point, commonly referred to as a percentile, represents a point on the
measurement scale below which a specific percentage of cases fall.
Percentile rank, on the other hand, represents the percentage of cases that fall
below a given point on the measurement scale.
1.10 ANSWERS TO IN-TEXT QUESTIONS
1. Exploration
2. Prediction
3. Psychological research
4. Pure research
5. Applied
6. Laboratory
7. Quantitative
8. Qualitative
9. Nominal
10. Ordinal
11. Interval
12. Real limit
13. Relative frequency distribution
14. Cumulative frequency distribution
15. Histogram
16. Frequency polygon
17. Cumulative percentage curve
1.11 GLOSSARY
46 55 65 60 77 87
55 59 81 60 76 64
65 61 84 63 72 63
43 68 56 64 78 65
76 89 96 60 88 80
78 75 77 70 81 78
55 77 78 71 84 74
63 73 89 54 69 77
80 71 90 53 70 58
1.13 REFERENCES
Garrett, H.E. (2005). Statistics in Psychology and Education. Delhi: Cosmo Publications.
King, B.M. & Minium, E.W. (2007). Statistical Reasoning in the Behavioral Sciences (5th Ed.). Noida: Wiley.
Mangal, S.K. (2012). Statistics in Psychology and Education (2nd Ed.). Delhi: Prentice Hall of India.
Chadha, N.K. (2009). Applied Psychometry. New Delhi: Sage Publications.
Chadha, N.K. (1991). Statistics for Behavioral and Social Sciences. New Delhi: Reliance Publishing House.
Chadha, N.K. & Sehgal, R.L. (1984). Statistical Methods in Psychology. New Delhi: ESS Publications.
Aron, A., Aron, E.N., & Coups, E.J. (2007). Statistics for Psychology (4th Ed.). Delhi: Prentice Hall of India.
Howitt, D. & Cramer, D. (2011). Introduction to Statistics in Psychology. London, UK: Pearson Education Ltd.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680. https://doi.org/10.1126/science.103.2684.677
LESSON 2
CENTRAL TENDENCY
Dr. Poonam Phogat
Associate Professor, Gargi College
University of Delhi
Dr. Shweta Chaudhary
Assistant Professor, Gargi College
University of Delhi
Structure
2.1 Learning Objectives
2.2 Introduction
2.3 Measures of Central Tendency: Definition, Properties and Comparison
2.3.1 Mean, Median, and Mode
2.3.2 Comparison of Mean, Median and Mode
2.4 Calculation of Mode, Median and Mean from Raw Scores
2.5 Effects of Linear Score Transformations on Measures of
Central Tendency
2.6 Measures of Variability Range; Semi-Interquartile Range; Variance;
Standard Deviation (Properties and Comparison)
2.6.1 Range and Semi - Interquartile
2.6.2 Variance
2.6.3 Standard Deviation
2.6.4 Quartile Deviation
2.7 Calculation of Variance and Standard Deviation
2.8 Effects of Linear Score Transformations on Measures of Variability
2.9 Summary
2.10 Answers to In-Text Questions
2.11 Glossary
2.12 Self-Assessment Questions
2.13 References
2.14 Suggested Readings
2.1 LEARNING OBJECTIVES
2.2 INTRODUCTION
population. Research in psychology involves the testing of hypotheses, which is done
on the basis of analysis of the collected data and accurate inferences. For example,
you conduct a survey outside a Fab India store in a mall, asking 100 people if they like
shopping at the store.
Psychologists and researchers have many statistical techniques available to
analyse data; however, they should carefully choose the technique they use.
Each technique has its own assumptions and limitations, and it should be critically
evaluated before it is selected. The appropriate use of statistics requires
psychologists or researchers to study statistical theory in depth and critically analyse
the research question being addressed. The aim of the research question, the hypothesis
and the available resources will help psychologists and researchers choose the best statistical
technique. We have discussed only some of the statistical techniques here, to keep the
discussion specific to the topic.
Measures of central tendency, also known as measures of central location, are statistical values
that indicate the central location of a set of data in a distribution. It is important to
know that different measures of central tendency may suit different datasets, and the
choice of the measure of central tendency should be based on the purpose or
aim of the research.
2. It is useful in the case of a dataset that has a large number of repeated values. It provides a
clear indication of the most common value.
3. Mode may not represent the central tendency in some distributions.
2.4 CALCULATION OF MODE, MEDIAN AND MEAN FROM RAW SCORES
In this section of the chapter, we will look at the formula for calculating the mean,
median and mode from raw scores.
Calculation of mean
Mean or average is the sum of all the values in the dataset divided by the total number
of values. The formula for the calculation of the mean is:
μ = (Σx) / N
where Σx is the sum of all values in the set, and N is the number of items in the set.
Here is an example for a small data set:
Raw scores: 8, 3, 6, 5, 11, 5, 2
Mean of the dataset: (8 + 3 + 6 + 5 + 11 + 5 + 2)/7 = 40/7 = 5.71
Calculation of Median
To calculate the median, first arrange the set in ascending order (smallest to largest).
If there is an odd number of values, the median is the middle number. If there is an even
number of values, it is the average of the two middle values.
Raw data scores: 8, 3, 6, 5, 11, 5, 2
Ordered data set: 2, 3, 5, 5, 6, 8, 11
The median of this data set is 5, as it is the middle value. There are 3 values both
before and after it.
Let’s look at the dataset again with one less data value:
Raw data scores: 8, 3, 6, 11, 5, 2
Ordered dataset: 2, 3, 5, 6, 8, 11
As there is an even number of values, the median is the average of 5 and 6 (the
two middle numbers), which is (5+6)/2 = 5.5
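These calculations can be verified with Python's standard `statistics` module, as in the following minimal sketch. It uses the raw scores from the examples above; the mode (the most frequent value, which is 5 here because it occurs twice) is included for completeness.

```python
import statistics

raw_scores = [8, 3, 6, 5, 11, 5, 2]

mean_score = statistics.mean(raw_scores)      # sum of the scores divided by N
median_score = statistics.median(raw_scores)  # middle value of the ordered scores
mode_score = statistics.mode(raw_scores)      # most frequently occurring score

print(f"Mean = {mean_score:.2f}, Median = {median_score}, Mode = {mode_score}")

# The same dataset with one value removed, so that N is even.
shorter_scores = [8, 3, 6, 11, 5, 2]
median_even = statistics.median(shorter_scores)  # average of the two middle values
print(f"Median with an even number of values = {median_even}")
```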
Example 1:
Let us try to calculate the mean, median and mode for a frequency table:
Age in years Number of boys
13 5
11 3
9 7
14 6
7 4
Age in years (x)    Number of boys (f)    xf
13                  5                     65
11                  3                     33
9                   7                     63
14                  6                     84
7                   4                     28
Total               25                    ∑xf = 273
Mean = ∑xf / N
= 273/25
= 10.92
To calculate the median:
Cumulative frequency is the number of observations that lie at or below (or above) a
particular value in a data set. Each subsequent frequency is
added, or accumulated, to obtain the cumulative frequency of the next interval. Because the
median is a positional measure, the ages must first be arranged in ascending order
(7, 9, 11, 13, 14) before the frequencies are accumulated, which gives cumulative
frequencies of 4, 11, 14, 19 and 25.
Here, total frequency N = ∑f = 25
Position of the median = N/2 = 25/2 = 12.5
In cases where this position does not appear exactly among the cumulative
frequencies, the next higher cumulative frequency is taken. In this case,
the smallest cumulative frequency greater than 12.5 is 14, which corresponds to
age 11; therefore the median is 11.
The mode for ungrouped data shown in a frequency distribution table is the score with
the highest frequency; for grouped data it is taken as the mid-point of the class interval
with the highest frequency. The mode for this dataset is 9, as it occurs most frequently (7 times).
Example 2:
The table below shows the ratings of 0-3 given to a TV show in response to
a survey by 50 viewers of a popular OTT platform.
Mean = ∑xf / N
= 68 / 50
= 1.36
This means that the average rating for the show was 1.36.
Median:
Rating (x)    No. of viewers’ responses (f)    Cumulative Frequency
0             8                                8
1             20                               8+20=28
2             18                               28+18=46
3             4                                46+4=50
N = 50
Position of the median = N/2
= 50/2
= 25
The smallest cumulative frequency greater than 25 is 28. So the median is the
25th value, which is 1.
The mode for this table is also 1, as it is the value that occurs most frequently in the
dataset.
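The calculations in Example 2 can be sketched in Python directly from the frequency table. The ratings and frequencies are those given in the table above; the variable names are illustrative.

```python
# Rating (x) -> number of viewers who gave that rating (f), from Example 2.
rating_frequencies = {0: 8, 1: 20, 2: 18, 3: 4}

total_viewers = sum(rating_frequencies.values())  # N

# Mean: sum of x * f divided by N.
sum_xf = sum(rating * frequency for rating, frequency in rating_frequencies.items())
mean_rating = sum_xf / total_viewers

# Median: accumulate frequencies over the ratings in ascending order and
# stop at the first cumulative frequency that reaches N / 2.
# (If a cumulative frequency equalled N / 2 exactly for an even N, the median
# would be the average of that rating and the next one; not the case here.)
median_position = total_viewers / 2
cumulative_frequency = 0
median_rating = None
for rating in sorted(rating_frequencies):
    cumulative_frequency += rating_frequencies[rating]
    if cumulative_frequency >= median_position:
        median_rating = rating
        break

# Mode: the rating with the highest frequency.
mode_rating = max(rating_frequencies, key=rating_frequencies.get)

print(f"Mean = {mean_rating}, Median = {median_rating}, Mode = {mode_rating}")
```

Note that the ratings are processed in ascending order before the frequencies are accumulated, since the median depends on the position of a value in the ordered data.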
In-Text Questions
1. Statistics play a key role in psychology:
a) as it helps interpret data b) as it helps analyze data
c) as it helps us summarize data d) All of the above
2. A dataset of 4 test scores of a student is collected. The average score for the
student was calculated. This is an example of:
a) Inferential statistics b) Correlation statistics
c) Descriptive statistics d) None of the Above
3. Measures of central tendency and measure of variability are used in:
a) Inferential statistics
b) Descriptive statistics
c) Both inferential and descriptive statistics
d) None of the Above
4. All datasets will always have a mode.
a) True
b) False
A linear score transformation changes every score in a dataset in the same way, and therefore
also changes the measures of central tendency. Linear score
transformations refer to the mathematical operations that are performed on a set of
scores or data values in order to change their scale or location. A linear transformation is
done using the formula:
Y = aX + b
Where,
X is the original score,
Y is the transformed score,
a is the scaling factor, and
b is the shift factor
The effects of linear score transformations on various measures of central tendency
are as follows:
Mean: The mean of the dataset changes under a linear transformation.
The mean of the transformed scores is equal to the original mean multiplied by the
scaling factor a plus the shift factor b.
Median: The median is transformed in the same way as the raw scores, that is, the new
median equals a times the original median plus b.
Mode: The mode is also transformed in the same way as the raw scores, so whether its value
changes depends on the values of a and b (it is unchanged, for example, when a = 1 and b = 0).
X Y (+4)
1 5
3 7
4 8
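The effect of Y = aX + b on the three measures can be checked with a short Python sketch. The scores below are hypothetical (a repeated value is included so that the mode is well defined), and `transform` is an illustrative helper name, not something defined in the lesson.

```python
from statistics import mean, median, mode

def transform(scores, a, b):
    """Apply the linear transformation Y = aX + b to every score."""
    return [a * x + b for x in scores]

x = [1, 3, 3, 4]             # hypothetical raw scores
y = transform(x, a=1, b=4)   # add 4 to every score
w = transform(x, a=2, b=4)   # double every score, then add 4

for label, scores in (("X", x), ("X + 4", y), ("2X + 4", w)):
    print(label, mean(scores), median(scores), mode(scores))
```

In each case the mean, median and mode of the transformed scores equal a times the original value plus b.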
Range, semi-interquartile range, variance and standard deviation are all measures of variability in statistics.
Range is the difference between the largest and smallest values in the set. It gives a rough idea of how spread out the values are in a dataset.
Range = Highest Score (HS) - Lowest Score (LS)
For example, for a dataset of [1, 5, 6, 10] the range is 10 - 1 which is 9.
Semi-interquartile range is defined as half of the difference between the 75th percentile (upper quartile, Q3) and the 25th percentile (lower quartile, Q1) of the data, that is, half of the interquartile range (IQR = Q3 - Q1). It is a robust measure of variability that is less sensitive to outliers than the range.
2.6.2 Variance
Variance is a measure of the spread of the data around the mean. It is calculated as the average of the squared deviations of each data point from the mean. Variance is expressed in squared units and can be difficult to interpret, but it is a key component in the calculation of standard deviation. The term variance was introduced by R.A. Fisher in 1918 to describe the square of the standard deviation. The variance (s²) or mean square (MS) is the arithmetic mean of the squared deviations of individual scores from their mean. Variance is expressed as V = SD².
The merits of variance are:
1. It is calculated on all observations and hence is more accurate.
2. Further algebraic calculations can be carried out on variance.
3. It is little affected by sampling fluctuations.
4. It does not fluctuate easily.
The demerits of variance are:
1. Since it uses the raw scores directly it may be lengthy and tedious to calculate.
2. It gives greater weight to extreme values.
The square root of the variance is the standard deviation. It is a measure of the dispersion or spread of the data around the mean. The term standard deviation was first used in writing by Karl Pearson in 1894. Standard deviation is denoted by the symbol ‘σ’ (Greek letter sigma). It is the most widely used measure of variability. The standard deviation indicates the typical distance of the scores from the mean. A mean with a smaller standard deviation is more reliable than a mean with a large standard deviation. A smaller SD shows greater homogeneity of the data.
The merits of SD are:
1. Since it is the best measure of variation, it is widely used.
2. It is calculated using all the observations of the data.
3. It gives an accurate estimate of population parameter when compared with
other measures of variation.
4. SD is stable and is little affected by sampling fluctuations.
5. It is also possible to calculate a combined SD, which is not possible with other measures.
6. Further statistics can be applied on the basis of SD like, correlation, regression,
tests of significance, etc.
7. Coefficient of variation is based on mean and SD. It is the most appropriate
method to compare variability of two or more distributions.
The limitations of SD are:
1. SD gives more weight to outliers and extreme scores.
2. It is difficult to compute as compared to other measures of dispersion.
Uses of Standard Deviation:
1. SD is used when a more reliable and accurate measure of variability is needed.
2. It is used when further statistics like correlation, regression, tests of significance, etc. have to be computed.
When the measures are compared, the range provides only a rough idea of the spread of the data, the semi-interquartile range is less sensitive to outliers, and the standard deviation provides the most precise measure because it uses every observation. The variance is an intermediate step in the calculation of the standard deviation and is used in many statistical tests and procedures.
1. It is appropriate when the distribution contains a few very extreme scores.
2. It is used when the median is the measure of central tendency.
3. Also, it is used when our primary interest is to determine the concentration
around the median.
Here’s an example of how to calculate the range and semi-interquartile range for a
data set:
Suppose the data set is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Note that the set is already ordered in ascending order.
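The range and semi-interquartile range for this data set can be sketched in Python as follows. This is an illustration under one assumption: quartiles are located with the default ("exclusive") method of `statistics.quantiles`. Textbooks differ in their quartile conventions, so other methods may give slightly different values for other data sets.

```python
from statistics import quantiles

data = list(range(1, 16))            # [1, 2, ..., 15]

data_range = max(data) - min(data)   # highest score minus lowest score

# Quartiles using the default "exclusive" method (an assumed convention)
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1                        # interquartile range
siqr = iqr / 2                       # semi-interquartile range

print(data_range, q1, q3, iqr, siqr)
```

With this convention Q1 = 4 and Q3 = 12, so the interquartile range is 8 and the semi-interquartile range is 4, while the range is 14.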
Here’s an example of how to calculate the variance and standard deviation for a data
set:
Suppose the data set is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Note
that the set is already ordered in ascending order.
Variance: The variance is a measure of the spread of the data set. The formula for the variance is:
Variance = Σ(x - x̄)² / n
where x is each value in the data set, x̄ is the mean (i.e. the average) of the data set, and n is the number of values in the data set.
Mean of this dataset is:
(1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15)/15 = 120/15 = 8
Variance = ((1-8)² + (2-8)² + (3-8)² + (4-8)² + (5-8)² + ... + (15-8)²) / 15
= 280 / 15
= 18.67
Standard Deviation: The standard deviation is the square root of the variance. The formula for the standard deviation is:
Standard Deviation = √Variance
In this case, the standard deviation is √18.67 = 4.32.
Let’s look at one more example: Dataset: [1, 4, 6, 6, 8, 9, 11, 12, 14, 15, 19, 34, 35]
Note that the set is already in ascending order. If that is not the case for the data set given, always order the data values in ascending order.
Mean = 174 / 13 = 13.38
Variance = Σ(x - mean)² / n
= ((1-13.38)² + (4-13.38)² + (6-13.38)² + (6-13.38)² + ... + (35-13.38)²) / 13
= 1333.08 / 13 = 102.54
Standard Deviation = √Variance
= √102.54 = 10.13
Here is another example of how to calculate the range, semi-interquartile range, variance, and standard deviation for a data set:
Suppose the data set is [3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 12, 13, 15].
Mean = (3 + 4 + 4 + 5 + 6 + 7 + 8 + 8 + 9 + 10 + 11 + 12 + 12 + 13 + 15) / 15 = 127 / 15 = 8.47
Variance = ((3-8.47)² + (4-8.47)² + ... + (15-8.47)²) / 15 = 187.73/15 = 12.52
Standard Deviation = √Variance = √12.52 = 3.54
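The three worked examples above can be verified with Python's standard library. This is only a sketch: `pvariance` and `pstdev` divide by n (the population formulas), which matches the formula Variance = Σ(x - x̄)² / n used in this lesson; the dictionary keys are illustrative labels.

```python
from statistics import mean, pvariance, pstdev

datasets = {
    "first": list(range(1, 16)),
    "second": [1, 4, 6, 6, 8, 9, 11, 12, 14, 15, 19, 34, 35],
    "third": [3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 12, 13, 15],
}

for name, data in datasets.items():
    m = mean(data)          # arithmetic mean
    var = pvariance(data)   # mean of squared deviations (divides by n)
    sd = pstdev(data)       # square root of the variance
    print(name, round(m, 2), round(var, 2), round(sd, 2))
```

Because the library works with the unrounded mean, small differences from hand calculations that round the mean first are to be expected.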
2.8 EFFECTS OF LINEAR SCORE
TRANSFORMATIONS ON MEASURES OF
VARIABILITY
In terms of comparison, the effect of linear score transformations (Y = aX + b) on the measures of central tendency and variability can be summarized as follows:
Adding a constant b to every score shifts the mean, median and mode by b, but leaves the range, semi-interquartile range, variance and standard deviation unchanged.
Multiplying every score by a constant a multiplies the mean, median and mode by a; the range, semi-interquartile range and standard deviation are multiplied by the absolute value of a, and the variance by a².
The mean and standard deviation are more sensitive to outliers compared to the median and semi-interquartile range.
The semi-interquartile range is less affected by outliers than the range, variance and standard deviation, because it ignores the extreme quarters of the distribution.
In summary, the effect of linear score transformations on measures of variability depends on the specific transformation applied, and it is important to understand how the transformed scores will affect the results of any statistical analysis performed on the data.
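A quick numerical check of how adding and multiplying by a constant change the measures of variability is sketched below. The data set [1, 5, 6, 10] is the one used earlier for the range; the constants 4 and 3 and the helper name `score_range` are arbitrary choices made for illustration.

```python
from statistics import pstdev, pvariance

def score_range(scores):
    """Highest score minus lowest score."""
    return max(scores) - min(scores)

x = [1, 5, 6, 10]
shifted = [v + 4 for v in x]   # Y = X + 4 (a = 1, b = 4)
scaled = [3 * v for v in x]    # Y = 3X    (a = 3, b = 0)

for label, scores in (("X", x), ("X + 4", shifted), ("3X", scaled)):
    print(label, score_range(scores), pstdev(scores), pvariance(scores))
```

Adding 4 leaves all three measures unchanged, while multiplying by 3 triples the range and standard deviation and multiplies the variance by 9.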
In-Text Questions
10. The dataset provides the weight of the bags filled by a worker in every 2
hours:
39,43,36,38,46,51,33,44,44,43.
Find the mode of this data set. Are there more than 1 mode? If so, why?
11. Calculate the range, IQR, variance and standard deviation for this dataset:
2,4,5,6,12,14,15,21,23,25
12. What is the mode of the data:
44, 42, 35, 37, 45, 50, 32, 43, 43, 40, 36, 44, 43, 44, 47
Does the data have more than one mode?
13. If Q1 = 10, Q2 = 12 and Q3 = 21, what is the IQR?
14. Range helps give a rough idea of how spread the data is:
a) True b) False
15. _____________ is the intermediate step in the calculation of standard deviation
16. Range and semi-interquartile range are measures of variability that are less
affected by outliers:
a) True b) False
17. Range, IQR, Variance and Standard Deviation are:
a) Measure of central tendency b) Measure of variability
c) Both d) None of the above
2.9 SUMMARY
1. d
2. c
3. b
4. b
2.11 GLOSSARY
2.12 SELF-ASSESSMENT QUESTIONS
Q1. Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
Q2. Why do some datasets have multiple modes while some datasets have no mode?
Q3. Which measure of central tendency is affected by outliers and which is not
affected?
Q4. Define Median. How is median calculated for even set of values?
Q5. Explain semi-interquartile range. How is IQR calculated?
Q6. What do you understand by linear transformation? Which measures of central tendency and variability are not affected by linear transformations?
Q7. What do you understand by outliers?
Q8. Calculate the range, IQR, variance and standard deviation for the following
values:
10, 15, 25, 26, 26, 29, 30, 35, 40, 45
2.13 REFERENCES
Aron, A., Coups, E.J. & Aron, E.N. (2013). Statistics for Psychology (6th Ed.).
Pearson Education.
Asthana, H.S. & Bhushan, Braj (2007). Statistics for social sciences (with SPSS
applications). New Delhi: Prentice Hall of India.
Field, A. (2009). Discovering Statistics using SPSS (3rd Ed). New Delhi: Sage.
Garrett, H.E. (2005). Statistics in Psychology and Education. Paragon International
Publishers
Howitt, D. & Cramer, D. (2011). Introduction to Statistics in Psychology (5th Ed.). Pearson Education.
King, B.M., Rosopa, P.J., & Minium, E.W. (2011). Statistical Reasoning in the Behavioral Sciences (6th Ed.).
Mangal, S.K. (2010). Statistics in Psychology and Education (2nd Ed.). PHI Learning.
Mohanty, B. & Misra, S. (2015). Statistics for behavioral and social sciences. New
Delhi: Sage Publications.
N.K. Chadha (2009) Applied Psychometry. Sage Pub: New Delhi
N.K. Chadha (1991) Statistics for Behavioral and Social Sciences. Reliance Pub.
House: New Delhi
N.K. Chadha and R.L. Sehgal (1984) Statistical Methods in Psychology, ESS
Publications: New Delhi
LESSON 3
STANDARD SCORES
Dr. Poonam Phogat
Associate Professor, Gargi College
University of Delhi
Dr. Shweta Chaudhary
Assistant Professor, Gargi College
University of Delhi
Structure
3.1 Learning Objectives
3.2 Introduction to Standard (z) Scores
3.3 Properties of z-scores
3.4 Transforming Raw Scores into z-scores
3.5 Determining Raw Scores from z-scores
3.6 Some Common Standard Scores
3.6.1 T-score
3.6.2 Stanine-score
3.6.3 STEN-score
3.7 Computations of Percentiles and Percentile Ranks from Grouped Data
3.7.1 Calculating Percentiles from Grouped Data
3.7.2 Calculating Percentile Ranks from Grouped Data
3.8 Comparison of z-scores and Percentile Ranks
3.9 The Normal Probability Distribution: Nature, Properties and Applications
3.9.1 Nature of the Normal Distribution
3.9.2 Properties of the Normal Distribution
3.9.3 Applications of the Normal Distribution
3.10 Normal Curve and Standard Scores
3.10.1 Finding Areas when the Score is known and Finding Scores when the Area is known
The standard score is the deviation of a score from the mean of its distribution, expressed in standard deviation units. It is also known as the z-score. The standard score or z-score tells us how far a score lies from the mean of its distribution, and whether it is greater than or less than the mean. A positive standard score means that the value of the score is larger than the given mean, whereas a negative standard score means the value of the score is smaller than the given mean in a distribution. In different types of psychological tests, standard scores allow us to compare scores between two or more data sets. Standard scores help us to get an accurate and consistent comparison in relation to the mean or average of the scores.
For example, suppose a teacher wants to assess a student’s performance in two different subjects, Psychology and Education, in their final exams. The teacher cannot simply compare the marks in both subjects, because each class may differ in size and in how its marks are distributed. Let’s say the particular student got 80 in Psychology and 65 in Education; the teacher cannot simply say that the student is performing better in Psychology compared to Education.
If most of the students from the Psychology class were getting marks around 80, then
the particular student’s performance is at par with the average performance of the
class, but if most of the students got lower marks in Education class in comparison to
this particular student then it would mean that the student is performing well in this
subject and he is one of the top scorers of that subject. It is for such scenarios that we
need to look at the standard score of the student in comparison to the entire class to
gain a realistic understanding of the student’s performance.
Since standard scores are deviations from the mean of a particular distribution expressed in standard deviation units, the mean of a set of standard scores is always zero and their standard deviation is always one. The standard score or z-score is always interpreted in terms of its positive or negative distance from the mean. These values give an accurate picture of the position of the scores in a particular dataset. Since raw scores from different distributions cannot be compared directly, we need to convert these raw scores to standard scores.
1. Converting raw scores into standard or z-scores in a data distribution does not
impact the characteristics of the distribution.
2. Standard score and z-score help us analyse and compare scores from two
different distributions.
1. In order to depict the positive and negative positions of the standard scores
from the mean, plus and minus signs are used. This can be confusing and
misleading.
2. Decimals used in standard score or z-score can create confusion.
In this section we will look at how to convert raw scores into z-scores using a step-by-step process. In order to convert a set of raw scores in a given dataset to standard scores or z-scores, we can follow the steps mentioned below:
1. First, we need to calculate the mean and standard deviation of a particular distribution.
2. Then, we need to substitute the value of both the mean and standard deviation in the given formula to get the standard score or z-score.
z = (X - M) / σ
(X = raw score, M = mean of the scores, and σ = standard deviation of the scores)
Let’s look at an example to calculate the z-score:
There are two sections, namely, A and B, in the B.A. (H) Psychology second year class. To test student achievement in the social psychology paper, a different exam was conducted for each section. Rita from Section A scored 70 marks while Surbhi from Section B scored 80 marks. We would now try to determine which student, Rita or Surbhi (who belong to different class sections), performed better in the social psychology exam.
Just by looking at both students’ marks, we cannot say that Surbhi performed better than Rita because she got 80 in her social psychology paper compared to Rita who got 70. The paper for section B might have been easier than that for section A, or the two question papers might have been fundamentally different, with one based on objective questions and the other on descriptive questions. For this reason, it would be unfair to compare the raw scores obtained by Rita and Surbhi, as they do not belong to the same scale.
In order to do a correct comparison, we need to convert their raw score, which
is their exam marks into z-scores for comparison.
The mean and standard deviation of section A and section B are given below:
Section A - Mean - 50, standard deviation - 10
Section B - Mean - 70, standard deviation - 20
By using the z-score formula,
z = (R - M) / σ
R = Raw score obtained by the student.
M = Mean of the exam performances.
σ = standard deviation of the distribution of scores in the given test.
z-score for Rita = (70 - 50)/10
= 2.0
Similarly, z-score for Surbhi = (80 - 70)/20
= 0.5
We can now compare the z-scores and conclude that Rita performed better in social psychology with a z-score of 2.0, as compared to Surbhi who has a z-score of 0.5.
Let’s look at another example. You score 180 on a test. The mean (µ) for the test was 140 and the standard deviation (σ) was 20. Based on the assumption that the values had a normal distribution, the z-score for you would be:
z = (x - µ) / σ
= (180 - 140)/20
= 2
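The z-score formula used in these examples can be written as a small Python function. This is a minimal sketch; the function name `z_score` is an illustrative choice, and the values are the ones from the worked examples.

```python
def z_score(raw, mean, sd):
    """Return the z-score: (raw score - mean) / standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd

rita = z_score(70, mean=50, sd=10)       # Section A
surbhi = z_score(80, mean=70, sd=20)     # Section B
test_taker = z_score(180, mean=140, sd=20)

print(rita, surbhi, test_taker)
```

Rita's z-score (2.0) is higher than Surbhi's (0.5), even though her raw mark is lower.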
In the previous section we looked at how to calculate the z-score from the raw score.
In this section we are going to look at how to calculate the raw score if we have been
given the z-score. The formula for calculating the raw score from z-score is:
x = µ + (z × σ)
(Where µ equals the mean, z equals the z-score, and σ equals the standard
deviation)
Now let’s do it in an example:
Let’s say in Delhi, the mean/average income for a household is 500000 rupees (annually) with a standard deviation of 6000 rupees. If a household has a z-score of 2, what is the annual income of this household?
To solve this, we will use the raw score formula:
x = µ + zσ
(Where µ equals the mean, Z equals the z-score, and σ equals the standard
deviation)
X = 500000 + (2*6000)
= 500000 + 12000
= 512000
Therefore, the annual income of this household is 512000 rupees.
Student A:
x= µ + zσ
= 75+ (-3)*6
= 75-18
= 57
Student B:
x= µ + zσ
= 75 + (1*6)
= 75 + 6
= 81
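Going back from a z-score to a raw score can be sketched the same way. The function name `raw_score` is illustrative; the numbers are taken from the household income example and from the Student A and Student B calculations above (mean 75, standard deviation 6).

```python
def raw_score(z, mean, sd):
    """Return the raw score x = mean + z * sd."""
    return mean + z * sd

income = raw_score(2, mean=500000, sd=6000)   # household income example
student_a = raw_score(-3, mean=75, sd=6)
student_b = raw_score(1, mean=75, sd=6)

print(income, student_a, student_b)
```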
Some of the common standard scores are the T-score, Stanine-score and STEN-score, which are used to overcome some of the limitations of the z-score. We will discuss these different standard scores in detail in this section.
3.6.1 T-score
As mentioned earlier, the use of decimal points in the z-score can create confusion and make a data distribution difficult to interpret. To overcome these limitations of the standard score or z-score, a more convenient score named the T-score may be used. This score was first used by William A. McCall. The name T was given in honour of the renowned psychologists Terman and Thorndike.
T-score is a type of standard score on a data distribution that has a mean of 50 and a standard deviation of 10. A T-score is similar to a z-score, but many people prefer the T-score because negative numbers practically never occur, which makes the interpretation easy. Further, T-scores are usually reported as whole numbers, which makes the data distribution less confusing.
To calculate T-score from raw scores, we use the formula given below:
T = 10z +50
(Where z is the z-score, remember to calculate the z-score we would need the
mean and standard deviation of the data set)
Therefore, the first step in the calculation of T- score is the analysis of the mean
and standard deviation of a given distribution. This would help us calculate the z -
score and use the above formula to calculate the T-score.
For example:
In the final year examinations of the psychology course, two students, namely, Maya and Jia scored the marks given below in the table. Out of these two students, whose overall score is better?
Table 1: Marks Scored by Maya and Jia in Social and Developmental Psychology Final Examination with their Mean Value and Standard Deviation
Subject                     Maya   Jia   Mean   Standard Deviation
Social psychology           70     80    60     10
Developmental psychology    80     70    50     20
At first glance, it may seem that Maya’s and Jia’s marks are the same, since both students scored 150 in total in the two subjects. But we cannot conclude that both did equally well, because the mean value and standard deviation of the marks are different for the two subjects.
In order to determine who scored better, we need to convert the raw marks into T-scores.
Step - 1
Conversion of Raw marks scored in two subjects by two students into T-score
(a) Social Psychology.
Maya’s T-score = 10z + 50
As we know z = (X - M )/ σ
= 10 (X - M)/ σ + 50
= 10* (70 - 60)/10 + 50
= 10*10/10 + 50
= 10 + 50
= 60
Jia’s T-score
T-score = 10z + 50
= 10 (X - M)/ σ + 50
= 10*(80 - 60)/10 + 50
= 10*20/10 + 50
= 20 + 50
= 70
(b) Developmental Psychology
Maya’s T-score = 10z + 50
= 10 (X - M)/ σ + 50
= 10*(80 - 50)/20 + 50
= 10*30/20 + 50
= 65
Jia’s T-score:10z + 50
= 10 (X- M)/ σ + 50
= 10*(70 - 50)/20 + 50
= 10*20/20 + 50
= 60
Step - 2
In the second step, we will add both student’s T-score obtained in social and
developmental psychology.
Maya’s total T-score in two subjects: 60 + 65 = 125
Jia’s total T-score in two subjects: 70 + 60 = 130
Looking at Maya’s and Jia’s combined T-scores, we can see that Jia performed better in her final exams as compared to Maya.
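The two steps of this example can be reproduced with a short Python sketch. The function name `t_score` and the dictionary layout are illustrative choices; the marks, means and standard deviations are those used in the calculations above.

```python
def t_score(raw, mean, sd):
    """T = 10z + 50, where z = (raw - mean) / sd."""
    return 10 * (raw - mean) / sd + 50

# subject -> (mean, standard deviation)
subjects = {"social": (60, 10), "developmental": (50, 20)}

# student -> subject -> raw marks
marks = {
    "Maya": {"social": 70, "developmental": 80},
    "Jia": {"social": 80, "developmental": 70},
}

totals = {}
for student, scores in marks.items():
    totals[student] = sum(
        t_score(raw, *subjects[subject]) for subject, raw in scores.items()
    )

print(totals)
```

Although both students have the same raw total of 150, the combined T-scores differ (125 for Maya and 130 for Jia).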
As is true for every type of score, T-scores have their own advantages and disadvantages, which are discussed as follows.
Advantages of T-scores:
Disadvantages of T-scores:
1. In order to get back the value of the true raw score, one needs to know about
the original raw scores mean and standard deviation value.
2. Sometimes fixed mean and standard deviation values can create difficulty in
analysis.
3.6.2 Stanine-score
The word stanine is short for standard nine. The Stanine scale consists of 9 categories, with a mean value of 5 and a standard deviation of 2. Stanine-scores can be used to transform or convert any score into a single-digit score on a nine-point scale.
Like the z-score and T-score discussed previously, this scale is used to assign a single-digit number to a test-score relative to all other test-scores in that particular group. As Stanine-scores are always whole numbers from 1 to 9, they are never negative. Likewise, a Stanine-score is never expressed with decimal points.
Stanine-scores are closely related to the normal distribution: we can picture the scale as a bell curve that has 9 divisions. These 9 divisions are numbered from 1 to 9, starting with 1 on the left-hand side and ending with 9 on the right.
For example, suppose a psychology teacher wants to convert her students’ performance in their final year exam into Stanine-scores. She has to place the marks of the students on the Stanine-score scale. For example, 10/100 corresponds to a Stanine-score of 1, whereas 90/100 corresponds to a Stanine-score of 9. Here, students who got 90 or more marks were in the top 4% of the class, whereas students who got only 10 marks out of 100 were in the bottom 4% of the class in terms of performance in the psychology subject.
Given below is the procedure to convert a raw test-score into a Stanine-score using
the previous example given above. Let’s say a psychology teacher wants to convert
100 students’ performance in the final exam into a Stanine-score scale.
First, the teacher needs to rank the scores from lowest value to the highest. In
the second step, the teacher will assign a Stanine-score to every student’s exam marks
using the Stanine scale.
Table 2: Stanine-score Ranking
Stanine-score 1 2 3 4 5 6 7 8 9
Marks    Stanine-score    Position in the class
10       1                Bottom 4%
20       2                Bottom 7%
30       3                Bottom 12%
40       4                Bottom 17%
50       5                Middle 20%
60       6                Top 17%
70       7                Top 12%
80       8                Top 7%
90       9                Top 4%
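The percentage bands of the Stanine scale (4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, 4%) can be turned into a lookup from a student's percentile rank to a Stanine-score. The sketch below is an illustration only: the function name `stanine` is hypothetical, and treating a percentile rank that falls exactly on a boundary as belonging to the lower band is an assumed convention.

```python
from bisect import bisect_left

# Cumulative upper limits of the nine bands: 4, 4+7, 4+7+12, ...
UPPER_LIMITS = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a Stanine-score from 1 to 9."""
    if not 0 <= percentile_rank <= 100:
        raise ValueError("percentile rank must lie between 0 and 100")
    return bisect_left(UPPER_LIMITS, percentile_rank) + 1

for pr in (3, 10, 50, 90, 98):
    print(pr, stanine(pr))
```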
3.6.3 STEN-score
STEN is short for standard ten. If we take a scale and divide it into 10 parts or units, we can call it the STEN-score scale. However, the 10 units do not each contain an equal 10% of the scores. The scores are distributed over the units in the shape of a bell curve, where the majority of the scores from the test lie in the middle portion, which is the average range. Only about 2% of the scores lie in the lowest and in the highest unit of the STEN-score scale.
The STEN-score scale has a mean of 5.5 and a standard deviation of around 2.
Table 4: STEN-scores and Position in the Test Scale
3 or 4 Below average
5-7 Average
8 Above average
Here, values of STEN-scores are always rounded up, so in the above example the ‘7.5’ STEN-score would be rounded up to a STEN-score of 8 and the ‘9.5’ STEN-score would be rounded up to 10.
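A STEN-score can be derived from a z-score in the same way as a T-score, using the scale's mean of 5.5 and standard deviation of 2. The sketch below assumes the conversion STEN = 2z + 5.5, rounds halves upwards as described above, and limits the result to the range 1 to 10; the function name `sten` is illustrative.

```python
import math

def sten(z):
    """Convert a z-score to a STEN-score between 1 and 10."""
    raw = 2 * z + 5.5                 # assumed: mean 5.5, SD 2
    rounded = math.floor(raw + 0.5)   # halves are rounded up (7.5 -> 8)
    return max(1, min(10, rounded))   # keep the result on the 1-10 scale

for z in (-3.0, 0.0, 1.0, 2.0):
    print(z, sten(z))
```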
STEN-scores are easy to understand and interpret, as they range from 1 to 10. They can be easily standardized to compare across different test-scores. As there are no negative values, they are less complicated to read.
The units of the STEN-score scale do not contain equal shares of the scores, and the size of the units may vary according to the test. There are also situations where a scale of 10 points is too fine, and psychological tests where fewer divisions of the scale are more suitable.
Percentile (per cent means per hundred) is used to describe the position of a participant with respect to a group and is based on the cumulative percentage frequency distribution. There are two important concepts related to percentiles: percentile points and percentile ranks.
A percentile point, commonly referred to as a percentile, represents a point below which a specific percentage of cases fall. For example, in table 8, we can see that in the verbal reasoning section of CAT, 92% of the candidates score below the score of 35. Therefore, the 92nd percentile is 35. Percentile rank, on the other hand, represents the percentage of cases that fall below a point on the measurement scale. In our example above, the percentile rank for the score 35 is 92.
Although the definitions of percentile and percentile rank are similar, the difference lies in the fact that percentile ranks can take values between 0 and 100 only, whereas a percentile point can take any value that the scores can take. In our example, the maximum value of the percentile point can be 50. Symbolically, a percentile is represented by P: the 20th percentile as P20, the 30th percentile as P30 and so on. Let us assume that one of the candidates scored 25 in data interpretation and that 74% of the test takers scored below 25; then it can be written as P74 = 25. In terms of percentile ranks, the subscript here indicates the rank, which is 74.
If we want to find 20th percentile that is P20, it implies finding a score below which
20 % of the cases fall. Since there are a total of 30 cases then 20% of 30 is 6,
therefore P20 is the point below which 6 cases fall.
In order to calculate this, we start from the bottom of the distribution. We can see that this point will fall in the interval 69.5-71.5, but we cannot yet say exactly what the score will be; we can only be sure that it falls within this interval. Another observation about this interval is that it contains a total of 5 scores, and we assume that these scores are equally distributed within the interval.
[Figure: the class interval 69.5-71.5 (width 2) divided into 5 equal parts, one for each case; P20 lies 3 cases above the lower real limit, with 2 cases above it.]
Since there are three cases below the lower real limit of 69.5, we need to move up 3 more cases to reach the 6th case, which is P20. In order to do that we need to cover 3 of the 5 equal parts of the interval. We can do this by the following calculation: (3/5) × 2 = 1.2 points, and subsequently we add this value to the lower real limit to get P20.
P20 = (3/5) × 2 + 69.5 = 70.7
The entire process can be summarized in following steps:
Step 1: Find the class interval within which the P20 falls, this is done by:
I. Finding the score below which 20% of the scores fall
II. 20% of the total observation will be calculated, 20% of 30 = 6
III. Therefore, the 6th score from the bottom falls in the interval 69.5-71.5
Step 2: Determine the number of cases from the lower real limit of the interval to
where P20 will be, in this case 3.
Step 3: Assume that the scores are equally distributed within the class interval and find the additional points above the lower real limit at which the 6th score will fall, using (3/5) × 2 = 1.2
Step 4: Finally add these points to the lower real limit to get to the percentile.
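The four steps can be collected into one Python function. This is a sketch of the interpolation described above; the function and parameter names are illustrative, and the values passed in (30 cases, lower real limit 69.5, 3 cases below the interval, 5 cases inside it, width 2) are the ones used in the P20 example.

```python
def grouped_percentile(k, n, lower_limit, cases_below, cases_in_interval, width):
    """Return the k-th percentile point by interpolating within a class interval.

    lower_limit       : lower real limit of the interval containing the percentile
    cases_below       : number of cases below that lower real limit
    cases_in_interval : number of cases inside the interval
    width             : width of the class interval
    """
    target = k * n / 100                   # Step 1: case number to reach
    needed = target - cases_below          # Step 2: cases to cover in the interval
    extra = needed / cases_in_interval * width   # Step 3: additional points
    return lower_limit + extra             # Step 4: add to the lower real limit

p20 = grouped_percentile(20, n=30, lower_limit=69.5,
                         cases_below=3, cases_in_interval=5, width=2)
print(round(p20, 1))
```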
Suppose we want to calculate the percentile rank for the score of 79. We can find this
out by first determining the interval within which this score falls, which is 77.5-79.5. In
order to get to 79, we need to add 1.5 to the real lower limit of the interval,
77.5 + 1.5 = 79. There are 2 scores in this interval and the interval width is also 2. If we assume that the two scores are equally distributed, we must therefore come up (1.5/2) × 2 = 1.5 cases from the bottom of the interval, and since there are 25 scores below
the real lower limit, the point is 25 + 1.5 = 26.5 cases up from the bottom of the
distribution. Finally, 26.5/30 = .883 or 88.3%. Therefore, the score of 79 is at a point
below which 88.3% of the cases fall.
The entire process can be summarized mathematically as:
Percentile rank of the score 79 = [25 + (1.5/2) × 2] / 30 × 100 = 88.3
To better understand the process, let’s calculate the percentile rank for the score of 74 using the above mathematical notation.
Percentile rank of the score 74 = [15 + (0.5/2) × 5] / 30 × 100 = 54.17
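The percentile rank calculation can likewise be sketched as a Python function. The names are illustrative; the two calls use the values from the examples for the scores 79 and 74.

```python
def grouped_percentile_rank(score, n, lower_limit, cases_below,
                            cases_in_interval, width):
    """Return the percentile rank of a score from grouped data.

    The cases inside the interval are assumed to be equally distributed.
    """
    # Cases inside the interval that lie below the score
    partial = (score - lower_limit) / width * cases_in_interval
    return (cases_below + partial) / n * 100

pr_79 = grouped_percentile_rank(79, n=30, lower_limit=77.5,
                                cases_below=25, cases_in_interval=2, width=2)
pr_74 = grouped_percentile_rank(74, n=30, lower_limit=73.5,
                                cases_below=15, cases_in_interval=5, width=2)
print(round(pr_79, 1), round(pr_74, 2))
```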
In-Text Questions
1. From the data given in table 7, calculate the following:
a) P40
b) Percentile rank for 82
3.8 COMPARISON OF Z-SCORES AND PERCENTILE
RANKS
Scores obtained 30 50 60 75
Rank 1 2 3 4
Step - 2
By using the percentile formula given above, we will get the percentile rank of the
student who scored 60.
P = (n/N) *100
= (3/4) *100
= 75
Hence, the student who scored 60 has a percentile rank of 75.
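The rank-based formula P = (n/N) × 100 can be sketched in Python as below. The function name `percentile_rank` is illustrative, and the sketch assumes that the score appears in the list and that there are no tied scores.

```python
def percentile_rank(score, scores):
    """P = (n / N) * 100, where n is the rank of the score counted from the lowest."""
    ordered = sorted(scores)
    rank = ordered.index(score) + 1   # rank 1 is the lowest score
    return rank / len(ordered) * 100

scores = [30, 50, 60, 75]
print(percentile_rank(60, scores))
```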
A few other related statistical measures are the median, quartiles and deciles. We can understand the concept of percentile with the help of the median. The median can be defined as the point that divides the scores into two equal parts: 50% of the data lies below and the other 50% lies above the median point.
Quartiles divide a given score distribution into four equal parts, described as the 1st, 2nd, 3rd and 4th quartiles. Similarly, deciles divide the distribution into 10 equal parts, described as the 1st to 10th deciles.
Percentile and percentile rank can be used in the field of social sciences and
humanities to indicate a particular score of an individual or an item’s position
with reference to other scores and items of the test or distribution.
To identify and rank students’ performances in various exams and various co-
curricular activities.
To rank and identify the performance of companies, organisations and institutions in various fields.
In-Text Questions
2. z-scores tell us how far above or below the mean a value is in:
a) Mean units b) Standard deviation units
c) Range units d) Raw score units
3. z-scores can be taken as the common standard to compare different kinds of
datasets
a) True b) False
4. The standard deviation of the z-score is:
a) 0 b) 10
c) 1 d) 100
5. The z-score above 0 means
a) All sample values are equal b) Sample values are below the mean
c) Sample values are above the mean d) None of the above
6. What does a negative z-score imply?
a) Value of the score is equal to the mean
b) Value of the score is greater than the mean
c) Value of the score is less than the mean
d) All of above
The normal distribution is defined by two parameters: the mean (µ) and the standard deviation (σ). The mean determines the centre of the distribution, while the standard
deviation measures the spread of the data. The distribution is fully specified by these
two parameters, and once they are known, we can calculate probabilities for any
range of values.
The normal distribution has several important properties, including:
1. The distribution is symmetric around the mean.
2. The mean, median, and mode are all equal.
3. The total area under the curve is equal to 1.
4. The tails of the distribution extend infinitely in both directions.
5. The shape of the distribution is determined by the mean and standard deviation.
The normal distribution is widely used in statistics and scientific research, and is applied
in many fields including:
1. Quality control: The normal distribution is used to model the distribution of
values in a process, and to set control limits that can be used to detect when the
process is out of control.
2. Inferential statistics: Many statistical tests and models rely on the assumption
of normality, and the normal distribution is used to calculate probabilities and
confidence intervals.
3. Financial modelling: The normal distribution is used to model stock prices
and returns, and to calculate probabilities of different investment outcomes.
4. Epidemiology: The normal distribution is used to model the distribution of a
disease or illness in a population, and to estimate the probability of different
outcomes.
5. Psychology: The normal distribution is used to model many psychological
variables, such as IQ scores and personality traits.
In summary, the normal probability distribution is a fundamental tool in statistics
and scientific research, providing a way to model many real-world phenomena and to
make predictions about future outcomes.
3.10.1 Finding Areas when the Score is known and Finding Scores when
the Area is known
A z-score locates a value on the bell curve, so it can be used to find the area under
the curve up to that value. Conversely, a known area under the curve can be used to
find the corresponding z-score.
If you are using a standard normal distribution table, find the row that corresponds
to the z-score up to its first decimal place and the column that corresponds to its
second decimal place. The cell where that row and column meet gives the
area to the left of the z-score.
Table 6: (Stu Z Table - University of Arizona)
For example:
If the z-score is 1.25, you would look in the row that corresponds to 1.2 (the first part
of 1.25) and the column that corresponds to 0.05 (the last part of 1.25). The
corresponding cell in the table would give you the area to the left of 1.25, which is
0.8944.
To calculate the area to the right of a given z-score, subtract the area to the left
of the z-score from 1.0. Using the example above, the area to the right of 1.25 would
be:
1.0 - 0.8944 = 0.1056.
So, the area to the right of 1.25 is 0.1056 as a proportion (out of 1), that is, 10.56
percent of the population.
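The table lookup above can also be checked numerically. The following is a minimal sketch, not part of the original lesson: it assumes Python's standard `math` module and computes the area under the standard normal curve from the error function instead of reading it from a printed table. The function names `area_left` and `area_right` are illustrative.

```python
import math

def area_left(z):
    """Proportion of the standard normal curve lying to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_right(z):
    """Proportion lying to the right of z (the total area under the curve is 1)."""
    return 1.0 - area_left(z)

left = area_left(1.25)
right = area_right(1.25)
print(round(left, 4))   # 0.8944, the table value for z = 1.25
print(round(right, 4))  # 0.1056
```

The result agrees with the table to four decimal places; small differences in later decimals can appear for other z-scores because printed tables are rounded.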
Calculating z-score from the area:
Using the same standard normal distribution table, find the area in the body of the
table and read off the corresponding z-score. The table typically gives the area to
the left of the z-score, so if you need the z-score for a given area to the right of
it, first subtract that area from 1 and then look up the result.
For example, suppose you want to find the z-score that has an area
of 0.95 to its left. In the body of a standard normal distribution table, the two
areas closest to 0.95 are 0.9495 and 0.9505, found in the row for 1.6 under the
columns for 0.04 and 0.05. The area 0.95 lies midway between them, so the
corresponding z-score is approximately
1.645.
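The reverse lookup can be sketched in code as well. This example is an illustration, not the lesson's own method: it assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist` with an `inv_cdf` method that returns the z-score for a given area to the left. The helper names are hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def z_for_area_left(area):
    """z-score that has the given area to its left."""
    return standard_normal.inv_cdf(area)

def z_for_area_right(area):
    """z-score that has the given area to its right (subtract from 1 first)."""
    return standard_normal.inv_cdf(1.0 - area)

z_left = z_for_area_left(0.95)
z_right = z_for_area_right(0.05)
print(round(z_left, 3))   # 1.645
print(round(z_right, 3))  # 1.645, because 0.05 to the right means 0.95 to the left
```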
Another application is that the normal probability curve (NPC) can be used to
determine the percentage of individuals whose scores fall between two given scores.
Example:
In a sample of 1000 cases, the mean is 14.5 and the SD is 2.5. Assuming normality,
how many individuals scored between 12 and 16?
First convert both the raw scores into z-scores.
z1 = (12 - 14.5)/2.5 = -1
z2 = (16 - 14.5)/2.5 = 0.6
From the table, 22.57 percent of cases lie between z = 0 and z = 0.6. Because the
curve is symmetric, the area between z = -1 and z = 0 equals the area between 0 and 1,
which is 34.13 percent. Hence the percentage of cases between 12 and 16 is
22.57 + 34.13 = 56.70 percent. The total number of cases is 1000, so 56.70% of 1000,
that is, about 567 individuals, lie between
these two points.
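The same example can be reproduced as a short calculation. This is a sketch under the example's own assumption of normality; the helper `area_left` is illustrative and uses the error function from Python's `math` module in place of the printed table.

```python
import math

def area_left(z):
    """Proportion of the standard normal curve lying to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean, sd, n_cases = 14.5, 2.5, 1000

# Convert both raw scores into z-scores.
z1 = (12 - mean) / sd
z2 = (16 - mean) / sd

# Area between the two z-scores = area left of z2 minus area left of z1.
proportion = area_left(z2) - area_left(z1)
count = round(n_cases * proportion)

print(z1, z2)                      # -1.0 0.6
print(round(proportion * 100, 2))  # 56.71 (the rounded table values give 56.70)
print(count)                       # 567
```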
In-Text Questions
7. What is one advantage of T-score?
a) It cannot be used to interpret data
b) There are no negative or decimal numbers
c) There are no advantages of T-score
8. Calculate T-score for a value with a z-score of 0.5.
9. Calculate raw score for a value with a mean of 50, z-score of 0.5 and standard
deviation of 2.
10. What is one disadvantage of STEN and Stanine-scores?
11. Percentile rank range is from:
a) 1-10
b) 1-59
c) 0-99
d) 1-99
3.11 SUMMARY
3.12 ANSWERS TO IN-TEXT QUESTIONS
1. a) P40 = 72.74
b) Percentile rank of the score 82 = $\left[28 + \left(\frac{0.5}{1}\right) \times 2\right] \times \frac{100}{30} = 96.66$
2. a
3. a
4. c
5. c
6. c
7. b
8. 55
9. 51
10. The scales are not equally sized.
11. d
3.13 GLOSSARY
STEN-score: It is a standard score in which the test scores are scaled on ten
points of the normal scale.
T-score: It is a normalized standard score with a mean of 50 and SD of 10
points.
z-score: It is a normalized standard score with a mean of 0 and SD of 1 point.
3.15 REFERENCES
Aron, A., Coups, E.J. & Aron, E.N. (2013). Statistics for Psychology (6th Ed.).
Pearson Education.
Asthana, H.S. & Bhushan, Braj (2007). Statistics for social sciences (with SPSS
applications). New Delhi: Prentice Hall of India.
Field, A. (2009). Discovering Statistics using SPSS (3rd Ed). New Delhi: Sage.
Garrett, H.E. (2005). Statistics in Psychology and Education. Paragon International
Publishers.
LESSON 4
ANALYSIS OF RELATIONSHIPS
Dr. Deepesh Rathore
Assistant Professor
Department of Psychology
Lakhsmibai College, University of Delhi
Email-Id: [email protected]
Structure
4.1 Learning Objectives
4.2 Introduction
4.3 Understanding Correlation
4.3.1 Scatter Diagram
4.3.2 Components of Correlation: Direction and Magnitude
4.3.3 Meaning of Correlation
4.4 Calculating Pearson’s Correlation
4.5 Correlation and causation
4.6 Effects of Linear Score Transformations
4.7 Factors Influencing Correlation
4.8 Spearman Rank Correlation Method
4.9 Linear Regression Analysis/Simple Regression
4.10 Summary
4.11 Answers to In-Text Questions
4.12 Glossary
4.13 Self-Assessment Questions
4.14 References
4.15 Suggested Readings
4.2 INTRODUCTION
Charles Darwin was a famous scientist who worked on the concept of the evolution of
species through natural selection. In his study he identified that there are variations
among species that help them adapt to their environment, thus ensuring their
survival. This finding inspired Francis Galton (Darwin's cousin) to carry out research
on individual differences. He wanted to understand the role of inheritance in the
stature of children. For this, he collected data on the heights of parents and their
offspring, and then tabulated these data.
Table 1: Illustration of Galton’s Data (Not the Original Data)
In-Text Questions
1. _______ is a distribution that shows the relationship between two variables.
2. _______ is considered as a statistical technique used to understand the level
of association between variables.
Table 2: Scores Obtained by the Students on their IQ Test and CGPA
9 97 5.5
10 101 7
11 103 8
12 122 9
13 111 8
14 100 8
15 98 7.5
16 105 7
17 109 8.5
18 110 8.5
19 118 9
20 100 8
We can also draw a line passing through the cluster of dots on the diagram.
This line indicates the nature of the relationship between the two variables, which is
linear. This implies that as IQ scores increase, CGPA also
increases.
[Figure 1: scatter plot; axis label: CGPA]
But this is not always the case. Many variables do not
share a linear relationship, which means that we cannot draw a straight line
hugging the points on the scatter plot. Instead, a
curved line fits the dots, and hence we call the relationship curvilinear, as
shown in figure 2.
[Figure 2: scatter plot; axis label: Happiness]
Keep in mind that the discussion about prediction from correlation is
applicable only when the relationship between the variables is linear, not curvilinear.
In order to draw a scatter plot, we use the following steps:
Step 1: We start by assigning the labels X and Y to the variables.
Step 2: Next, we mark the values of the variables on the x-axis and y-axis. On the
x-axis, values run from lower on the left to higher on the right; on the y-axis, they
run from lower at the bottom to higher at the top.
Step 3: After marking the values, we find the value of Y that corresponds to a given
value of X and mark the intersection with a dot.
Step 4: Repeat step 3 for all the values.
Step 5: Name each axis and add the title of the graph.
Figure 3: Scatter Plot Representing Perfect Positive Correlation (x-axis: Variable X)
On the other hand, the opposite can also be the case: as the value of one
variable increases, the value of the other variable decreases, forming a
downward trend line from the upper left corner to the lower right corner, as shown in figure 4.
Figure 4: Scatter Plot Representing Perfect Negative Correlation (x-axis: Variable X; y-axis: Variable Y)
Finally, in figure 5 we can see that there is no clear relationship between the
variables. In terms of the direction, there is no clarity, as higher values of one variable
are related to the higher as well as lower values of the other variable. Hence creating
a non-directional scatter plot.
Figure 5: Scatter Plot Representing No Correlation (x-axis: Variable X; y-axis: Variable Y)
So far, we have talked about one of the components of correlation, which is
direction. Now we will discuss the second component, which is magnitude. Magnitude
refers to the strength of the relationship between the variables. As previously
discussed, the value of r ranges from -1 to +1. The sign (+ or -)
indicates the direction of the relationship, while the
numerical value represents its magnitude. The closer the value of r is
to ±1, the stronger the correlation, irrespective of the direction. For
example, a correlation of +0.70 has the same strength as one of -0.70; the difference
is in the direction: one is positive and the other is negative.
Figure 6: (a) Scatter Diagram Showing High Positive Correlation; (b) Moderate to Low Positive Correlation (x-axis: Variable X; y-axis: Variable Y)
In this section we will try to answer the question 'what does it mean when
we say the correlation between variables X and Y is r = +.85?'. Before answering this
question, a few aspects of correlation need to be clear. Firstly,
correlation represents the degree of linear relationship between variables; it does not
mean that one variable is causing changes in the other variable. Secondly, when we
compare correlation coefficients, we cannot say that r = +.80 is twice as large as
r = +.40; correlation coefficients should not be read as
percentages. How, then, should the difference in magnitude between correlation
coefficients be interpreted? One answer is
obtained by taking the cases that fall above the median on the 1st
variable and looking at the percentage of them that fall above or below the median
on the 2nd variable (Michael, 1966).
Table 3: Representing Meaning of Correlation w.r.t the
Percentage of Cases for the 2nd Variable
that all the cases above the median on IQ will also be above the median on CGPA.
On the other hand, if there is no correlation (r = 0.00), then it means that only
50% of those who are above the median on IQ will also be above the median on
CGPA.
In-Text Questions
3. _______is also known as scatter plot.
4. We can use linear relationship in a scatter plot. (True/False)
5. +1 represents a _______type of correlation.
6. _______implies strength of relationship between variables.
So far, we have discussed what correlation is and how the relationship between
variables can be plotted using a scatter plot. In this section we are going to see how
we can calculate correlation coefficients between variables. One of the most widely
accepted and used correlation coefficients is Pearson's product moment correlation
coefficient.
Pearson's correlation coefficient can be calculated using two methods:
In the first method (the standard score method), we convert the raw scores of both
variables into z-scores, calculate the sum of the products of each pair of z-scores
(also known as the cross-products), and then divide this sum by the total number of
pairs of scores.
$$r = \frac{\sum (z_X z_Y)}{n} \tag{a}$$
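The standard score method can be sketched in a few lines of code. This is a minimal illustration, not the lesson's own material: the data in `x` and `y` are hypothetical, and the standard deviation is computed by dividing by n (not n - 1), which is what formula (a) requires for r to stay within -1 and +1.

```python
import math

def z_scores(values):
    """Convert raw scores to z-scores, using the SD computed with n in the denominator."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_r_from_z(x, y):
    """Formula (a): sum of the cross-products of z-scores divided by the number of pairs."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

# Hypothetical scores for five individuals (not taken from the lesson).
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

r = pearson_r_from_z(x, y)
print(round(r, 2))  # 0.8
```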
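In the second method, using the deviation score formula, we can directly calculate
Pearson's correlation coefficient from raw scores. Here, we first calculate the sum
of the products of the deviation scores for each pair of scores and then divide the
result by the product of the number of pairs of scores and the standard deviations of
the two variables.
Mathematically,
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n S_X S_Y}$$
where
$$S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}, \qquad S_Y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n}}$$
Substituting,
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \sqrt{\frac{\sum (X - \bar{X})^2}{n}} \sqrt{\frac{\sum (Y - \bar{Y})^2}{n}}}$$
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
We know that,
$$SS_X = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$
$$SS_Y = \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$
Therefore,
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{(SS_X)(SS_Y)}} \tag{b}$$
where
$$\sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$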
Let's understand how to calculate r using the raw score formula with the help of an
example.
Table 4: Raw Score Method Solution
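$$SS_X = \sum X^2 - \frac{(\sum X)^2}{n} = 1618 - \frac{(124)^2}{10} = 80.4$$
$$SS_Y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 876 - \frac{(88)^2}{10} = 101.6$$
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{(SS_X)(SS_Y)}}$$
$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{(SS_X)(SS_Y)}} = \frac{79.8}{\sqrt{80.4 \times 101.6}} = .88$$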
The calculation of r using the raw score formula can be summarized in the following steps:
Step 1: After writing the raw scores, calculate the following:
$\sum X$, $\sum Y$, $\sum X^2$, $\sum Y^2$, and $\sum XY$
Step 2: Calculate the sums of squares for both variables and the sum of the cross-products of deviations:
$SS_X$, $SS_Y$ and $\sum (X - \bar{X})(Y - \bar{Y})$
Step 3: Substitute the values calculated in steps 1 and 2 in formula (b).
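The three steps can be traced with the sums from the worked example (n = 10, ΣX = 124, ΣY = 88, ΣX² = 1618, ΣY² = 876). This is a sketch rather than the lesson's own code. The value ΣXY = 1171 does not appear in the surviving text; it is inferred here from the cross-product sum of 79.8 via ΣXY = 79.8 + (124)(88)/10, so it should be read as an assumption. The function name is illustrative.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Raw score route to formula (b), starting from the Step 1 sums."""
    ss_x = sum_x2 - sum_x ** 2 / n     # sum of squares for X
    ss_y = sum_y2 - sum_y ** 2 / n     # sum of squares for Y
    sp = sum_xy - sum_x * sum_y / n    # sum of cross-products of deviations
    r = sp / math.sqrt(ss_x * ss_y)
    return r, ss_x, ss_y, sp

# sum_xy = 1171 is inferred from the worked example, not printed in it.
r, ss_x, ss_y, sp = pearson_r_from_sums(
    n=10, sum_x=124, sum_y=88, sum_x2=1618, sum_y2=876, sum_xy=1171
)

print(round(ss_x, 1), round(ss_y, 1), round(sp, 1))  # 80.4 101.6 79.8
print(round(r, 2))                                   # 0.88
```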
In-Text Questions
7. Following data represents the scores obtained by psychology students on
their level of motivation and self-esteem.
Motivation (X): 12, 16, 14, 12, 13, 20, 24
Self-esteem (Y): 5, 8, 7, 4, 3, 10, 12
Calculate Pearson's correlation coefficient using deviation score formula.
8. In _____ method we convert raw scores into z-scores.
9. In _____ method we can directly calculate Pearson's correlation coefficient
using raw scores.
One of the most important aspects of correlation is that a correlation between
two or more variables only means that the variables are associated with each
other. This association, or shared variation, does not mean that
one variable is causing changes in another variable, i.e., correlation does not imply
causation. For example, suppose a study claims that use of a
new medicine is positively related to improvement in diabetes. Based on this
association, can we conclude that the medicine is effective in the treatment of diabetes?
The answer is that the medicine may be effective, but
we cannot be sure of this conclusion because we have not taken into account
whether the patients had an active lifestyle (exercising,
walking, eating healthily, eating less sugar and carbohydrates), or their age, weight, etc. All
these variables may influence the relationship between the medicine and improvement
in diabetes.
This does not mean that correlation is unimportant or that the variables do
not influence each other. They may influence each other directly or indirectly, and a
correlation acts as a starting point for further studies. When we say that one
variable (X) causes change in another variable (Y), it means that Y can be predicted
on the basis of X. Hence, correlation by itself establishes association between
variables; it does not establish the cause-and-effect relationship on which such
causal prediction rests.
4.7 FACTORS INFLUENCING CORRELATION
1. Sample size: When the sample size is small, then the correlation coefficient is
slightly unstable, but as the sample size increases, the correlation coefficient
becomes more reliable.
2. Nature of sample: The correlation coefficient between variables changes as we
change the sample. In other words, the correlation between two variables is not
fixed; it depends upon the sample that we collect, and different samples result in
different correlation coefficients.
3. Linear relationship: The correlation coefficient is an appropriate measure of the
relationship between variables only when that relationship
is linear, as shown in figure 1, where a straight line can be
drawn through most of the dots on the scatter plot.
4. Variability of scores: When the variability of scores is restricted, that is,
the scores are concentrated in a narrow range, the correlation coefficient tends
to be lower. On the other hand, when the scores cover a wide range, the
correlation coefficient tends to be higher.
5. Discontinuity in scores (missing values): When there are missing values in
one variable or both variables, then correlation coefficient overestimates the
strength of relationship between the variables. In other words, value of correlation
coefficient increases because of missing values.
In-Text Questions
10. Correlation implies causation of two variables. (True/False)
11. Linear transformation can be acquired by converting raw scores into standard
scores. (True/False)
12. When there are missing values in one variable or both variables it is called
_______.
4.8 SPEARMAN RANK CORRELATION METHOD
Spearman rank correlation is a statistical measure that evaluates the strength and direction
of association between two variables. Unlike Pearson correlation, which assesses
linear relationships, Spearman correlation is based on the ranks of the data points
rather than their actual values. It’s particularly useful when dealing with ordinal or non-
normally distributed data.
The formula for calculating the Spearman rank correlation coefficient (ρ) is as
follows:
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
The following steps are adopted for calculating Spearman Rank Correlation:
Step 1: Ranking the Data: First, the data for both variables are ranked
independently from lowest to highest. If there are ties (i.e., identical values), the ranks
are averaged. Let X and Y be two variables at ordinal data level. Let rank X represent
the order in which the values of X occur, and likewise rank Y represent the
corresponding order in which values of Y occur. Each value of X is associated with a
value of Y – they form pairs of values.
Step 2: Calculating the Differences in Ranks: Next, the difference between the
ranks of each paired observation is calculated: $d_i = R_{X_i} - R_{Y_i}$. These
differences represent the deviations from perfect correlation.
Step 3: Squaring the Differences: Each difference is squared ($d_i^2$), so that
positive and negative differences do not cancel out.
Step 4: Summing the Squared Differences: The squared differences are then
summed across all pairs of observations:
Sum of squared differences: $\sum d_i^2$
Step 5: Calculating the Spearman Rank Correlation Coefficient: Finally, the
Spearman correlation coefficient (often denoted by the symbol ρ) is calculated using a
formula that incorporates the sum of squared differences and the sample size:
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
Where:
$d_i$ represents the difference between the ranks of paired observations.
n is the number of paired observations.
This coefficient ranges from –1 to 1, where:
ρ = 1 indicates a perfect positive monotonic relationship (i.e., as one variable
increases, the other variable also increases).
ρ = –1 indicates a perfect negative monotonic relationship (i.e., as one variable
increases, the other variable decreases).
ρ = 0 indicates no monotonic relationship between the variables.
Spearman rank correlation is robust to outliers and does not assume linearity or
homoscedasticity, making it applicable to a wide range of data types. It’s commonly
used in various fields such as psychology, sociology, economics, and biology to explore
relationships between variables that may not conform to the assumptions of parametric
tests. However, it’s important to note that Spearman correlation does not imply causation;
it merely measures the strength and direction of association between variables.
Question 1: A researcher wants to investigate the relationship between the number
of hours spent studying and the exam scores of 8 students. The data collected are as
follows:
Student   Hours Studied (X)   Exam Score (Y)
1         5                   70
2         7                   85
3         4                   60
4         6                   75
5         3                   55
6         8                   90
7         6                   80
8         4                   65
Calculate the Spearman rank correlation coefficient for this data.
Solution:
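Rank the data for both variables from lowest to highest, giving tied values the
average of the ranks they occupy:
Student   Hours Studied (X)   Rank (X)   Exam Score (Y)   Rank (Y)   di = RXi - RYi   di²
1         5                   4          70               4          0                0
2         7                   7          85               7          0                0
3         4                   2.5        60               2          0.5              0.25
4         6                   5.5        75               5          0.5              0.25
5         3                   1          55               1          0                0
6         8                   8          90               8          0                0
7         6                   5.5        80               6          -0.5             0.25
8         4                   2.5        65               3          -0.5             0.25
Sum                                                                  0                1
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(1)}{8(64 - 1)} = 1 - \frac{6}{504} \approx 0.988$$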
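The ranking and summing steps can be sketched in code for the data of Question 1. This is an illustration under stated assumptions, not the lesson's own solution: tied values receive the average of the ranks they occupy, as Step 1 describes, and ρ is then computed with the sum-of-squared-differences formula. When ties are present that formula is only approximate, which this sketch does not correct for. The function names are hypothetical.

```python
def average_ranks(values):
    """Rank from lowest (1) to highest; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Return rho and the sum of squared rank differences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

hours = [5, 7, 4, 6, 3, 8, 6, 4]
scores = [70, 85, 60, 75, 55, 90, 80, 65]

rho, sum_d2 = spearman_rho(hours, scores)
print(average_ranks(hours))  # [4.0, 7.0, 2.5, 5.5, 1.0, 8.0, 5.5, 2.5]
print(sum_d2)                # 1.0
print(round(rho, 3))         # 0.988
```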
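Question 2: Given the following scores on two variables, calculate the Spearman rank
correlation coefficient:
Variable X: 2, 4, 6, 8, 10
Variable Y: 5, 10, 15, 20, 25
Solution:
Rank each variable from lowest to highest before taking differences; the differences
must be computed between the ranks, not between the scores themselves.
X    Y    RXi   RYi   di = RXi - RYi   di²
2    5    1     1     0                0
4    10   2     2     0                0
6    15   3     3     0                0
8    20   4     4     0                0
10   25   5     5     0                0
Sum                   0                0
$$\rho = 1 - \frac{6(0)}{5(25 - 1)} = 1$$
The two variables are ranked in exactly the same order, so the correlation is perfect
and positive.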
Q1. Eleven candidates appeared for railway exams and their scores on reasoning
tests and aptitude tests are provided below. Calculate Spearman’s rank
correlation.
A 20 30
B 50 60
C 28 50
D 25 40
E 70 85
F 90 90
G 76 56
H 45 82
I 30 42
J 19 31
K 26 49
Q2. The following are the scores of 12 students in Physics and Maths. To what
extent is the knowledge of students in the 2 subjects related?
Student A B C D E F G H I J K L
Maths 80 45 55 56 58 60 65 68 70 75 85 90
Physics 82 86 50 48 60 62 64 65 70 74 90 75
Q3. The following are the ranks obtained by 10 candidates in history and political
science entrance tests. Is there a relation between the candidates' knowledge in
the two subjects? If yes, what is the extent of the relation?
Rank in Pol. Science 4 1 6 7 5 8 10 9 2 3
Sl. No. X Y
1. 12 21
2. 15 25
3. 24 35
4. 20 24
5. 8 16
6. 15 18
7. 20 25
8. 20 16
9. 11 16
10. 26 38
Q5. Following are the ranks obtained by players in two online video games. Determine
whether their skills in the two games are related and to what extent.
Player           A B C D E F G H I J
Rank in Game 1   1 2 3 4 5 6 7 8 9 10
Rank in Game 2   6 7 5 10 3 9 4 1 8 2
4.9 LINEAR REGRESSION ANALYSIS/SIMPLE
REGRESSION
The simple linear regression model can be written as:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Where:
Y is the dependent variable.
X is the independent variable.
$\beta_0$ is the intercept (the value of Y when X = 0).
$\beta_1$ is the slope (the change in Y for a one-unit change in X).
$\varepsilon$ is the error term, representing the difference between the observed and
predicted values of Y.
The parameters $\beta_0$ and $\beta_1$ are estimated from the data using a method such as
ordinary least squares (OLS). OLS minimizes the sum of the squared differences
between the observed and predicted values of the dependent variable.
Steps involved in performing linear regression:
1. Data Collection: Collect data on the dependent and independent variables
of interest.
2. Data Exploration: Explore the data to understand the relationship between
the variables, check for outliers, and assess the assumptions of linear
regression.
3. Model Building: Choose the appropriate regression model (simple linear
regression, multiple linear regression, etc.) based on the number of
independent variables and the nature of the relationship.
4. Parameter Estimation: Use statistical techniques like OLS to estimate the
parameters (intercept and coefficients) of the regression model.
5. Model Evaluation: Evaluate the goodness of fit of the model using measures
such as the coefficient of determination (R²), adjusted R², and residual
analysis.
6. Prediction and Inference: Use the fitted regression model to make
predictions about the dependent variable for new or unseen data. Additionally,
conduct hypothesis tests and confidence interval estimation for the regression
coefficients to make inferences about the relationship between the variables.
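Steps 4 to 6 can be sketched for the simple (one-predictor) case. This is a minimal illustration and not the lesson's own procedure: the data are hypothetical, the slope is the OLS estimate (sum of cross-products of deviations divided by the sum of squares of X), the intercept follows from the two means, and R² is computed as the squared correlation, which holds only for simple regression. The function name is illustrative.

```python
def fit_simple_regression(x, y):
    """OLS estimates for Y = b0 + b1*X, plus R squared for the fitted line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    slope = sp / ss_x                      # change in Y per one-unit change in X
    intercept = mean_y - slope * mean_x    # predicted Y when X = 0
    r_squared = sp ** 2 / (ss_x * ss_y)    # proportion of variance in Y explained
    return intercept, slope, r_squared

# Hypothetical data (not taken from the lesson's exercises).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1, r2 = fit_simple_regression(x, y)
predicted = b0 + b1 * 6  # prediction for a new value X = 6

print(round(b0, 2), round(b1, 2))  # 2.2 0.6
print(round(r2, 2))                # 0.6
print(round(predicted, 2))         # 5.8
```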
Linear regression is widely used in various fields, including economics, finance,
social sciences, engineering, and machine learning. It provides a simple yet powerful
tool for analyzing and predicting the behavior of continuous variables based on their
relationships with other variables. However, it’s essential to assess the assumptions of
linear regression and interpret the results with caution, especially in the presence of
non-linear relationships or influential outliers.
Q1. A company manufactures an electronic device which can be used in a wide
range of temperatures. The company knows that increased temperature shortens
the lifespan of the device. The study is conducted where lifespan of device is
determined as a function of temperature and the following data is found:
x 5 2 6 8 9 3 7
y 3 7 4 10 5 6 4
Q3. A science teacher recorded the length of time, y minutes, taken to travel to
school when leaving home x minutes after 8 am on each day of the week. The
results are as follows:
x 0 10 20 30 40 50 60
y 18 27 28 39 39 48 51
4.10 SUMMARY
Correlation coefficient can take values between -1 and +1, where -1 represents
perfect negative correlation, +1 represents perfect positive correlation, and 0
represents no correlation.
Correlation does not imply causation. It means that there is only an association
between the variables and not a cause-and-effect relationship.
Linear transformation involves changing each raw score by adding a constant,
subtracting a constant, multiplying by a constant, or dividing by a constant. These
changes in the raw scores do not influence the magnitude of the correlation
coefficient; only multiplying or dividing by a negative constant reverses its sign.
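The point about linear transformations can be checked with a short calculation. This is a sketch with hypothetical data, not part of the original summary: it computes r from deviation scores, then recomputes it after adding, multiplying and dividing by constants, and finally after multiplying one variable by a negative constant.

```python
import math

def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical raw scores (not taken from the lesson).
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

r_raw = pearson_r(x, y)
# Multiply X by 3 and add 10; divide Y by 2 and subtract 1.
r_transformed = pearson_r([3 * v + 10 for v in x], [v / 2 - 1 for v in y])
# Multiplying one variable by a negative constant reverses only the sign.
r_flipped = pearson_r([-v for v in x], y)

print(round(r_raw, 2), round(r_transformed, 2), round(r_flipped, 2))  # 0.8 0.8 -0.8
```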
4.11 ANSWERS TO IN-TEXT QUESTIONS
1. Bivariate distribution
2. Correlation
3. Scatter diagram
4. False
5. Perfect positive
6. Magnitude
7. r = +.94
8. Standard score
9. Deviation score formula
10. False
11. True
12. Discontinuity
4.12 GLOSSARY
4.14 REFERENCES
4.15 SUGGESTED READINGS
Aron, A., Aron, E.N., & Coups, E.J. (2007). Statistics for Psychology (4th Ed.).
Delhi: Prentice Hall of India.
Howitt, D. and Cramer, D. (2011). Introduction to Statistics in Psychology. London,
UK: Pearson Education Ltd.
Garrett, H.E (2005). Statistics in Psychology and Education. Delhi: Cosmo
Publications.
King, B.M. & Minium, E.W, (2007). Statistical Reasoning in the Behavioral Sciences
(5th Ed.). Noida: Wiley.
Mangal, S.K. (2012). Statistics in Psychology and Education (2nd Ed.). Delhi:
Prentice Hall of India.
N.K. Chadha (2009) Applied Psychometry. Sage Pub: New Delhi.
N.K. Chadha (1991) Statistics for Behavioral and Social Sciences. Reliance Pub.
House: New Delhi.
N.K. Chadha and R.L. Sehgal (1984) Statistical Methods in Psychology, ESS
Publications: New Delhi.