Correlation Coefficient
CORRELATION ANALYSIS
Concept of Correlation :
Correlation refers to the relationship between two or more variables. Simple correlation
studies the relationship between two variables. Correlation analysis attempts to
determine the degree of relationship between variables.
If the change in one variable is accompanied by the corresponding change in the other
variable, then we say that there is correlation between the given two variables. In other
words, these two variables are said to be correlated variables.
Types of Correlation
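(a) Positive Correlation : If an increase in one variable is accompanied by a corresponding increase in the other variable, or a decrease in one variable is accompanied by a corresponding decrease in the other, then we say that there is Positive Correlation. In this case the two variables move in the same direction.
Examples : Correlation between (i) sales and expenditure on advertisement; (ii) income and expenditure of households; (iii) production and labour; (iv) heights and weights of students etc., are cases of positive correlation.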
(b) Negative Correlation : If an increase in one variable is accompanied by a corresponding decrease in the other variable, or a decrease in one variable is accompanied by a corresponding increase in the other, then we say that there is Negative Correlation. In this case the two variables move in opposite directions.
Examples : Correlation between (i) Price and Demand of commodities; (ii) Sales and
Prices of Commodities; (iii) Volume and pressure of a perfect gas; (iv) sales of sweaters
and day temperature etc., are the cases of negative correlation.
Dr.Mokesh Rayalu,M.Sc,Ph.D.,
MAT2001–Statistics for Engineers
(i)
X 2 4 5 6 8 9
Y=3X 6 12 15 18 24 27
(ii)
X 1 2 3 4 5
Y=2+5X 7 12 17 22 27
(i)
X 1 2 5 8 9
Y=-2X -2 -4 -10 -16 -18
(ii)
X 1 2 3 4 5
Y=5 - 8X -3 -11 -19 -27 -35
The correlation between only two variables is called 'simple correlation', and the corresponding study is known as 'simple correlation analysis'. When more than two variables are involved, the correlation may be either partial or multiple correlation. Partial correlation is the correlation between any two variables in a group of more than two variables, where the linear effects of the other variables on them have been eliminated from both variables separately.
When we consider three or more variables at a time, multiple correlation reveals the joint
effect of a group of variables on a specified variable, which is not included in that group.
The following are different methods for studying correlation between two variables,
which are frequently used in practice :
1. Scatter diagram
2. Karl Pearson's coefficient of correlation (r)
3. Spearman's rank correlation coefficient (ρ) (read 'ρ' as 'rho')
4. Kendall's coefficient of concurrent deviations (τ) (read 'τ' as 'tau')
Scatter Diagram
A scatter diagram is the simplest graphical representation of bivariate data: the given set of 'n' pairs of observations on two variables X and Y, say (X1, Y1), (X2, Y2), …, (Xn, Yn), is plotted as dots, with X-values on the X-axis and Y-values on the Y-axis. From a scatter diagram we can get some idea about the correlation between X and Y.
In a scatter diagram, if the points lie close together and show either an upward or a downward trend, then there is a high degree of correlation. Otherwise, if the points are scattered and do not show any trend, then there is no correlation or a low degree of correlation between the variables.
A scatter diagram shows the direction of correlation, but it cannot measure the degree of correlation.
The above scatter diagrams give some idea about the various types of correlation.
i.e., $r_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$
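Consider a set of 'n' pairs of observations (X1, Y1), (X2, Y2), …, (Xn, Yn) on two variables X and Y. Then we have,
Covariance between X and Y :
$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} \quad\text{or}\quad \frac{\sum XY}{n} - \left(\frac{\sum X}{n}\right)\left(\frac{\sum Y}{n}\right) = \frac{1}{n}\left[\sum XY - \frac{\left(\sum X\right)\left(\sum Y\right)}{n}\right]$$
Standard Deviation of X :
$$\sigma_X = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} \quad\text{or}\quad \sqrt{\frac{\sum X^2}{n} - \bar{X}^2} = \sqrt{\frac{1}{n}\left[\sum X^2 - \frac{\left(\sum X\right)^2}{n}\right]}$$
Standard Deviation of Y :
$$\sigma_Y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n}} \quad\text{or}\quad \sqrt{\frac{\sum Y^2}{n} - \bar{Y}^2} = \sqrt{\frac{1}{n}\left[\sum Y^2 - \frac{\left(\sum Y\right)^2}{n}\right]}$$
Where, $\bar{X} = \dfrac{\sum X}{n}$, $\bar{Y} = \dfrac{\sum Y}{n}$ and n = Number of pairs of observations. By substituting these expressions, we get the following formula :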
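$$r_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} \quad\text{(OR)}\quad r_{XY} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}}$$
$$r_{XY} = \frac{\sum XY - \dfrac{\left(\sum X\right)\left(\sum Y\right)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{\left(\sum X\right)^2}{n}\right]\left[\sum Y^2 - \dfrac{\left(\sum Y\right)^2}{n}\right]}}$$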
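The last form of the formula needs only five running totals, so it translates directly into a few lines of code. The following is a minimal Python sketch, not a reference implementation; the function name `pearson_r` is an assumption made for illustration, and the sample data are the Y = 3X values tabulated earlier in these notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r from paired observations, using the raw-sums formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    # Numerator: sum(XY) - (sum X)(sum Y)/n
    numerator = sum_xy - sum_x * sum_y / n
    # Denominator: sqrt of the product of the two bracketed terms
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Y = 3X exactly, so the points lie on a rising straight line.
xs = [2, 4, 5, 6, 8, 9]
ys = [6, 12, 15, 18, 24, 27]
print(round(pearson_r(xs, ys), 4))  # 1.0
```

If either series is constant, the denominator is zero and the coefficient is undefined; this sketch lets Python raise `ZeroDivisionError` in that case rather than returning a value.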
1. The value of Karl Pearson's correlation coefficient lies between -1 and +1,
i.e., $-1 \le r_{XY} \le 1$
In other words, the limits of the correlation coefficient are $\pm 1$.
If $r_{XY} = +1$, then we can say that there is perfect positive correlation between X and Y.
If $r_{XY} = -1$, then we can say that there is perfect negative correlation between X and Y.
If $r_{XY}$ is greater than 0.5, then we can say that there is a high degree of positive correlation between X and Y.
If $r_{XY}$ is greater than 0.5 in magnitude with a negative sign (i.e., $r_{XY} < -0.5$), then we can say that there is a high degree of negative correlation between X and Y.
Similarly, we can speak of a low degree of positive correlation and a low degree of negative correlation based on the value of $r_{XY}$ (i.e., $0 < r_{XY} \le 0.5$ and $-0.5 \le r_{XY} < 0$ respectively).
2. If rxy=0 then the given two variables X and Y are called uncorrelated variables.
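The verbal rules above can be collected into a small lookup. This is only an illustrative Python sketch: the function name `describe_correlation` and the floating-point tolerances are assumptions, while the 0.5 cut-off is the one stated in these notes.

```python
from math import isclose

def describe_correlation(r):
    """Map a correlation coefficient to the verbal categories used in these notes."""
    # Allow a tiny tolerance for rounding error in computed coefficients.
    if not -1.0 - 1e-9 <= r <= 1.0 + 1e-9:
        raise ValueError("r must lie between -1 and +1")
    if isclose(r, 1.0):
        return "perfect positive correlation"
    if isclose(r, -1.0):
        return "perfect negative correlation"
    if isclose(r, 0.0, abs_tol=1e-12):
        return "uncorrelated"
    if r > 0.5:
        return "high degree of positive correlation"
    if r < -0.5:
        return "high degree of negative correlation"
    if r > 0:
        return "low degree of positive correlation"
    return "low degree of negative correlation"

print(describe_correlation(0.8))    # high degree of positive correlation
print(describe_correlation(-0.15))  # low degree of negative correlation
```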
SOLVED PROBLEMS
Problem Find the coefficient of correlation between the ages of brothers (X) and ages of
sisters (Y) from the following data :
X 23 27 28 28 29 30 31 33 35 36
Y 18 20 22 27 21 29 27 29 28 29
Solution : Given,
           X      Y      X²     Y²     XY
          23     18     529    324    414
          27     20     729    400    540
          28     22     784    484    616
          28     27     784    729    756
          29     21     841    441    609
          30     29     900    841    870
          31     27     961    729    837
          33     29    1089    841    957
          35     28    1225    784    980
          36     29    1296    841   1044
Total    300    250    9138   6414   7623
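$$r_{XY} = \frac{\sum XY - \dfrac{\left(\sum X\right)\left(\sum Y\right)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{\left(\sum X\right)^2}{n}\right]\left[\sum Y^2 - \dfrac{\left(\sum Y\right)^2}{n}\right]}}$$
$$= \frac{7623 - \dfrac{(300)(250)}{10}}{\sqrt{\left[9138 - \dfrac{(300)^2}{10}\right]\left[6414 - \dfrac{(250)^2}{10}\right]}} = \frac{7623 - 7500}{\sqrt{(9138 - 9000)(6414 - 6250)}} = \frac{123}{\sqrt{(138)(164)}} = \frac{123}{150.4394} = 0.8176$$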
Problem : Find the correlation coefficient between X and Y using the following data;
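$$r_{XY} = \frac{\sum XY - \dfrac{\left(\sum X\right)\left(\sum Y\right)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{\left(\sum X\right)^2}{n}\right]\left[\sum Y^2 - \dfrac{\left(\sum Y\right)^2}{n}\right]}}$$
Substituting n = 25, $\sum X = 30$, $\sum Y = 40$, $\sum X^2 = 650$, $\sum Y^2 = 460$ and $\sum XY = 508$,
$$r_{XY} = \frac{508 - \dfrac{(30)(40)}{25}}{\sqrt{\left[650 - \dfrac{(30)^2}{25}\right]\left[460 - \dfrac{(40)^2}{25}\right]}} = \frac{460}{493.0963} = 0.9329$$
$r_{XY} = 0.9329$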
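When only the totals are available, as in this problem, the coefficient can be computed without the raw observations. A minimal Python sketch follows; the function name `pearson_r_from_totals` and its parameter names are illustrative assumptions, and the figures are the ones substituted in the working above.

```python
from math import sqrt

def pearson_r_from_totals(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Karl Pearson's r from n and the five totals of X, Y, X^2, Y^2 and XY."""
    numerator = sum_xy - sum_x * sum_y / n
    bracket_x = sum_x2 - sum_x ** 2 / n   # n times the variance of X
    bracket_y = sum_y2 - sum_y ** 2 / n   # n times the variance of Y
    return numerator / sqrt(bracket_x * bracket_y)

r = pearson_r_from_totals(n=25, sum_x=30, sum_y=40,
                          sum_x2=650, sum_y2=460, sum_xy=508)
print(round(r, 4))  # 0.9329
```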
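Then we can reduce the original data, consisting of larger values of X and Y, to deviations data with smaller values of U and V.
$$r_{XY} = r_{UV} = \frac{\sum UV - \dfrac{\left(\sum U\right)\left(\sum V\right)}{n}}{\sqrt{\left[\sum U^2 - \dfrac{\left(\sum U\right)^2}{n}\right]\left[\sum V^2 - \dfrac{\left(\sum V\right)^2}{n}\right]}}$$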
Problem : Calculate correlation coefficient between the ages of brothers (X) and
ages of sisters (Y) from the following data.
X 23 27 28 28 29 30 31 33 35 36
Y 18 20 22 27 21 29 27 29 28 29
           X         Y       U=X-A   V=Y-B     U²     V²     UV
          23        18        -5      -11      25    121     55
          27        20        -1       -9       1     81      9
          28        22         0       -7       0     49      0
          28 = A    27         0       -2       0      4      0
          29        21         1       -8       1     64     -8
          30        29 = B     2        0       4      0      0
          31        27         3       -2       9      4     -6
          33        29         5        0      25      0      0
          35        28         7       -1      49      1     -7
          36        29         8        0      64      0      0
Total    300       250        20      -40     178    324     43
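$$r_{XY} = r_{UV} = \frac{\sum UV - \dfrac{\left(\sum U\right)\left(\sum V\right)}{n}}{\sqrt{\left[\sum U^2 - \dfrac{\left(\sum U\right)^2}{n}\right]\left[\sum V^2 - \dfrac{\left(\sum V\right)^2}{n}\right]}}$$
$$r_{XY} = r_{UV} = \frac{43 - \dfrac{(20)(-40)}{10}}{\sqrt{\left[178 - \dfrac{(20)^2}{10}\right]\left[324 - \dfrac{(-40)^2}{10}\right]}} = \frac{43 + 80}{\sqrt{(178 - 40)(324 - 160)}} = \frac{123}{\sqrt{(138)(164)}} = \frac{123}{150.4394} = 0.8176$$
This shows that there is a high degree of positive correlation between the ages of brothers and sisters.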
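The claim that shifting the origin leaves the coefficient unchanged can be checked numerically. The sketch below is illustrative Python under the assumption that A = 28 and B = 29 are the assumed origins, as marked in the table; `pearson_r` is a helper name introduced here, not something defined in the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r using the raw-sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    numerator = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    bracket_x = sum(x * x for x in xs) - sum_x ** 2 / n
    bracket_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return numerator / sqrt(bracket_x * bracket_y)

ages_brothers = [23, 27, 28, 28, 29, 30, 31, 33, 35, 36]
ages_sisters = [18, 20, 22, 27, 21, 29, 27, 29, 28, 29]

A, B = 28, 29                           # assumed origins for X and Y
us = [x - A for x in ages_brothers]     # U = X - A
vs = [y - B for y in ages_sisters]      # V = Y - B

r_xy = pearson_r(ages_brothers, ages_sisters)
r_uv = pearson_r(us, vs)
print(round(r_xy, 4), round(r_uv, 4))   # 0.8176 0.8176
```

Any other choice of A and B gives the same coefficient; values near the middle of each series simply keep the hand arithmetic small.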
X-Series Y-Series
rXY = 0.8905
Problem : From the data given below, find the correlation coefficient: $\sum xy = 130$, $\sum x^2 = 90$, $\sum y^2 = 640$ and n = 10. Here x and y are deviations from the arithmetic means.
Solution : Given,
$\sum xy = \sum (X - \bar{X})(Y - \bar{Y}) = 130$
$\sum x^2 = \sum (X - \bar{X})^2 = 90$
$\sum y^2 = \sum (Y - \bar{Y})^2 = 640$, n = 10
Hence,
$$r_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} = \frac{130}{\sqrt{(90)(640)}} = \frac{130}{\sqrt{57600}} = \frac{130}{240} = 0.5417$$
$r_{XY} = 0.5417$
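$$r_{UV} = \frac{\sum UV - \dfrac{\left(\sum U\right)\left(\sum V\right)}{n}}{\sqrt{\left[\sum U^2 - \dfrac{\left(\sum U\right)^2}{n}\right]\left[\sum V^2 - \dfrac{\left(\sum V\right)^2}{n}\right]}}$$
$$r = \frac{60 - \dfrac{(40)(0)}{10}}{\sqrt{\left[180 - \dfrac{(40)^2}{10}\right]\left[215 - \dfrac{(0)^2}{10}\right]}} = \frac{60}{\sqrt{20 \times 215}} = \frac{60}{65.5744} = 0.915$$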
It was, however, later discovered at the time of checking that two pairs of observations were not correctly copied. They were taken as (8, 10) and (12, 7), while the correct values were (8, 12) and (10, 8). Obtain the correct value of the correlation coefficient between X and Y.
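Solution: Given, n = 30, $\sum X = 120$, $\sum X^2 = 600$, $\sum Y = 90$, $\sum Y^2 = 250$, $\sum XY = 356$
Now, we have,
Corrected $\sum X$ = 120 – 20 + 18 = 118
Corrected $\sum Y$ = 90 – 17 + 20 = 93
Corrected $\sum X^2$ = 600 – 208 + 164 = 556
Corrected $\sum Y^2$ = 250 – 149 + 208 = 309
Corrected $\sum XY$ = 356 – 164 + 176 = 368
$$\text{Corrected } r_{XY} = \frac{\sum XY - \dfrac{\left(\sum X\right)\left(\sum Y\right)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{\left(\sum X\right)^2}{n}\right]\left[\sum Y^2 - \dfrac{\left(\sum Y\right)^2}{n}\right]}} = \frac{368 - \dfrac{(118)(93)}{30}}{\sqrt{\left[556 - \dfrac{(118)^2}{30}\right]\left[309 - \dfrac{(93)^2}{30}\right]}}$$
$$r_{XY} = \frac{2.2}{\sqrt{1901.6407}} = \frac{2.2}{43.6078} = 0.0504$$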
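The correction step is a matter of subtracting the contribution of each wrongly copied pair from every total and adding the contribution of its replacement. The Python sketch below illustrates this under stated assumptions: the function and key names are hypothetical, and the second wrong pair is taken as (12, 7), which is what the corrections 90 – 17 + 20 and 600 – 208 + 164 in the working imply.

```python
from math import sqrt

def correct_totals(totals, wrong_pairs, right_pairs):
    """Remove wrongly copied (x, y) pairs from the totals and add the correct ones."""
    corrected = dict(totals)  # n is unchanged: pairs are replaced, not dropped
    for sign, pairs in ((-1, wrong_pairs), (1, right_pairs)):
        for x, y in pairs:
            corrected["sum_x"] += sign * x
            corrected["sum_y"] += sign * y
            corrected["sum_x2"] += sign * x * x
            corrected["sum_y2"] += sign * y * y
            corrected["sum_xy"] += sign * x * y
    return corrected

def pearson_r_from_totals(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Karl Pearson's r from n and the five totals."""
    numerator = sum_xy - sum_x * sum_y / n
    return numerator / sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))

reported = {"n": 30, "sum_x": 120, "sum_y": 90,
            "sum_x2": 600, "sum_y2": 250, "sum_xy": 356}
corrected = correct_totals(reported,
                           wrong_pairs=[(8, 10), (12, 7)],
                           right_pairs=[(8, 12), (10, 8)])
r_corrected = pearson_r_from_totals(**corrected)
print(corrected)
print(r_corrected)
```

With these inputs the corrected totals come out as 118, 93, 556, 309 and 368, and the coefficient evaluates to roughly 0.0504, i.e. a very low degree of positive correlation.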
Remark : The expression $\dfrac{1 - r^2}{\sqrt{n}}$ is known as the standard error (S.E.) of the sample correlation coefficient. The details of this concept will be discussed in a later chapter.
The probable error of the correlation coefficient determines the limits $[r \pm \mathrm{P.E.}(r)]$ within which the population correlation coefficient may be expected to lie.
Solution :
         Exports X   Imports Y      X²       Y²       XY
             42          56        1764     3136     2352
             44          49        1936     2401     2156
             58          53        3364     2809     3074
             55          58        3025     3364     3190
             89          65        7921     4225     5785
             98          76        9604     5776     7448
             66          58        4356     3364     3828
Total       452         415       31970    25075    27833
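$$r = \frac{\sum XY - \dfrac{\left(\sum X\right)\left(\sum Y\right)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{\left(\sum X\right)^2}{n}\right]\left[\sum Y^2 - \dfrac{\left(\sum Y\right)^2}{n}\right]}} = \frac{27833 - \dfrac{(452)(415)}{7}}{\sqrt{\left[31970 - \dfrac{(452)^2}{7}\right]\left[25075 - \dfrac{(415)^2}{7}\right]}}$$
$$r = \frac{27833 - 26797.143}{\sqrt{(31970 - 29186.286)(25075 - 24603.571)}} = \frac{1035.8571}{\sqrt{(2783.7143)(471.4286)}} = \frac{1035.8571}{1145.5665}$$
r = 0.9042
$$\mathrm{P.E.}(r) = 0.6745 \, \frac{1 - r^2}{\sqrt{n}} = 0.6745 \, \frac{1 - (0.9042)^2}{\sqrt{7}} = 0.6745 \times \frac{0.1824}{2.6458} = 0.0465$$
P.E.(r) = 0.0465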
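The standard error and probable error are simple enough to wrap in two helper functions. This is a minimal Python sketch with assumed function names; the constant 0.6745 and the inputs r = 0.9042, n = 7 are the ones used in the working above.

```python
from math import sqrt

def standard_error(r, n):
    """S.E. of the sample correlation coefficient: (1 - r^2) / sqrt(n)."""
    return (1 - r ** 2) / sqrt(n)

def probable_error(r, n):
    """P.E.(r) = 0.6745 * S.E.(r)."""
    return 0.6745 * standard_error(r, n)

r, n = 0.9042, 7
pe = probable_error(r, n)
lower, upper = r - pe, r + pe   # limits r -/+ P.E.(r)
print(round(pe, 4))             # 0.0465
print(round(lower, 4), round(upper, 4))
```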
The rank correlation coefficient is useful for finding the correlation between any two qualitative characteristics, such as beauty, honesty, intelligence etc., which cannot be measured quantitatively but can be arranged serially in order of merit or proficiency in the two characteristics.
Suppose we associate the ranks to individuals or items in two series based on order of
merit, the Spearman's Rank correlation coefficient is given by
$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$ [Read the symbol $\rho$ as 'Rho'.]
Where,
$\sum d^2$ = Sum of squares of differences of ranks between paired items in the two series
n = Number of paired items
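As a rough illustration of this formula, the Python sketch below computes the coefficient directly from two lists of ranks, assuming there are no ties. The function name `spearman_rho_from_ranks` is an assumption; the ranks are those given by the two judges in the aptitude-test problem of these notes.

```python
def spearman_rho_from_ranks(ranks_x, ranks_y):
    """Spearman's rho = 1 - 6*sum(d^2) / (n(n^2 - 1)), valid when no ranks are tied."""
    if len(ranks_x) != len(ranks_y):
        raise ValueError("both series must rank the same number of items")
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_1 = [4, 5, 2, 7, 8, 1, 6, 9, 3, 10]
judge_2 = [8, 3, 9, 10, 6, 7, 2, 5, 1, 4]
print(round(spearman_rho_from_ranks(judge_1, judge_2), 4))  # -0.1515
```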
Remarks
In any series, if two or more observations have the same value, they are said to be tied observations. If a tie occurs for two or more observations in a series, then a common rank has to be given to the tied observations in that series; this common rank is the average of the ranks which these observations would have assumed if they were slightly different from each other, and the next observation gets the rank next to the ranks already assumed.
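The averaging rule for tied ranks can be sketched in a few lines of Python. This is an illustration only: the function name `average_ranks` is an assumption, and so is the convention that the largest value receives rank 1, which matches the rank tables in these notes. The sample values are the heights from the tied-ranks problem.

```python
def average_ranks(values):
    """Rank values in descending order (largest = rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of observations equal to the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos+1 .. end+1 would have been assigned; share their average.
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1   # the next observation takes the next unused rank
    return ranks

heights = [140, 142, 140, 160, 150, 155, 160, 157, 140, 170]
print(average_ranks(heights))
# [9.0, 7.0, 9.0, 2.5, 6.0, 5.0, 2.5, 4.0, 9.0, 1.0]
```

The three observations of 140 would have taken ranks 8, 9 and 10, so each receives 9; the two observations of 160 would have taken ranks 2 and 3, so each receives 2.5.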
In the case of data with tied observations, Spearman's rank correlation coefficient is given by
$$\rho = 1 - \frac{6 \left(\text{Adj} \sum d^2\right)}{n(n^2 - 1)}$$
Where
$$\text{Adj} \sum d^2 = \sum d^2 + \frac{S_1^3 - S_1}{12} + \frac{S_2^3 - S_2}{12} + \frac{S_3^3 - S_3}{12} + \dots$$
Here,
S1 is the number of times the first tied observation is repeated
S2 is the number of times the second tied observation is repeated
S3 is the number of times the third tied observation is repeated, etc.
Remarks
Problem : In a quantitative aptitude test, two judges rank the ten competitors in
the following order.
Competitor             1   2   3   4   5   6   7   8   9  10
Ranking of judge I     4   5   2   7   8   1   6   9   3  10
Ranking of judge II    8   3   9  10   6   7   2   5   1   4
  Rx     Ry    d = Rx-Ry    d²
   4      8       -4        16
   5      3        2         4
   2      9       -7        49
   7     10       -3         9
   8      6        2         4
   1      7       -6        36
   6      2        4        16
   9      5        4        16
   3      1        2         4
  10      4        6        36
                 Total     190
$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{(6)(190)}{10(100 - 1)} = 1 - 1.1515 = -0.1515$$
We say that there is low degree of negative rank correlation between the
two judges.
Problem : Twelve recruits were subjected to selection test to ascertain their suitability
for a certain course of training. At the end of training they were given a proficiency test.
The marks scored by the recruits are recorded below :
Recruit                   1   2   3   4   5   6   7   8   9  10  11  12
Selection Test Score     44  49  52  54  47  76  65  60  63  58  50  67
Proficiency Test Score   48  55  45  60  43  80  58  50  77  46  47  65
Solution: Let the selection test score be a variable X and the proficiency test score be a variable Y. We associate ranks to the scores based on their magnitudes (the highest score gets rank 1).
$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
n = Number of recruits
Given,
   X     Y     RX    RY    d = RX-RY    d²
  44    48     12     8        4        16
  49    55     10     6        4        16
  52    45      8    11       -3         9
  54    60      7     4        3         9
  47    43     11    12       -1         1
  76    80      1     1        0         0
  65    58      3     5       -2         4
  60    50      5     7       -2         4
  63    77      4     2        2         4
  58    46      6    10       -4        16
  50    47      9     9        0         0
  67    65      2     3       -1         1
  --    --     --    --      Total      80
$\sum d^2 = 80$, n = 12
$$\rho = 1 - \frac{(6)(80)}{12(144 - 1)} = 1 - 0.2797 = 0.7203$$
$\rho = 0.7203$
We say that there is high degree of positive rank correlation between the
scores of selection and proficiency tests.
Problem : Following is the data on heights and weights of ten students in a class:
Heights (in cm)   140  142  140  160  150  155  160  157  140  170
Weights (in kg)    43   45   42   50   45   52   57   48   49   53
Solution:
Let height be a variable X and weight be a variable Y. Since the data contain tied observations, we associate average ranks to the tied observations.
$$\rho = 1 - \frac{6 \left(\text{Adj} \sum d^2\right)}{n(n^2 - 1)}$$
Where,
$$\text{Adj} \sum d^2 = \sum d^2 + \frac{S_1^3 - S_1}{12} + \frac{S_2^3 - S_2}{12} + \dots$$
n = Number of students
    X     Y     RX     RY    d = RX-RY     d²
  140    43     9      9         0         0
  142    45     7      7.5      -0.5       0.25
  140    42     9     10        -1         1
  160    50     2.5    4        -1.5       2.25
  150    45     6      7.5      -1.5       2.25
  155    52     5      3         2         4
  160    57     2.5    1         1.5       2.25
  157    48     4      6        -2         4
  140    49     9      5         4        16
  170    53     1      2        -1         1
   --    --    --     --       Total      33.00
n = 10, $\sum d^2 = 33$, S1 = 3, S2 = 2, S3 = 2
$$\text{Thus, Adj} \sum d^2 = 33 + \frac{3^3 - 3}{12} + \frac{2^3 - 2}{12} + \frac{2^3 - 2}{12} = 33 + 2 + 0.5 + 0.5 = 36$$
$$\rho = 1 - \frac{(6)(36)}{10(100 - 1)} = 1 - 0.2182 = 0.7818$$
$\rho = 0.7818$
We say that there is high degree of positive rank correlation between
heights and weights of students.
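The whole tied-ranks calculation can be reproduced end to end in code. The following Python sketch is one possible arrangement, not the method prescribed by these notes: the function names are assumptions, and the ranks are obtained by counting (number of larger values plus half of one more than the number of equal values), which yields the same average ranks as the table above.

```python
from collections import Counter

def average_ranks(values):
    """Descending average ranks: (count of larger values) + (count of equal values + 1) / 2."""
    return [sum(v > x for v in values) + (values.count(x) + 1) / 2 for x in values]

def tie_correction(values):
    """Sum of (S^3 - S)/12 over every value that is repeated S > 1 times."""
    return sum((s ** 3 - s) / 12 for s in Counter(values).values() if s > 1)

def spearman_rho_with_ties(xs, ys):
    """Spearman's rho using the adjusted sum of squared rank differences."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    adj_sum_d2 = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adj_sum_d2 / (n * (n ** 2 - 1))

heights = [140, 142, 140, 160, 150, 155, 160, 157, 140, 170]
weights = [43, 45, 42, 50, 45, 52, 57, 48, 49, 53]
print(round(spearman_rho_with_ties(heights, weights), 4))  # 0.7818
```

Here the heights contribute two tie groups (140 three times, 160 twice) and the weights one (45 twice), giving the corrections 2, 0.5 and 0.5 used in the solution.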