Correlation Analysis
Introduction:
So far we have studied problems relating to one variable, such as measures of central tendency, measures of dispersion, measures of skewness, measures of kurtosis, etc. For example, one can find the mean height of 50 persons, the mean weight of 50 persons, the mean chest size of 50 persons, and so on. In all these examples we are dealing with one variable. A natural question now arises: is there any relation between these variables, i.e. height, weight and chest size? If yes, how much relation is there, and of what type is it? Similar questions arise if we take further examples such as the ages of husbands and their wives, income and expenditure, the price of a commodity and its demand, the amount of rainfall, daily temperature and the yield of a crop, price and supply, etc. To answer these questions one has to deal with two or more variables.
Bivariate Data:
Data collected simultaneously on two variables for each item under study is called bivariate data.
Scatter diagram:
The graphical representation of bivariate data is called a scatter diagram. This diagram can act as an instrument for the condensation of bivariate data, indicating the type of relation existing between the concerned variables.
For example, bivariate data on the heights of fathers (x) and their sons (y) is given below:
x: 64  65  66  67  67  68  68  69  70  72
y: 62  63  68  68  66  70  69  72  68  71
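For illustration, here is a minimal sketch of how such a scatter diagram could be drawn (assuming Python with matplotlib is available; the lists simply reuse the pairs from the table above):

```python
# Scatter diagram for the father/son height pairs tabulated above.
import matplotlib.pyplot as plt

fathers = [64, 65, 66, 67, 67, 68, 68, 69, 70, 72]   # x
sons    = [62, 63, 68, 68, 66, 70, 69, 72, 68, 71]   # y

plt.scatter(fathers, sons)
plt.xlabel("Height of father (x)")
plt.ylabel("Height of son (y)")
plt.title("Scatter diagram of heights of fathers and their sons")
plt.show()
```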
Correlation:
The term correlation (or covariation) indicates the relation between two variables in which, with changes in the values of one variable, the values of the other variable also change.
Most variables show some kind of relationship; for example, there is a relation between price and supply, income and expenditure, etc. With the help of correlation analysis we can measure in one figure the degree and direction of the relationship between the variables.
Once we know that two variables are closely related, we can estimate the value of one variable given the value of the other by using regression analysis.
The study of correlation reduces the range of uncertainty associated with decision making. Predictions based on correlation analysis are likely to be more valuable and nearer to reality.
Correlation analysis is very helpful in understanding economic behaviour.
In business, correlation analysis enables the executive to estimate costs, sales, prices and other variables on the basis of some other series with which these costs, sales or prices may be functionally related.
Thus correlation studies are very widely used for a variety of purposes and are considered to be basic tools for the detailed analysis and interpretation of statistical data relating to two or more variables.
Types of correlation:
Correlation is classified in several different ways. Three of the most important ways of classifying correlation are:
1. Positive and Negative Correlation
2. Simple, Partial and Multiple Correlation
3. Linear and Non-linear Correlation
1. Positive and Negative Correlation:
Whether correlation is positive or negative depends on the direction of change of one variable with a change in the other variable. If both variables move in the same direction, the correlation is called positive; in other words, if an increase or decrease in the values of one variable is accompanied by an increase or decrease in the values of the other variable, it is called positive correlation.
For example: heights and weights, income and expenditure, rainfall and yield of crops, supply and price of a commodity, etc. This is also called direct correlation.
On the other hand, if the variables move in opposite directions to each other, the correlation is called negative or inverse correlation; in other words, if an increase or decrease in the values of one variable is accompanied by a decrease or increase in the values of the other variable, it is called negative correlation. This is also known as indirect correlation.
For example: price and demand, temperature and sales of woollen garments, etc.
Examples:
Positive correlation:
X: 10   15   20   30   40
Y: 120  127  140  170  180

X: 15   10   8    5    3
Y: 100  80   75   50   30
Negative correlation:
X: 10   15   20   35   50
Y: 150  127  125  100  75

X: 80   65   50   40   25
Y: 500  600  650  750  875
2. Simple, Partial and Multiple Correlation:
If the relationship between only two variables is studied, it is called simple correlation.
Ex: the relationship between heights and weights, income and expenditure, rainfall and production of crops, price and demand, etc.
In multiple correlation we study the relationship between three or more factors together.
Ex: the correlation between the yield of rice and both the amount of rainfall and the amount of fertilizer used.
In partial correlation, though more than two factors are involved, the correlation is studied only between two factors while the other factors are assumed to be constant.
Ex: consider the three factors yield of rice, amount of rainfall and temperature over different time periods; if we limit our study to the correlation between the yield of rice and the amount of rainfall by assuming that the temperature remains constant, it becomes a problem of partial correlation.
3. Linear and Non-linear Correlation:
If the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, the correlation is said to be linear; otherwise it is non-linear. For example, the following data show a linear relationship (Y = 10X):

X: 1   2   3   4   5   6   7
Y: 10  20  30  40  50  60  70

[Figure: scatter plot of these values showing positive linear correlation.]
Note: Since the techniques for measuring non-linear correlation are far more complicated than those for linear correlation, we generally assume that the relationship between the variables is of the linear type.
Methods of studying correlation:
The commonly used methods for studying simple correlation between two variables are:
i) Scatter Diagram Method
ii) Karl Pearson's Coefficient of Correlation
iii) Spearman's Rank Correlation
1) Scatter Diagram Method
This is the simplest method for determining the relationship between two variables. Under this method the given bivariate data are plotted on graph paper in the form of dots, i.e. for each pair of values of X and Y we put a dot, and thus obtain as many dots as the number of paired observations. By looking at the scatter of the points we can form an idea as to whether the variables are related or not. If the plotted points show an upward trend from the bottom left towards the top right, the correlation is said to be positive, and if the points run downward from the top left towards the bottom right, the correlation is negative. If all the points lie on a straight line running from the bottom left up to the top right, it is treated as perfect positive correlation; if the points lie on a straight line running from the top left down to the bottom right, it is perfect negative correlation. If the points lie in a haphazard manner, there is no correlation between the two variables. This method is also known as the Dot Diagram Method.
Interpretations:
[Scatter diagrams illustrating perfect positive correlation, perfect negative correlation and low degree negative correlation.]
2) Karl Pearson's Coefficient of Correlation
Of the several methods of measuring correlation, this method is very popular in practice. It is also known as Pearson's correlation coefficient.
It is a mathematical method to measure the linear relationship between two variables. Karl Pearson's correlation coefficient between two random variables X and Y, usually denoted by $r_{XY}$, $r(X, Y)$ or simply $r$, is defined as

$r_{XY} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$

where $\operatorname{Cov}(X, Y)$ = covariance between X and Y, $\sigma_X$ = standard deviation of X, and $\sigma_Y$ = standard deviation of Y. In terms of the observations,

$r_{XY} = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}$
Equivalently,

$r_{XY} = \dfrac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}}$
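As a rough illustration, the sketch below (plain Python, no external libraries; it reuses the father/son heights from the earlier table) computes r once from the covariance form and once from the computational form above; the two results agree.

```python
from math import sqrt

# Father/son heights from the earlier table.
x = [64, 65, 66, 67, 67, 68, 68, 69, 70, 72]
y = [62, 63, 68, 68, 66, 70, 69, 72, 68, 71]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Covariance form: r = Cov(X, Y) / (sigma_X * sigma_Y)
cov_xy  = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
sigma_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
sigma_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
r1 = cov_xy / (sigma_x * sigma_y)

# Computational form: r = (sum(xy)/n - x_bar*y_bar) / sqrt((sum(x^2)/n - x_bar^2)(sum(y^2)/n - y_bar^2))
r2 = (sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar) / sqrt(
    (sum(xi ** 2 for xi in x) / n - x_bar ** 2)
    * (sum(yi ** 2 for yi in y) / n - y_bar ** 2)
)

print(r1, r2)   # both forms give the same value of r
```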
Interpretation of the value of r on the scale from -1 to +1:
r = -1: perfectly negative correlation; -1 < r < -0.5: high negative; -0.5 < r < 0: low negative; r = 0: no correlation; 0 < r < 0.5: low positive; 0.5 < r < 1: high positive; r = +1: perfectly positive correlation.
Limits of the correlation coefficient ($-1 \le r_{XY} \le 1$):
Consider the non-negative quantity

$\dfrac{1}{n}\sum_{i=1}^{n}\left(\dfrac{x_i - \bar{x}}{\sigma_x} \pm \dfrac{y_i - \bar{y}}{\sigma_y}\right)^2 \ge 0$

Expanding,

$\dfrac{1}{\sigma_x^2}\,\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \dfrac{1}{\sigma_y^2}\,\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \pm \dfrac{2}{\sigma_x \sigma_y}\,\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \ge 0$

$1 + 1 \pm 2 r_{xy} \ge 0$

$2 \pm 2 r_{xy} \ge 0 \quad \ldots (1)$

From (1), taking the two signs separately:

$2 + 2 r_{xy} \ge 0 \;\Rightarrow\; 1 + r_{xy} \ge 0 \;\Rightarrow\; r_{xy} \ge -1 \quad \ldots (2)$

$2 - 2 r_{xy} \ge 0 \;\Rightarrow\; 1 - r_{xy} \ge 0 \;\Rightarrow\; r_{xy} \le 1 \quad \ldots (3)$

From (2) and (3), $-1 \le r_{xy} \le 1$.
The correlation coefficient is independent of change of origin and scale. If

$U = \dfrac{X - A}{c}$ and $V = \dfrac{Y - B}{d}$ (with $c > 0$, $d > 0$),

so that $x_i = c u_i + A$, $y_i = d v_i + B$, $\bar{x} = c\bar{u} + A$ and $\bar{y} = d\bar{v} + B$, then the Karl Pearson coefficient of correlation is

$r_{XY} = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

$= \dfrac{\frac{cd}{n}\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\frac{c^2}{n}\sum_{i=1}^{n}(u_i - \bar{u})^2}\;\sqrt{\frac{d^2}{n}\sum_{i=1}^{n}(v_i - \bar{v})^2}}$

$= \dfrac{\frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}(v_i - \bar{v})^2}} = r_{UV}$

Hence $r_{XY} = r_{UV}$, i.e. the correlation coefficient is unaffected by a change of origin and scale.
The Karl Pearson correlation coefficient $r_{XY}$ is the geometric mean of the two regression coefficients, i.e. $r_{xy} = \sqrt{b_{xy}\, b_{yx}}$.
$r_{XY}$ works both ways, i.e. $r_{XY} = r_{YX}$.
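A quick numerical check of the change-of-origin-and-scale property (a sketch in plain Python; the values A = 67, B = 68, c = 2, d = 5 are arbitrary illustrative choices, not from the text):

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient r = Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - xb) ** 2 for a in x) / n)
    sy = sqrt(sum((b - yb) ** 2 for b in y) / n)
    return cov / (sx * sy)

x = [64, 65, 66, 67, 67, 68, 68, 69, 70, 72]
y = [62, 63, 68, 68, 66, 70, 69, 72, 68, 71]

# Change of origin and scale: U = (X - A)/c, V = (Y - B)/d with c, d > 0.
A, B, c, d = 67, 68, 2, 5
u = [(xi - A) / c for xi in x]
v = [(yi - B) / d for yi in y]

print(pearson_r(x, y), pearson_r(u, v))   # equal (up to floating-point rounding): r_XY = r_UV
```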
The standard error of r is given by

$S.E.(r) = \dfrac{1 - r^2}{\sqrt{n}}$

and the probable error of r is $P.E.(r) = 0.6745 \times \dfrac{1 - r^2}{\sqrt{n}}$.
Coefficient of determination:
The coefficient of determination is defined as the square of the correlation coefficient r; thus it is given by $r^2$. It can also be described as the ratio of the explained variance to the total variance:

$\text{Coefficient of determination} = r^2 = \dfrac{\text{Explained variance}}{\text{Total variance}}$
Note:
If the variables are related by an equation $aX + bY + c = 0$, then the correlation coefficient between X and Y is r = +1 or r = -1 according as a and b have opposite signs or the same sign.
Ex:
1. If X and Y are related by the equation 2X + 3Y = 10, then the correlation coefficient between X and Y is r = -1.
2. If X and Y are related by the equation 2X - 4Y = 1, then the correlation coefficient between X and Y is r = +1.
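A small check of this note (plain Python; the x values are arbitrary): generating Y exactly from 2X + 3Y = 10 makes r come out as -1.

```python
from math import sqrt

# If 2X + 3Y = 10 exactly, then Y = (10 - 2X)/3 and r should be -1.
x = [1, 2, 3, 4, 5]
y = [(10 - 2 * xi) / 3 for xi in x]

n = len(x)
xb, yb = sum(x) / n, sum(y) / n
cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / n
r = cov / sqrt(
    sum((a - xb) ** 2 for a in x) / n * sum((b - yb) ** 2 for b in y) / n
)
print(r)   # -1.0 (up to floating-point rounding)
```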
Example: 1) Calculate Karl Pearson's coefficient of correlation for the following data.
Example: 2) Calculate Karl Pearson's coefficient of correlation for the following data and find the probable error.
3) Spearman's Rank Correlation
Spearman's rank correlation coefficient is given by

$r_s = 1 - \dfrac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

where $d_i = x_{(i)} - y_{(i)}$, $(i = 1, 2, \ldots, n)$, $x_{(i)}$ = rank of X and $y_{(i)}$ = rank of Y.
Derivation: Since the ranks of X are simply the numbers 1, 2, ..., n in some order, the mean of the ranks of X is

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_{(i)} = \dfrac{1}{n}(1 + 2 + 3 + \cdots + n) = \dfrac{n+1}{2}$

Similarly, the mean of the ranks of Y is

$\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_{(i)} = \dfrac{1}{n}(1 + 2 + 3 + \cdots + n) = \dfrac{n+1}{2}$

so that $\bar{x} = \bar{y}$. Also

$\sigma_x^2 = \dfrac{n^2 - 1}{12}$ and $\sigma_y^2 = \dfrac{n^2 - 1}{12}$, so $\sigma_x = \sigma_y$.

Now

$d_i = x_{(i)} - y_{(i)} = (x_{(i)} - \bar{x}) - (y_{(i)} - \bar{y})$  [since $\bar{x} = \bar{y}$]

$\dfrac{1}{n}\sum d_i^2 = \dfrac{1}{n}\sum \left[(x_{(i)} - \bar{x}) - (y_{(i)} - \bar{y})\right]^2$

$= \dfrac{1}{n}\sum (x_{(i)} - \bar{x})^2 + \dfrac{1}{n}\sum (y_{(i)} - \bar{y})^2 - \dfrac{2}{n}\sum (x_{(i)} - \bar{x})(y_{(i)} - \bar{y})$

$= \sigma_x^2 + \sigma_y^2 - 2 r \sigma_x \sigma_y$

$= \sigma_x^2 + \sigma_x^2 - 2 r \sigma_x \sigma_x$  [since $\sigma_x = \sigma_y$]

$= 2\sigma_x^2 - 2 r \sigma_x^2 = 2\sigma_x^2 (1 - r)$

Therefore

$1 - r = \dfrac{\frac{1}{n}\sum d_i^2}{2\sigma_x^2}$, i.e. $r = 1 - \dfrac{\frac{1}{n}\sum d_i^2}{2\,\frac{n^2 - 1}{12}}$

$r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$
Note: The limits of Spearman's rank correlation coefficient are $-1 \le r_s \le 1$.
Interpretations of the rank correlation coefficient:
1. If $r_s > 0$ and nearer to +1: strong agreement between the rankings.
2. If $r_s < 0$ and nearer to -1: strong disagreement between the rankings.
3. If $r_s = 1$: perfect agreement.
4. If $r_s = -1$: perfect disagreement.
5. If $r_s = 0$: no agreement.
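A minimal sketch of the formula for untied ranks (plain Python; the "marks given by two judges" data are hypothetical):

```python
def spearman_rank_correlation(x, y):
    """Spearman's r_s = 1 - 6*sum(d_i^2) / (n(n^2 - 1)),
    assuming there are no tied values within x or within y."""
    n = len(x)
    # Rank 1 goes to the smallest value (any consistent convention works).
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    d_sq = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# Hypothetical marks awarded by two judges to five candidates (no repeated marks).
judge1 = [48, 33, 40, 9, 16]
judge2 = [13, 10, 24, 6, 15]
print(spearman_rank_correlation(judge1, judge2))   # 0.5
```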
When ranks are equal:
In the process of assigning ranks, if some of the items have the same value then each of those items receives the same rank, namely the average of the ranks they would otherwise have received. In this case an adjustment is made in the above formula for the rank correlation coefficient. The adjustment consists of adding $\dfrac{m^3 - m}{12}$ to the value of $\sum d_i^2$ as many times as a tie occurs in both the variables; here m stands for the number of items whose ranks are common. Thus

$r_s = 1 - \dfrac{6\left[\sum d_i^2 + \dfrac{m^3 - m}{12} + \dfrac{m^3 - m}{12} + \cdots\right]}{n(n^2 - 1)}$
Note: The answers obtained by Spearman's method and Karl Pearson's method will be the same provided no values are repeated, i.e. all the items are different.
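A sketch of the tie-corrected calculation (plain Python; the average-rank assignment and the (m^3 - m)/12 correction follow the adjustment described above, and the marks data are hypothetical):

```python
from collections import Counter

def spearman_with_ties(x, y):
    """Spearman's r_s with the tie correction: (m^3 - m)/12 is added to sum(d_i^2)
    once for every group of m tied values, in either variable."""
    n = len(x)

    def average_ranks(values):
        # Tied items share the average of the ranks they would otherwise occupy.
        order = sorted(range(n), key=lambda i: values[i])
        ranks = [0.0] * n
        i = 0
        while i < n:
            j = i
            while j + 1 < n and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = ((i + 1) + (j + 1)) / 2   # average of positions i+1 .. j+1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    def tie_correction(values):
        return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

    rx, ry = average_ranks(x), average_ranks(y)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    d_sq += tie_correction(x) + tie_correction(y)
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

marks_x = [15, 20, 28, 12, 40, 60, 20, 80]   # 20 occurs twice
marks_y = [40, 30, 50, 30, 20, 10, 30, 60]   # 30 occurs three times
print(spearman_with_ties(marks_x, marks_y))
```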
Merits:
1. This method is simpler to understand and easier to calculate than Pearson's method.
2. Where the data are of a qualitative nature, such as honesty or efficiency, this method can be used with great advantage.
3. Even when actual data are given, the rank method can be applied for ascertaining correlation.
Demerits:
1. This method cannot be used for finding the correlation in a grouped frequency distribution.
2. When the number of observations is large, this method becomes tedious.
Method of Concurrent Deviations:
The coefficient of correlation by the concurrent deviation method is

$r_c = \pm\sqrt{\pm\dfrac{2c - n}{n}}$

where $r_c$ stands for the coefficient of correlation by the concurrent deviation method, c stands for the number of concurrent deviations, i.e. the number of positive signs obtained after multiplying $D_x$ with $D_y$, and n = number of pairs of observations compared.
Steps:
1. Find out the direction of change of the X variable, i.e. as compared with the first value, whether the second value is increasing, decreasing or constant. If it is increasing put a + sign, if it is decreasing put a - (minus) sign, and if it is constant put zero. Similarly, as compared with the second value, find out whether the third value is increasing, decreasing or constant, and repeat the same process for the remaining values. Denote this column by $D_x$.
2. In the same manner, find out the direction of change of the Y variable and denote this column by $D_y$.
3. Multiply $D_x$ with $D_y$ and determine the value of c, i.e. the number of positive signs.
4. Apply the above formula.
Note: The significance of the ± signs, both inside the under-root and outside it, is that we cannot take the under-root of a negative quantity. Therefore, if (2c - n)/n is negative, this negative value multiplied by the minus sign inside the under-root makes it positive and we can take the under-root; but the ultimate result is negative. If (2c - n)/n is positive then, of course, we get a positive value of the coefficient of correlation.
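A sketch of the procedure in plain Python (the sign convention follows the note above; counting only strictly positive products of D_x and D_y as concurrent is our assumption, and the data values are hypothetical):

```python
from math import sqrt, copysign

def sign_of_change(values):
    """Direction of change of each value compared with the previous one: +1, -1 or 0."""
    return [0 if b == a else (1 if b > a else -1) for a, b in zip(values, values[1:])]

def concurrent_deviation_r(x, y):
    """r_c = +/- sqrt(+/- (2c - n)/n), where c = number of concurrent deviations and
    n = number of pairs compared (one less than the number of observations).
    Both signs are taken to follow the sign of (2c - n)/n, so the quantity under
    the root is never negative."""
    dx, dy = sign_of_change(x), sign_of_change(y)
    n = len(dx)
    c = sum(1 for a, b in zip(dx, dy) if a * b > 0)   # concurrent deviations
    q = (2 * c - n) / n
    return copysign(sqrt(abs(q)), q)

x = [60, 55, 50, 56, 30, 70, 40, 35, 80, 80, 75]
y = [65, 40, 35, 75, 63, 80, 35, 20, 80, 60, 60]
print(concurrent_deviation_r(x, y))
```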
Regression Analysis
Def: Regression is the measure of the average relationship between two or more variables in terms of the
original units of the data.
(Or)
Regression analysis is used to study the functional relationship between the variables and
thereby provides a mechanism for prediction or forecasting.
It is clear from the above definitions that regression analysis is a statistical device with the help of which we are in a position to estimate or predict the unknown values of one variable, called the dependent variable, from the known values of another variable, called the independent variable.
Regression lines:
The line describing the average relationship between two variables is known as the line of regression or the line of best fit.
If we take the case of two variables X and Y, we shall have two regression lines, namely the regression line of Y on X and the regression line of X on Y. The regression line of Y on X gives the most probable values of Y for given values of X, i.e. here X is the independent variable and Y is the dependent variable. The regression line of X on Y gives the most probable values of X for given values of Y, i.e. here Y is the independent variable and X is the dependent variable.
Relation between regression lines and correlation coefficient:
The regression lines drawn on the scatter diagram give an indication of the extent of correlation between the two variables.
1. If the two regression lines coincide, then the correlation coefficient r is +1 or -1, i.e. perfectly positive or perfectly negative depending on the direction of the regression lines.
2. If the two regression lines intersect at right angles, then r is equal to 0, i.e. there is no correlation.
3. If the two regression lines make a very small acute angle, r is very high, i.e. the closer the regression lines, the greater is the degree of correlation.
4. The farther the two regression lines are from each other, the lesser is the degree of correlation.
5. An upward slope of the regression lines indicates positive correlation and a downward slope indicates negative correlation.
[Diagrams: coincident regression lines for r = +1 and r = -1.]
1) The regression equation of Y on X is expressed as Y = a + bX. In this equation Y is the dependent variable and X is the independent variable.
2) The regression equation of X on Y is expressed as X = a + bY. In this equation X is the dependent variable and Y is the independent variable.
Here a and b are numerical constants.
Determination of regression lines:
There are in general two methods to determine the regression lines.
1. Method of eye inspection.
2. Method of least squares.
1. Method of eye inspection:
In this method, first take the independent variable on the X axis and the dependent variable on the Y axis and plot the points; these points give the picture of a scatter diagram. Now an average line is drawn through these scattered points by eye inspection such that the number of points above the line is almost equal to the number of points below the line. Although this method is simple, it is subject to human bias: the regression line varies from person to person and therefore the prediction also varies.
2. Method of least squares:
Let the regression line of Y on X be

$Y_c = a + bX \quad \ldots (1)$

According to the principle of least squares, the constants a and b are determined by minimizing the sum of the squares of the errors, $S = \sum (Y - a - bX)^2$.
Now, differentiating S partially with respect to a and equating it to zero:

$\dfrac{\partial S}{\partial a} = 0$

$\dfrac{\partial}{\partial a}\sum (Y - a - bX)^2 = 0$

$-2\sum (Y - a - bX) = 0$

$\sum (Y - a - bX) = 0$

$\sum Y = na + b\sum X \quad \ldots (2)$

Differentiating S partially with respect to b and equating it to zero:

$\dfrac{\partial S}{\partial b} = 0$

$\dfrac{\partial}{\partial b}\sum (Y - a - bX)^2 = 0$

$-2\sum (Y - a - bX)X = 0$

$\sum (Y - a - bX)X = 0$

$\sum XY = a\sum X + b\sum X^2 \quad \ldots (3)$
Equations (2) and (3) are called the normal equations. Solving these two normal equations we get the values of a and b.
From (2):

$\sum Y = na + b\sum X \;\Rightarrow\; \bar{Y} = a + b\bar{X} \;\Rightarrow\; a = \bar{Y} - b\bar{X} \quad \ldots (4)$

From (3), substituting $a = \bar{Y} - b\bar{X}$:

$\sum XY = (\bar{Y} - b\bar{X})\sum X + b\sum X^2$

$\dfrac{\sum XY}{n} = \bar{Y}\bar{X} - b\bar{X}^2 + b\dfrac{\sum X^2}{n}$

$\dfrac{\sum XY}{n} - \bar{X}\bar{Y} = b\left(\dfrac{\sum X^2}{n} - \bar{X}^2\right)$

$b = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X^2} \quad \ldots (5)$
We have

$r = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \;\Rightarrow\; \operatorname{Cov}(X, Y) = r\,\sigma_X \sigma_Y \quad \ldots (6)$

From (5) and (6) we get the value of b:

$b = \dfrac{r\,\sigma_X \sigma_Y}{\sigma_X^2} = r\dfrac{\sigma_Y}{\sigma_X} \quad \ldots (7)$

and, from (4),

$a = \bar{Y} - r\dfrac{\sigma_Y}{\sigma_X}\bar{X} \quad \ldots (8)$
Putting (7) and (8) in (1) we get

$Y_c = a + bX = \bar{Y} - r\dfrac{\sigma_Y}{\sigma_X}\bar{X} + r\dfrac{\sigma_Y}{\sigma_X}X$

$Y_c - \bar{Y} = r\dfrac{\sigma_Y}{\sigma_X}(X - \bar{X})$

Here $b = r\dfrac{\sigma_Y}{\sigma_X}$ is the slope of the regression line of Y on X and is also called the regression coefficient of Y on X, denoted by $b_{YX}$.

Similarly, the regression line of X on Y is

$X_c - \bar{X} = r\dfrac{\sigma_X}{\sigma_Y}(Y - \bar{Y})$

where $r\dfrac{\sigma_X}{\sigma_Y}$ is the slope of the regression line of X on Y and is also called the regression coefficient of X on Y, denoted by $b_{XY}$.
Regression Coefficients:
The regression coefficient of Y on X is $b_{YX} = r\dfrac{\sigma_Y}{\sigma_X}$ and the regression coefficient of X on Y is $b_{XY} = r\dfrac{\sigma_X}{\sigma_Y}$.

1) The correlation coefficient is the geometric mean of the two regression coefficients:

$b_{YX}\, b_{XY} = r\dfrac{\sigma_Y}{\sigma_X} \cdot r\dfrac{\sigma_X}{\sigma_Y} = r^2 \;\Rightarrow\; r = \pm\sqrt{b_{YX}\, b_{XY}}$
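A short sketch (plain Python, reusing the father/son heights from the earlier table) that computes both regression coefficients, writes out the two regression lines, and checks that r is the geometric mean of b_YX and b_XY:

```python
from math import sqrt

x = [64, 65, 66, 67, 67, 68, 68, 69, 70, 72]   # father/son heights from the earlier table
y = [62, 63, 68, 68, 66, 70, 69, 72, 68, 71]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
cov   = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
var_x = sum((a - x_bar) ** 2 for a in x) / n
var_y = sum((b - y_bar) ** 2 for b in y) / n

b_yx = cov / var_x          # regression coefficient of Y on X
b_xy = cov / var_y          # regression coefficient of X on Y
r = cov / sqrt(var_x * var_y)

print("Y on X:  Yc - {:.2f} = {:.3f} (X - {:.2f})".format(y_bar, b_yx, x_bar))
print("X on Y:  Xc - {:.2f} = {:.3f} (Y - {:.2f})".format(x_bar, b_xy, y_bar))
print(r, sqrt(b_yx * b_xy))   # r equals the geometric mean of the two regression coefficients
```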
2) Both the regression coefficients will have the same sign, i.e. both are positive or both are negative, and the correlation coefficient will have the same sign as the regression coefficients: if both regression coefficients are positive, r is positive, and if both are negative, r is negative.
3) The regression coefficient of Y on X can be written as

$b_{YX} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X^2} = \dfrac{r\,\sigma_X \sigma_Y}{\sigma_X^2} = r\dfrac{\sigma_Y}{\sigma_X}$

If $U = \dfrac{X - A}{c}$ and $V = \dfrac{Y - B}{d}$, so that $x_i = c u_i + A$, $y_i = d v_i + B$, $\bar{x} = c\bar{u} + A$ and $\bar{y} = d\bar{v} + B$, then the regression coefficient of Y on X is

$b_{YX} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X^2} = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{\frac{cd}{n}\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\frac{c^2}{n}\sum_{i=1}^{n}(u_i - \bar{u})^2} = \dfrac{d}{c}\,\dfrac{\frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})^2}$

$b_{YX} = \dfrac{d}{c}\, b_{VU}$

i.e. the regression coefficients are independent of the change of origin but not of the change of scale.
4) The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient (taking r > 0, so that dividing by r preserves the inequality), i.e. $\dfrac{b_{YX} + b_{XY}}{2} \ge r$. To verify this, note that

$\dfrac{b_{YX} + b_{XY}}{2} \ge r$ is true

$\Leftrightarrow\; b_{YX} + b_{XY} \ge 2r$ is true

$\Leftrightarrow\; r\dfrac{\sigma_Y}{\sigma_X} + r\dfrac{\sigma_X}{\sigma_Y} \ge 2r$ is true

$\Leftrightarrow\; \dfrac{\sigma_Y}{\sigma_X} + \dfrac{\sigma_X}{\sigma_Y} \ge 2$ is true

$\Leftrightarrow\; \dfrac{\sigma_Y^2 + \sigma_X^2}{\sigma_X \sigma_Y} \ge 2$ is true

$\Leftrightarrow\; \sigma_Y^2 + \sigma_X^2 \ge 2\sigma_X \sigma_Y$ is true

$\Leftrightarrow\; \sigma_Y^2 + \sigma_X^2 - 2\sigma_X \sigma_Y \ge 0$ is true

$\Leftrightarrow\; (\sigma_Y - \sigma_X)^2 \ge 0$, which is always true.
Regression:
1. Regression is a measure showing the average relationship between two variables.
2. Besides verification, it is used for prediction of one value in relation to the other given value.
3. The regression coefficient is an absolute measure.
4. There is no such nonsensical correlation.
5. It is widely useful for further mathematical treatment.
6. There is a functional relationship between the two variables.