Correlation
(Association between Variables)
Correlation is a type of bivariate statistic, a measure of association that describes both the type and the degree of association between two variables, in the sense that changes in the values of one variable are associated with changes in the values of the other variable. The correlation between two variables is summarized in a single number called the correlation coefficient. This coefficient is based either on the association between the rankings of the two variables or on the association between the values themselves, and it gives a concise idea of the nature of the association between the variables.
❖ Types of Correlation:
• Positive and negative correlation
Correlation is positive when the values of the two variables move in the same direction, so that increases (decreases) in one variable are associated with increases (decreases) in the other variable.
Correlation is negative when the values of the two variables move in opposite directions, so that increases (decreases) in one variable are associated with decreases (increases) in the other variable.
• Linear and non-linear correlation
Correlation is linear when the amount of change in one variable bears a constant ratio to the amount of change in the other, so that the points of a scatter diagram cluster around a straight line; otherwise the correlation is non-linear (curvilinear).
• Simple and Multiple Correlation
Simple Correlation - When we consider only two variables and check the correlation between
them it is said to be Simple Correlation.
Multiple Correlation - When we consider three or more variables for correlation simultaneously, it is termed Multiple Correlation. For example, if we study the relationship between poverty, infant mortality and education, it is a problem of multiple correlation.
❖ Correlation coefficient
The correlation coefficient (𝑟) is a summary measure that describes the extent (degree/strength) of the association between two ordinal, interval or ratio level variables. The correlation coefficient is scaled so that it always lies between −1 and +1. A correlation coefficient close to +1 indicates a positive relationship, and one close to −1 indicates a negative relationship between the two variables. A correlation coefficient close to 0, whether positive or negative, implies little or no relationship between the two variables.
For interval or ratio level scales, the most commonly used correlation coefficient is Pearson’s 𝑟.
For ordinal scales, the correlation coefficient which is usually calculated is Spearman’s rho.
❖ Methods of studying Correlation
• Scatter Diagram
A scatter diagram, which plots paired values of two variables, is the simplest method of studying the correlation between them. The diagram shows the values of the two variables X and Y, along with the way in which the two variables relate to each other. However, this method cannot measure the exact degree of correlation between the variables.
For purposes of drawing a scatter diagram, and determining the correlation coefficient, it does
not matter which of the two variables is the X variable, and which is Y. Correlation methods are
symmetric with respect to the two variables, with no indication of causation or direction of
influence.
Example: Consider the salary data stored in Salary_Data.csv, containing ‘Years of experience’ (say variable X) and ‘Salary’ (say variable Y). Looking at the following scatter diagram for this data, the relationship between X and Y can be seen at a glance. The diagram indicates a generally positive correlation between X and Y: larger (smaller) values of X are associated with larger (smaller) values of Y. That is, as the number of years of experience increases (decreases), the salary generally also increases (decreases).
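As a sketch, such a scatter diagram can be drawn with matplotlib. The small data set below is purely illustrative (it stands in for the contents of Salary_Data.csv, whose exact values and column layout are not shown here):

```python
# Sketch of a scatter diagram for experience-vs-salary data.
# NOTE: the (years, salary) pairs below are illustrative assumptions,
# not the actual contents of Salary_Data.csv.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

years = [1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 10.3]      # X: years of experience
salary = [39000, 43000, 54000, 61000, 81000, 91000, 109000, 122000]  # Y: salary

fig, ax = plt.subplots()
ax.scatter(years, salary)                # one point per (X, Y) pair
ax.set_xlabel("Years of experience (X)")
ax.set_ylabel("Salary (Y)")
ax.set_title("Scatter diagram: positive association")
fig.savefig("scatter.png")
```

With real data, the two lists would instead be read from the CSV file; the upward drift of the points is what signals the positive correlation.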
• Karl Pearson’s co-efficient of correlation (Pearson's 𝒓)
The Pearson product-moment correlation coefficient, known as Pearson’s 𝒓, is a widely used correlation coefficient. Pearson's 𝑟 summarizes the relationship between two variables that have a straight-line (linear) relationship with each other.
Suppose that there are two variables X and Y, with $n$ paired observations $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$. Let the mean of X be $\bar{x}$ and the mean of Y be $\bar{y}$. Pearson's $r$ is defined as follows:

$$r = \mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

In population terms,

$$\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{s.d.}(X)\,\mathrm{s.d.}(Y)} = \frac{E[(X - m_X)(Y - m_Y)]}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - (E(X))^2}\,\sqrt{E(Y^2) - (E(Y))^2}}$$
Or equivalently,
$$r = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{n \sum x_i^2 - (\sum x_i)^2}\,\sqrt{n \sum y_i^2 - (\sum y_i)^2}}$$
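As a minimal sketch, the defining (deviation-from-mean) formula and the equivalent computational (raw-sums) formula can be checked against each other in plain Python; the small data set is purely illustrative:

```python
import math

def pearson_r(x, y):
    """Pearson's r from the deviation-from-mean definition."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = (math.sqrt(sum((xi - xbar) ** 2 for xi in x))
           * math.sqrt(sum((yi - ybar) ** 2 for yi in y)))
    return num / den

def pearson_r_raw(x, y):
    """Equivalent computational formula using raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi ** 2 for xi in x)
    syy = sum(yi ** 2 for yi in y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2))

x = [1, 2, 3, 4, 5]   # illustrative data, not the salary file
y = [2, 4, 5, 4, 5]
r1, r2 = pearson_r(x, y), pearson_r_raw(x, y)
```

Both formulas give the same value (here about 0.775), which is why the raw-sums form is convenient for hand computation.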
Interpretation of 𝒓:
The value of the correlation coefficient always lies between −1 and +1, i.e., −1 ≤ 𝑟 ≤ 1. Here 𝑟 = +1 means a perfect positive correlation between the variables X and Y, whereas 𝑟 = −1 means a perfect negative correlation between the variables. When 𝑟 = 0, there is no linear relationship between the two variables.
A value of 𝑟 close to +1 indicates strong positive correlation (the values of X and Y are strongly positively associated), 𝑟 close to −1 indicates that the values of X and Y are strongly negatively associated, and 𝑟 close to 0, on either the positive or the negative side, means there is little association between X and Y.
(Note: While calculating 𝑟, the influence of outliers should be kept to a minimum, or the outliers removed entirely.)
For example, for the salary data stored in Salary_Data.csv, the correlation coefficient between ‘Years of experience’ (X) and ‘Salary’ (Y) is 𝑟 = 0.978. So the variables X and Y have a strong positive correlation.
Test of Significance for 𝒓:
The sample data are used to compute 𝑟, the correlation coefficient for the sample. The reliability of the linear model also depends on how many observed data points are in the sample. If we had data for the entire population, we could find the population correlation coefficient; since we usually do not, the sample correlation coefficient (𝒓) is our estimate of the unknown population correlation
coefficient.
Let us denote the true correlation coefficient that would be observed if all population values were
obtained by 𝝆 (rho).
We need to perform a hypothesis test of the "significance of the correlation coefficient" to decide
whether the linear relationship in the sample data is strong enough to use to model the
relationship in the population.
The hypothesis test lets us decide whether the value of the population correlation coefficient 𝜌 is
“close to zero” or “significantly different from zero” based on the sample correlation coefficient
𝑟 and the sample size 𝑛.
The null hypothesis is
𝐻0 : 𝜌 = 0
The alternative hypothesis could be any one of three forms: 𝐻𝑎 : 𝜌 ≠ 0, 𝐻𝑎 : 𝜌 > 0 or 𝐻𝑎 : 𝜌 < 0.
The Null Hypothesis is that the population correlation coefficient IS NOT significantly different from zero, i.e. there is no significant linear relationship (correlation) between the variables X and Y. The Alternate Hypothesis is that the population correlation coefficient IS significantly different from zero, i.e. there is a significant linear relationship (positive or negative) between the variables.
Test statistic: As various samples, each of size 𝑛, are drawn, the value of 𝑟 varies from sample to sample. The sampling distribution of 𝑟 is approximated by a 𝑡 distribution with 𝑛 − 2 degrees of freedom. The standard deviation of 𝑟 can be shown to be approximated by

$$\sqrt{\frac{1 - r^2}{n - 2}}$$

where 𝑟 is the sample (observed) correlation coefficient.
Then for the null hypothesis 𝐻0 : 𝜌 = 0, the standardized 𝑡 statistic can be written as
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = r\sqrt{\frac{n - 2}{1 - r^2}}$$
Based on the value of the test statistic 𝑡, we can calculate the 𝑝-value, or equivalently we can check whether this 𝑡 value falls in the critical region. Note that the test statistic 𝑡 has the same sign as the correlation coefficient 𝑟.
For example, for the salary data Salary_Data.csv, the observed correlation coefficient between ‘Years of experience’ (X) and ‘Salary’ (Y) is 𝑟 = 0.978 and there are 𝑛 = 30 observations. We want to test the significance of the correlation between these variables. In line with the observed correlation coefficient, we take the research (alternative) hypothesis to be that the two variables are positively related (𝐻𝑎: 𝜌 > 0), i.e. the true correlation coefficient 𝜌 is significantly greater than zero, against the null hypothesis that there is no relationship between the two variables (𝐻0: 𝜌 = 0).
The value of standardized 𝑡 statistic is obtained as
$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = 24.8$$
For a one-tailed test with 𝑛 − 2 = 28 degrees of freedom, the critical value is 𝑡 = 2.4671 at the 0.01 level of significance. The observed 𝑡 value falls in the region of rejection of 𝐻0 (since 24.8 > 2.4671), and hence the null hypothesis is rejected. Thus the alternative hypothesis, that the two variables are positively correlated, is accepted.
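The arithmetic of this test can be sketched in a few lines of Python; the critical value 2.4671 is the tabled value quoted above, and a statistics library could supply the exact 𝑝-value instead:

```python
import math

r, n = 0.978, 30   # observed correlation coefficient and sample size

# Standardized t statistic: t = r * sqrt((n - 2) / (1 - r^2))
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

t_crit = 2.4671    # one-tailed critical value, df = 28, alpha = 0.01 (from tables)
reject_h0 = t_stat > t_crit
```

With these numbers t comes out at about 24.8, far beyond the critical value, so the null hypothesis of no correlation is rejected.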
• Spearman’s rank correlation coefficient (Spearman’s rho)
The Pearson’s correlation coefficient can be used for interval or ratio level scales. When a
variable is measured at the ordinal level, then we need to use a correlation coefficient designed
for an ordinal level scale. If a scale is ordinal, it is possible to rank the different values of the
variable.
In order to compute correlation coefficient for two such variables whose values have been
ranked, Spearman considered the numerical differences in the respective ranks. The correlation
coefficient so obtained is called rank correlation coefficient.
For instance, suppose 10 different universities are ranked in the ‘Management’ and ‘Engineering’ categories, and we wish to know the correlation between the two rankings; then Spearman’s rank correlation is the appropriate method. If actual data are given instead of ranks, we must first rank them. Ranks can be assigned by ordering the values from low to high, or from high to low.
Suppose that there are two variables X and Y. For each observed case, the rank for each of the
variables X and Y is determined. For each case i, the difference in the rank on variables X and on
variable Y is determined, and given the symbol 𝐷𝑖 . If there are 𝑛 cases, the Spearman rank
correlation between X and Y is defined as
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$$
The true Spearman correlation coefficient is denoted by 𝜌𝑠 .
Let’s consider the following data and calculate Spearman’s rank correlation coefficient. Score
out of 100 in Management category (X) and Engineering category (Y) for 10 universities are
given.
Universities   X    Y    Rank on X   Rank on Y   Dᵢ    Dᵢ²
U1 92 78 2 5 -3 9
U2 95 80 1 4 -3 9
U3 82 83 4 3 1 1
U4 67 72 9 7 2 4
U5 88 88 3 1 2 4
U6 65 70 10 8 2 4
U7 78 85 5 2 3 9
U8 70 75 8 6 2 4
U9 72 69 7 9 -2 4
U10 75 66 6 10 -4 16
We have
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 64}{10(100 - 1)} = 0.61$$
So the result indicates that there is a moderate positive correlation between the two rankings.
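As a sketch, the table's computation can be reproduced in Python; the ranking helper below assumes no tied scores, which holds for the data above:

```python
def ranks_desc(values):
    """Rank values from high (rank 1) to low; assumes no ties."""
    return [1 + sum(v > x for v in values) for x in values]

x = [92, 95, 82, 67, 88, 65, 78, 70, 72, 75]   # Management scores (X)
y = [78, 80, 83, 72, 88, 70, 85, 75, 69, 66]   # Engineering scores (Y)

rx, ry = ranks_desc(x), ranks_desc(y)
d2_sum = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of squared rank differences
n = len(x)
rs = 1 - 6 * d2_sum / (n * (n ** 2 - 1))             # Spearman's rank correlation
```

Here the sum of squared differences is 64 and rs comes out at about 0.61, matching the hand calculation in the table.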