Correlation & Regression Analysis
DR. REKHA PRASAD
IM, BHU
INTRODUCTION
Correlation and regression analysis are two statistical methods used to
explore relationships between variables, but they serve different purposes.
1. Correlation Analysis
Correlation measures the strength and direction of the linear
relationship between two variables.
The result is expressed as the correlation coefficient (r), which
ranges from -1 to +1:
r = +1: Perfect positive linear relationship.
r = -1: Perfect negative linear relationship.
r = 0: No linear relationship.
Example: If you measure hours studied and exam scores, correlation tells you whether more
study hours tend to go with higher scores; it does not by itself show that studying causes them.
2. Regression Analysis
Regression goes a step further by modeling the relationship between a
dependent variable (outcome) and one or more independent variables
(predictors).
It provides a mathematical equation, like Y = a + bX, where:
Y is the dependent variable,
X is the independent variable,
b is the slope (rate of change),
a is the intercept.
Types of Regression:
Simple Linear Regression: Examines one independent variable.
Multiple Regression: Examines several independent variables.
KEY DIFFERENCES BETWEEN CORRELATION AND REGRESSION
Feature     | Correlation Analysis            | Regression Analysis
Objective   | Measures relationship strength  | Predicts or explains values
Output      | Correlation coefficient (r)     | Regression equation
Dependency  | Symmetrical (no distinction)    | Dependent & independent defined
KARL PEARSON’S COEFFICIENT OF CORRELATION
Karl Pearson's Coefficient of Correlation is a statistical measure that
quantifies the strength and direction of the linear relationship between two
variables. It is represented by the symbol 'r' and ranges from -1 to +1.
Here's how to interpret it:
+1: Perfect positive correlation (as one variable increases, the other
increases proportionally).
-1: Perfect negative correlation (as one variable increases, the other
decreases proportionally).
0: No linear correlation (the variables may still be related in a non-linear way).
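The formula for calculating Karl Pearson’s coefficient is:
$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} $$
where:
X and Y: The two variables being compared.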
n: The number of paired observations.
∑: Denotes summation.
It is widely used in fields like economics, social sciences, and engineering.
PROBLEM
Step 1: Tabulate X², Y², and XY for each pair and total the columns
X     Y    X²      Y²     XY
150   50   22500   2500   7500
160   55   25600   3025   8800
165   58   27225   3364   9570
170   60   28900   3600   10200
175   63   30625   3969   11025
∑X = 820   ∑Y = 286   ∑X² = 134850   ∑Y² = 16458   ∑XY = 47095
Step 2: Substitute the values in the formula for correlation
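n = 5, ∑X = 820, ∑Y = 286, ∑X² = 134850, ∑Y² = 16458, ∑XY = 47095
$$ r = \frac{5(47095) - (820)(286)}{\sqrt{\left[5(134850) - (820)^2\right]\left[5(16458) - (286)^2\right]}} $$
Step 3: Simplify
$$ r = \frac{235475 - 234520}{\sqrt{(1850)(494)}} = \frac{955}{\sqrt{913900}} \approx \frac{955}{955.9812} \approx 0.998974 $$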
Result:
There is a very strong positive correlation between height (X) and weight (Y): r ≈ 0.999.
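The hand calculation above can be cross-checked with a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `pearson_r` and the variable names `heights` and `weights` are chosen here for illustration, and the code recomputes every sum directly from the X and Y columns of the table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from raw sums, following the formula used in the worked problem."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

heights = [150, 160, 165, 170, 175]  # X column of the table
weights = [50, 55, 58, 60, 63]       # Y column of the table

r = pearson_r(heights, weights)
print(round(r, 6))  # 0.998974
```

Recomputing the sums in code, instead of typing them in, guards against transcription slips in the column totals.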
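SPEARMAN’S COEFFICIENT OF CORRELATION
Spearman's Rank Correlation Coefficient, often denoted
as rs, measures the strength and direction of the monotonic relationship
between two ranked variables. It is particularly useful for ordinal
variables and for data whose relationship is monotonic but not linear. Its formula is:
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
where d is the difference between the two ranks of each observation and n is the
number of paired observations.
D 95 88
E 70 72
Find the Spearman's Rank Correlation Coefficient.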
SOLUTION:
Step 1: Rank the data
Assign ranks to the scores in each subject. Higher scores get lower rank (rank
1 for the highest score, and so on).
Step 2: Compute the rank differences d and their squares d²
Student   Rank X   Rank Y   d = Rank X − Rank Y   d²
A         2        1        1                     1
B         5        5        0                     0
C         4        4        0                     0
D         1        2        -1                    1
E         3        3        0                     0
∑d² = 1 + 0 + 0 + 1 + 0 = 2.
Step 3: Substitute into the formula
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 2}{5(25 - 1)} = 1 - \frac{12}{120} = 1 - 0.1 = 0.9 $$
Final Answer:
The Spearman’s Rank Correlation Coefficient is 0.9, indicating a strong
positive relationship between the two subjects.
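As a quick check of the steps above, the following sketch applies the rank-difference formula to the ranks from the solution table. The function name `spearman_from_ranks` is made up for this illustration, and the formula assumes there are no tied ranks, which holds for this table.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rs = 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid when no ranks are tied."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Ranks for students A..E, taken from the solution table
rank_x = [2, 5, 4, 1, 3]
rank_y = [1, 5, 4, 2, 3]

rs = spearman_from_ranks(rank_x, rank_y)
print(round(rs, 4))  # 0.9
```

Identical rankings give rs = 1 and exactly reversed rankings give rs = -1, which makes a handy sanity test for any implementation.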
LINES AND EQUATIONS OF REGRESSION
The lines and equations of regression are used in statistics to model the
relationship between two variables, typically denoted as X (independent
variable) and Y (dependent variable). There are two regression lines:
1. Regression Line of Y on X
This line predicts the values of Y based on given values of X. The equation is:
Y=a+bX
Where:
a is the intercept (value of Y when X = 0)
b is the slope or regression coefficient, which shows the change in Y for a unit
change in X.
The slope b is calculated as:
$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} $$
2. Regression Line of X on Y
This line predicts the values of X based on given values of Y. The equation is:
X=c+dY
Where:
c is the intercept (value of X when Y=0).
d is the slope or regression coefficient for X on Y.
The slope d is calculated as:
$$ d = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2} $$
Key Points to Note:
The two regression lines intersect at the point (X̄, Ȳ), where X̄ and
Ȳ are the means of X and Y, respectively.
If the correlation coefficient between X and Y (denoted as r) is perfect
(r=±1), the two lines will coincide into a single line.
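Both key points can be seen from a short derivation. It is offered as a sketch and assumes the least-squares intercepts a = Ȳ − bX̄ and c = X̄ − dȲ, together with the standard identity bd = r² linking the two slopes to the correlation coefficient.

```latex
\begin{align*}
Y = a + bX,\quad a = \bar{Y} - b\bar{X} &\;\Longrightarrow\; Y - \bar{Y} = b\,(X - \bar{X}) \\
X = c + dY,\quad c = \bar{X} - d\bar{Y} &\;\Longrightarrow\; X - \bar{X} = d\,(Y - \bar{Y}) \\[4pt]
\text{Both equations hold at } (X, Y) = (\bar{X}, \bar{Y}) &\;\Longrightarrow\; \text{the lines intersect at the means.} \\[4pt]
r = \pm 1 \;\Longrightarrow\; bd = r^2 = 1 \;\Longrightarrow\; d = \tfrac{1}{b}
&\;\Longrightarrow\; Y - \bar{Y} = \tfrac{1}{d}\,(X - \bar{X}) = b\,(X - \bar{X})
\end{align*}
```

In the last line the X-on-Y equation, solved for Y, reduces to the Y-on-X equation, so the two lines are the same line.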
PROBLEM
EXAMPLE: The following data shows the marks obtained by 5 students in
Mathematics (X) and Science (Y):
Step 1: Tabulate X×Y, X², and Y² and total each column
Student   X     Y     X×Y    X²     Y²
A         10    15    150    100    225
B         20    25    500    400    625
C         30    35    1050   900    1225
D         40    45    1800   1600   2025
E         50    50    2500   2500   2500
TOTAL     150   170   6000   5500   6600
Step 2: Regression line of Y on X
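Equation of the line: Y = a + bX, with
$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} $$
Substitute the values:
$$ b = \frac{5(6000) - (150)(170)}{5(5500) - (150)^2} = \frac{4500}{5000} = 0.9 $$
The intercept a is:
$$ a = \frac{\sum Y}{n} - b \cdot \frac{\sum X}{n} = \frac{170}{5} - 0.9 \cdot \frac{150}{5} = 34 - 27 = 7 $$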
So, the regression line of Y on X is:
Y=7+0.9X
Step 3: Regression line of X on Y
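Equation of the line: X = c + dY, with
$$ d = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2} $$
Substitute the values:
$$ d = \frac{5(6000) - (150)(170)}{5(6600) - (170)^2} = \frac{4500}{4100} \approx 1.1 $$
The intercept c, using the rounded slope d ≈ 1.1, is:
$$ c = \frac{\sum X}{n} - d \cdot \frac{\sum Y}{n} = \frac{150}{5} - 1.1 \cdot \frac{170}{5} = 30 - 37.4 = -7.4 $$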
So, the regression line of X on Y is:
X=-7.4+1.1Y
Final Answer:
Regression line of Y on X: Y=7+0.9X
Regression line of X on Y: X=−7.4+1.1Y
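The two regression lines can be reproduced in code as a check on the arithmetic. The sketch below is illustrative only: `regression_line`, `maths`, and `science` are names chosen here, and swapping the two arguments switches between the Y-on-X and X-on-Y lines. The code keeps the unrounded slope, so the X-on-Y line comes out as d ≈ 1.0976 and c ≈ −7.32, slightly different from the hand values obtained after rounding d to 1.1.

```python
def regression_line(xs, ys):
    """Return (intercept, slope) of the least-squares line predicting ys from xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

maths = [10, 20, 30, 40, 50]    # X: Mathematics marks
science = [15, 25, 35, 45, 50]  # Y: Science marks

a, b = regression_line(maths, science)  # Y on X
c, d = regression_line(science, maths)  # X on Y (arguments swapped)

print(f"Y = {a:.2f} + {b:.2f}X")  # Y = 7.00 + 0.90X
print(f"X = {c:.2f} + {d:.2f}Y")  # X = -7.32 + 1.10Y
```

Both fitted lines pass through the point of means (30, 34), which can be confirmed by evaluating a + b·30 and c + d·34.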
RELATION BETWEEN CORRELATION COEFFICIENT AND REGRESSION COEFFICIENT
The relationship between the correlation coefficient (r) and the regression
coefficient (b) is rooted in how each describes the linear relationship
between two variables.
Key Connections:
Correlation Coefficient (r):
It measures the strength and direction of the linear relationship between two
variables.
It is a standardized value that ranges from −1 to +1.
−1 indicates a perfect negative linear correlation, +1 indicates a perfect positive
linear correlation, and 0 indicates no linear correlation.
Regression Coefficient (b):
It represents the slope of the regression line in the linear regression
equation y=a+bx, where a is the intercept.
It quantifies how much the dependent variable (y) changes for a one-unit
change in the independent variable (x).
Mathematical Relationship: The correlation coefficient is related to the
regression coefficients as:
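$$ r = \pm\sqrt{b_{yx} \cdot b_{xy}} $$
where byx is the regression coefficient of y on x, and bxy is the regression
coefficient of x on y.
The two regression coefficients always share the same sign, and r takes that sign:
positive when y increases with x, negative when y decreases as x increases.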
Standard Deviations and Regression Coefficient: The regression
coefficient is influenced by the standard deviations of the variables:
$$ b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x} $$
$$ b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y} $$
Here, σx and σy are the standard deviations of x and y, respectively.
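Multiplying these two expressions shows where the square-root relation comes from. The short derivation below is a sketch that uses only the two formulas just stated.

```latex
\begin{align*}
b_{yx} \cdot b_{xy}
  &= \left(r \cdot \frac{\sigma_y}{\sigma_x}\right)\left(r \cdot \frac{\sigma_x}{\sigma_y}\right)
   = r^2 \\
\Longrightarrow\quad |r| &= \sqrt{b_{yx} \cdot b_{xy}}
\end{align*}
```

Because standard deviations are positive, each regression coefficient has the same sign as r, which fixes the choice between + and − in front of the square root.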
In summary, the correlation coefficient is a unitless measure of linear
association, while regression coefficients express the nature of the
relationship in terms of the units of the variables. Both are interrelated, as
r can be derived from the regression coefficients.
PROBLEM
EXAMPLE:
Consider the following dataset of two variables x and y:
x y
1 2
2 4
3 6
4 8
SOLUTION:
Step 1: Calculate r, byx, bxy from the given data
For this, one requires the values of ∑X, ∑Y, ∑X², ∑Y², and ∑XY.
Calculate these values
X    Y    X²   Y²    XY
1    2    1    4     2
2    4    4    16    8
3    6    9    36    18
4    8    16   64    32
10   20   30   120   60   (column totals)
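With n = 4 paired observations:
$$ r = \frac{4(60) - (10)(20)}{\sqrt{\left[4(30) - (10)^2\right]\left[4(120) - (20)^2\right]}} = \frac{240 - 200}{\sqrt{(20)(80)}} = \frac{40}{\sqrt{1600}} = \frac{40}{40} = 1 $$
$$ b_{yx} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{4(60) - (10)(20)}{4(30) - (10)^2} = \frac{40}{20} = 2 $$
$$ b_{xy} = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2} = \frac{4(60) - (10)(20)}{4(120) - (20)^2} = \frac{40}{80} = \frac{1}{2} $$
$$ b_{yx} \cdot b_{xy} = 2 \cdot \frac{1}{2} = 1 = r^2, \quad \text{so} \quad \sqrt{b_{yx} \cdot b_{xy}} = 1 = r $$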
Result:
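Verified for this dataset: r = +√(byx⋅bxy) = 1, with the positive sign because both regression coefficients are positive.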
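The same verification can be run numerically. The sketch below is illustrative: the helper `raw_terms` and the variable names are chosen here and are not part of the original material. It computes r directly, rebuilds it from the two regression coefficients, and also checks the relation byx = r·σy/σx using population standard deviations from Python's `statistics` module.

```python
from math import sqrt, copysign, isclose
from statistics import pstdev

def raw_terms(xs, ys):
    """Return n*Sxy - Sx*Sy, n*Sxx - Sx^2, n*Syy - Sy^2 from raw sums."""
    n = len(xs)
    sxy = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    sxx = n * sum(x * x for x in xs) - sum(xs) ** 2
    syy = n * sum(y * y for y in ys) - sum(ys) ** 2
    return sxy, sxx, syy

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

sxy, sxx, syy = raw_terms(x, y)
r = sxy / sqrt(sxx * syy)  # Pearson's r computed directly
b_yx = sxy / sxx           # regression coefficient of y on x
b_xy = sxy / syy           # regression coefficient of x on y

# Rebuild r from the two coefficients, taking the sign from b_yx
r_from_slopes = copysign(sqrt(b_yx * b_xy), b_yx)

print(r, b_yx, b_xy, r_from_slopes)  # 1.0 2.0 0.5 1.0
```

The sign is copied from byx because the square root alone only recovers the magnitude of r.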