Lecture 7
Contents
7.1 The Simple Linear Regression Model
The Regression Model and the Regression Equation
The Estimated Regression Equation
7.2 The Least Squares Method
7.3 The Coefficient of Determination
The Correlation
Learning Objectives
• Use regression analysis to develop an equation relating two variables.
• Carry out predictions using the estimated regression equation.
• Compute and interpret the correlation between the two variables.
Introduction
In decision making, managers often rely on intuition to judge how two or more
variables are related. If data can be obtained, however, a statistical procedure called
regression analysis can be used to develop an equation showing how the variables
are related.
Simple linear regression is the simplest type of regression analysis, involving one
independent variable and one dependent variable. The relationship between them is
approximated by a straight line.
The equation that describes how 𝑦 is related to 𝑥 and an error term is called the
regression model:
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀
In this model, 𝑦 is a linear function of 𝑥 plus an error term. 𝛽0 and 𝛽1 are referred to
as the parameters of the model, and 𝜀 is a random variable referred to as the error term.
It accounts for the variability in 𝑦 that cannot be explained by the linear relationship
between 𝑥 and 𝑦.
If we assume that the mean or expected value of 𝜀 is zero, then:
𝐸(𝑦) = 𝛽0 + 𝛽1 𝑥
The graph of the simple linear regression equation is a straight line; 𝛽0 is the 𝑦
intercept of the regression line, 𝛽1 is the slope, and 𝐸(𝑦) is the mean or expected
value of 𝑦 for a given value of 𝑥.
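In practice the parameters 𝛽0 and 𝛽1 are unknown and are estimated from sample data by
the sample statistics 𝑏0 and 𝑏1. The graph of the estimated simple linear regression
equation is called the estimated regression line:
𝑦̂ = 𝑏0 + 𝑏1 𝑥
Where: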
𝑏0 is the 𝑦 intercept
𝑏1 is the slope
𝑦̂ is the estimated value of y for a given value of 𝑥.
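$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
Or
$$b_1 = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}$$
Or
$$b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$$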
𝑏0 = 𝑦̅ − 𝑏1 𝑥̅
Where:
𝑥𝑖 = value of the independent variable for the ith observation.
𝑦𝑖 = value of the dependent variable for the ith observation.
𝑥̅ = mean value for the independent variable.
𝑦̅ = mean value for the dependent variable.
𝑛 = total number of observations.
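The equivalent forms of 𝑏1 can be checked numerically. The following is a minimal Python sketch and not part of the original notes: the function names are illustrative, and the five data points are borrowed from the coefficient of determination example later in this lecture.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def slope_deviation_form(x, y):
    """b1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)."""
    x_bar, y_bar = mean(x), mean(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator


def slope_computational_form(x, y):
    """b1 = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


def intercept(x, y, b1):
    """b0 = y_bar - b1 * x_bar."""
    return mean(y) - b1 * mean(x)


# Data taken from the later example in this lecture.
x = [2, 3, 5, 1, 8]
y = [25, 25, 20, 30, 16]

b1 = slope_deviation_form(x, y)
b0 = intercept(x, y, b1)
print(round(b1, 3), round(b0, 3))  # -1.877 30.331
```

Both slope functions return the same value up to floating-point rounding. The deviation form is usually easier to interpret, while the computational form needs only running sums.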
Example
Below are data a sales manager has collected on annual sales and years of
experience.
Table 1
a. Estimate a regression equation that can be used to predict annual sales given
the years of experience.
b. Use the estimated regression equation to predict annual sales for a salesperson
with nine years of experience.
Solution:
Table 2: Calculations for the least squares estimated regression equation.
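a.
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{568}{142} = 4$$
$$\bar{x} = \frac{\sum x_i}{n} = \frac{70}{10} = 7$$
$$\bar{y} = \frac{\sum y_i}{n} = \frac{1080}{10} = 108$$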
→ 𝑏0 = 𝑦̅ − 𝑏1 𝑥̅
= 108 − (4)(7)
= 80.
∴ 𝑦̂ = 𝑏0 + 𝑏1 𝑥
= 80 + 4𝑥.
b. For 𝑥 = 9
→ 𝑦̂ = 80 + (4)(9)
= 116.
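The arithmetic of this solution can be reproduced from the summary quantities quoted above. This is only a sketch: the variable names and the helper `predict_sales` are illustrative, and the individual observations from Table 2 are not used because only the totals appear in the solution.

```python
# Summary quantities quoted in the solution above.
sum_cross_dev = 568          # sum of (x_i - x_bar)(y_i - y_bar)
sum_sq_dev_x = 142           # sum of (x_i - x_bar)^2
sum_x, sum_y, n = 70, 1080, 10

b1 = sum_cross_dev / sum_sq_dev_x        # slope
b0 = sum_y / n - b1 * (sum_x / n)        # intercept: y_bar - b1 * x_bar


def predict_sales(years_of_experience):
    """Predicted annual sales from the estimated regression equation."""
    return b0 + b1 * years_of_experience


print(b1, b0, predict_sales(9))  # 4.0 80.0 116.0
```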
Coefficient of Determination
In the previous example the estimated regression equation 𝑦̂ = 80 + 4𝑥 is used
to estimate the linear relationship between the years of experience (𝑥) and the
annual sales (𝑦). A question now is: How well does the estimated regression
equation fit the data?
The coefficient of determination provides a measure of the goodness of fit for the
estimated regression equation.
Definition
The difference between the observed value of the dependent variable, 𝑦𝑖, and the
estimated value of the dependent variable, 𝑦̂𝑖, obtained from the estimated regression
equation is called the ith residual; that is, for the ith observation, the residual is 𝑦𝑖 – 𝑦̂𝑖.
Note:
The ith residual represents the error in using 𝑦̂𝑖 to estimate 𝑦𝑖. The sum of the squares of
these residuals (errors) is known as the sum of squares due to error and is denoted by
𝑆𝑆𝐸.
Sum of squares due to error: 𝑆𝑆𝐸 = ∑(𝑦𝑖 − 𝑦̂𝑖)²
Now suppose we want to estimate annual sales without knowledge of any related
variable (e.g. years of experience). We would use the sample mean as an estimate of
sales for any given salesperson. For the ith salesperson in the sample, the
difference 𝑦𝑖 – 𝑦̅ provides a measure of the error involved in using 𝑦̅ to estimate
sales. The corresponding sum of squares is called the total sum of squares and is
denoted by 𝑆𝑆𝑇.
Total sum of squares: 𝑆𝑆𝑇 = ∑(𝑦𝑖 − 𝑦̅)²
The estimated regression equation would provide a perfect fit if every value of the
dependent variable 𝑦𝑖 happened to lie on the estimated regression line, resulting in
𝑆𝑆𝐸 = 0.
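The part of the total variation that is explained by the estimated regression equation
is called the sum of squares due to regression, 𝑆𝑆𝑅 = ∑(𝑦̂𝑖 − 𝑦̅)². The three sums of
squares are related by 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸.
For a perfect fit, 𝑆𝑆𝑅 must equal 𝑆𝑆𝑇 and the ratio 𝑆𝑆𝑅/𝑆𝑆𝑇 must equal one. The ratio
𝑆𝑆𝑅/𝑆𝑆𝑇, which takes values between zero and one, is used to evaluate the goodness of fit
for the estimated regression equation. This ratio is called the
coefficient of determination and is denoted by 𝑟², where 𝑟² = 𝑆𝑆𝑅/𝑆𝑆𝑇.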
When the coefficient of determination is expressed as a percentage, 𝑟² can be
interpreted as the percentage of the total sum of squares that can be explained by
using the estimated regression equation.
Note: The total sum of squares can be computed by using the following alternative
formula.
$$SST = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$
The sum of squares due to regression can be computed by using the following
alternative formula.
$$SSR = \frac{\left[\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}\right]^2}{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}$$
Example
Consider the following data.
𝑥𝑖 2 3 5 1 8
𝑦𝑖 25 25 20 30 16
The estimated regression equation for these data is 𝑦̂ = 30.33 − 1.88𝑥.
a. Compute 𝑆𝑆𝐸, 𝑆𝑆𝑇, and 𝑆𝑆𝑅.
b. Compute the coefficient of determination 𝑟².
c. Compute the sample correlation coefficient.
Solution
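The sample means are 𝑥̅ = 3.8 and 𝑦̅ = 23.2.
c. The sample correlation coefficient is
$$r = (\text{sign of } b_1)\sqrt{r^2}$$
But
$$b_1 = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}} = \frac{383 - \dfrac{(19)(116)}{5}}{103 - \dfrac{(19)^2}{5}} = -1.877$$
∴ 𝑟 = −√0.945 = −0.972.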
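As a check on this example, the sums of squares, 𝑟² and 𝑟 can be computed directly from the five observations. The sketch below uses only the Python standard library and keeps the unrounded least squares coefficients, so its results may differ in the last digit from hand calculations based on the rounded equation 𝑦̂ = 30.33 − 1.88𝑥. All variable names are illustrative.

```python
import math

x = [2, 3, 5, 1, 8]
y = [25, 25, 20, 30, 16]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates (deviation form).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation

r_squared = ssr / sst
# The correlation coefficient takes the sign of the slope b1.
r = math.copysign(math.sqrt(r_squared), b1)

print(round(sse, 2), round(sst, 2), round(ssr, 2))  # 6.33 114.8 108.47
print(round(r_squared, 3), round(r, 3))             # 0.945 -0.972
```

Note that 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸 holds up to floating-point rounding, which is a convenient sanity check on hand calculations.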
𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0
Test statistic:
$$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{b_1}{s_{b_1}} \quad (\text{when } \beta_1 = 0)$$
Where 𝑠𝑏1 is the estimated standard deviation of 𝑏1, defined as:
$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$$
If the error term (𝜀) is normally distributed, the test statistic has a Student t-
distribution with n − 2 degrees of freedom. Under 𝐻0, 𝛽1 = 0, so a t-statistic greater
than 2 in absolute value is evidence against 𝐻0 (rule of thumb).
III. Rejection rule: Reject H0 if |t-stat| > t-critical, where t-critical is the value
t(α/2) from the Student t-distribution with n − 2 degrees of freedom. You can also apply
the rule of thumb, which states: reject H0 if |t-stat| is greater than 2 and n ≥ 30.
NB: You can also apply the p-value approach: reject H0 if p-value ≤ α.
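The t-test can be illustrated with the five-observation data set used in the coefficient of determination example. The sketch below makes two assumptions that the notes do not state explicitly: the standard error of the estimate is taken as 𝑠𝜀 = √(𝑆𝑆𝐸/(𝑛 − 2)), and the critical value 3.182 is read from a standard t table for α = 0.05 (two-tailed) with 3 degrees of freedom. It also uses the identity (𝑛 − 1)𝑠𝑥² = ∑(𝑥𝑖 − 𝑥̅)².

```python
import math

x = [2, 3, 5, 1, 8]
y = [25, 25, 20, 30, 16]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)   # equals (n - 1) * s_x^2
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Assumption: standard error of the estimate, s_eps = sqrt(SSE / (n - 2)).
s_eps = math.sqrt(sse / (n - 2))
s_b1 = s_eps / math.sqrt(s_xx)

t_stat = b1 / s_b1   # test statistic under H0: beta1 = 0

# Assumption: critical value taken from a t table (alpha = 0.05, two-tailed, 3 df).
T_CRITICAL = 3.182
reject_h0 = abs(t_stat) > T_CRITICAL

print(round(t_stat, 2), reject_h0)  # -7.17 True
```

Because the sample is small (n = 5), the table value is used here rather than the rule of thumb, which the notes restrict to n ≥ 30.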
F-test
An F-test, based on the F probability distribution, can also be used to test for
significance in regression. In fact, 𝐹 = 𝑡² for a simple regression model. The F-test
provides the same conclusion as the t-test. The F-test can be used to test for an
overall significant relationship of the model.
The test statistic is given as:
$$F = \frac{MSR}{MSE}$$
Where MSR is the mean square due to regression, or simply the mean square
regression, obtained as $MSR = \frac{SSR}{1}$; and MSE is the mean square error, obtained as
$MSE = \frac{SSE}{n-2}$.
We can construct an ANOVA table to summarise the result of the F test for
significance in regression.
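Source of Variation         Sum of Squares   Degrees of Freedom   Mean Square          F
Regression                  SSR              1                    MSR = SSR/1          MSR/MSE
Error                       SSE              n – 2                MSE = SSE/(n – 2)
Total variation in y        SST              n – 1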
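The ANOVA quantities and the relation 𝐹 = 𝑡² can be verified on the same five-observation data set. This is a sketch under the same assumption as before, namely 𝑠𝜀² = 𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 2); the variable names are illustrative.

```python
import math

x = [2, 3, 5, 1, 8]
y = [25, 25, 20, 30, 16]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
sst = ssr + sse

msr = ssr / 1          # one independent variable, so 1 degree of freedom
mse = sse / (n - 2)    # n - 2 degrees of freedom for error
f_stat = msr / mse

# t statistic for the slope, used to confirm F = t^2 in simple regression.
t_stat = b1 / (math.sqrt(mse) / math.sqrt(s_xx))

# Print the ANOVA table row by row.
print(f"{'Source':<12}{'SS':>10}{'df':>5}{'MS':>10}{'F':>8}")
print(f"{'Regression':<12}{ssr:>10.2f}{1:>5}{msr:>10.2f}{f_stat:>8.2f}")
print(f"{'Error':<12}{sse:>10.2f}{n - 2:>5}{mse:>10.2f}")
print(f"{'Total':<12}{sst:>10.2f}{n - 1:>5}")
```

For these data 𝐹 is approximately 51.4, which equals the square of the t-statistic for the slope.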
UNIT SUMMARY
I. The graph of the estimated simple linear regression equation is called the
estimated regression line, 𝑦̂ = 𝑏0 + 𝑏1 𝑥.
Where:
𝑏0 is the 𝑦 intercept
𝑏1 is the slope
𝑦̂ is the estimated value of y for a given value of 𝑥.
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
Or
$$b_1 = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}$$
𝑏0 = 𝑦̅ − 𝑏1 𝑥̅ .
II. The coefficient of determination provides a measure of the goodness of fit for
the estimated regression equation.
III. The sum of the squares of the residuals (errors) 𝑦𝑖 – 𝑦̂𝑖 is known as the sum of
squares due to error and is denoted by 𝑆𝑆𝐸.
IV. The difference 𝑦𝑖 – 𝑦̅ provides a measure of the error involved in using 𝑦̅ to
estimate 𝑦𝑖. The corresponding sum of squares is called the total sum of squares and is
denoted by 𝑆𝑆𝑇.
V. The sum of squares explained by the estimated regression equation is called the sum
of squares due to regression and is denoted by 𝑆𝑆𝑅.
VI. The ratio 𝑆𝑆𝑅/𝑆𝑆𝑇, which takes values between zero and one, is used to evaluate
the goodness of fit for the estimated regression equation. This ratio is called the
coefficient of determination and is denoted by 𝑟².
SELF-TEST EXERCISES
Question 1
The following sample data contain the number of years spent at university and the
current annual income for a random sample of electronic equipment salespeople
employed by HI-Tech Corporation.
Years spent at University    Annual Income (in thousands)
2                            20
2                            23
3                            25
4                            26
3                            28
1                            29
4                            27
3                            30
4                            33
4                            35
Required:
a. Which variable is the dependent variable? Which is the independent variable?
b. Determine the least squares estimated regression line.
c. Predict the annual income of a salesperson with one year at university.
d. Calculate the coefficient of determination.
e. Calculate the sample correlation coefficient between income and years spent at
university. Interpret the value you obtain.
Question 2
Below you are given information on a woman's age and her annual expenditure on
purchase of books.
Age Annual Expenditure ($)
18 210
22 180
21 220
28 280
Required:
a. Develop the least squares regression equation.
b. Compute SSR, SSE and SST.
c. Compute the coefficient of determination.
d. Compute the sample correlation coefficient.