Lecture 7

The document provides an overview of simple linear regression, including the regression model, least squares method, and coefficient of determination. It explains how to develop a regression equation to relate two variables, make predictions, and compute correlation. Additionally, it includes practical examples and calculations to illustrate the concepts.


Topic: Simple Linear Regression

Contents
7.1 The Simple Linear Regression Model
The Regression Model and the Regression Equation
The Estimated Regression Equation
7.2 The Least Squares Method
7.3 The Coefficient of Determination
The Correlation

Learning Objectives
• Use regression analysis to develop an equation relating two variables.
• Carry out predictions using the estimated regression Equation.
• Compute and interpret the correlation between the two variables.

Introduction
In decision making, managers usually rely on intuition to judge how two or more
variables are related. If, however, data can be obtained, a statistical procedure
called regression analysis can be used to develop an equation showing how the
variables are related.

In regression terminology, the variable being predicted is called the dependent
variable. The variable or variables being used to predict the value of the dependent
variable are called the independent variable(s). In statistical notation, 𝑦 denotes the
dependent variable and 𝑥 denotes the independent variable.

Simple linear regression is the simplest type of regression analysis involving one
independent variable and one dependent variable. Their relation is approximated
by a straight line.

The Simple Linear Regression Model


The Regression Model and the Regression Equation

The equation that describes how 𝑦 is related to 𝑥 and an error term is called the
regression model:

𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀

𝑦 is a linear function of 𝑥; 𝛽0 and 𝛽1 are referred to as parameters of the model,
and 𝜀 is a random variable referred to as the error term. It accounts for the
variability in 𝑦 that cannot be explained by the linear relationship between 𝑥 and 𝑦.
If we assume that the mean, or expected value, of 𝜀 is zero, then:

𝐸(𝑦) = 𝛽0 + 𝛽1 𝑥
The graph of the simple linear regression equation is a straight line: 𝛽0 is the 𝑦
intercept of the regression line, 𝛽1 is the slope, and 𝐸(𝑦) is the mean or expected
value of 𝑦 for a given value of 𝑥.

The Estimated Regression Equation


Since 𝛽0 and 𝛽1 are usually unknown parameters, sample statistics denoted 𝑏0 and
𝑏1 are computed as estimates of 𝛽0 and 𝛽1. Substituting the values of the sample
statistics 𝑏0 and 𝑏1 in the regression equation, we obtain the estimated regression
equation

𝑦̂ = 𝑏0 + 𝑏1 𝑥 .

The graph of the estimated simple linear regression equation is called the estimated
regression line,
Where:
𝑏0 is the 𝑦 intercept
𝑏1 is the slope
𝑦̂ is the estimated value of y for a given value of 𝑥.

The Least Squares Method


For the estimated regression line to provide a good fit to the data, we want the
difference between the observed 𝑦𝑖 values and the estimated 𝑦̂𝑖 values to be small.
This method uses the sample data to provide the values of 𝑏0 and 𝑏1 that minimize
the sum of the squares of the deviations between the observed values of the
dependent variable 𝑦𝑖 and the estimated values of the dependent variable ̂𝑦𝑖 .

𝑏1 = ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) / ∑(𝑥𝑖 − 𝑥̅)²

Or 𝑏1 = [∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖 ∑𝑦𝑖)/𝑛] / [∑𝑥𝑖² − (∑𝑥𝑖)²/𝑛]

Or 𝑏1 = [𝑛∑𝑥𝑖𝑦𝑖 − ∑𝑥𝑖 ∑𝑦𝑖] / [𝑛∑𝑥𝑖² − (∑𝑥𝑖)²]

𝑏0 = 𝑦̅ − 𝑏1 𝑥̅

Where:
𝑥𝑖 = value of the independent variable for the ith observation.

𝑦𝑖 = value of the dependent variable for the ith observation.
𝑥̅ = mean value for the independent variable.
𝑦̅ = mean value for the dependent variable.
𝑛 = total number of observations.
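As an illustration, the least squares formulas above can be written out directly in Python (a minimal sketch; the function name and data layout are our own):

```python
def least_squares(x, y):
    """Return (b0, b1) for the estimated regression line y_hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n          # mean of the independent variable
    y_bar = sum(y) / n          # mean of the dependent variable
    # b1 = sum of cross-deviations divided by sum of squared x-deviations
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar     # intercept from b0 = y_bar - b1 * x_bar
    return b0, b1
```

Applied to the salesperson data in the example that follows, this returns 𝑏0 = 80 and 𝑏1 = 4.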

Example
Below are data a sales manager has collected on annual sales and years of
experience.
Table 1

Salesperson    Years of Experience    Annual Sales (N$1000s)
1 1 80
2 3 97
3 4 92
4 4 102
5 6 103
6 8 111
7 10 119
8 10 123
9 11 117
10 13 136

a. Estimate a regression equation that can be used to predict annual sales given
the years of experience.
b. Use the estimated regression equation to predict annual sales for a sales person
with nine years of experience.

Solution:
Table 2: Calculations for the least squares estimated regression equation.

Salesperson   Yrs of Exp (𝑥)   Annual Sales (𝑦)   (𝑥𝑖 − 𝑥̅)   (𝑦𝑖 − 𝑦̅)   (𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅)   (𝑥𝑖 − 𝑥̅)²
1             1                80                 -6          -28          168                   36
2             3                97                 -4          -11          44                    16
3             4                92                 -3          -16          48                    9
4             4                102                -3          -6           18                    9
5             6                103                -1          -5           5                     1
6             8                111                1           3            3                     1
7             10               119                3           11           33                    9
8             10               123                3           15           45                    9
9             11               117                4           9            36                    16
10            13               136                6           28           168                   36
              Mean: 7          Mean: 108          Sum: 0      Sum: 0       Sum: 568              Sum: 142

a. 𝑏1 = ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) / ∑(𝑥𝑖 − 𝑥̅)² = 568/142 = 4.

𝑥̅ = ∑𝑥𝑖/𝑛 = 70/10 = 7
𝑦̅ = ∑𝑦𝑖/𝑛 = 1080/10 = 108
→ 𝑏0 = 𝑦̅ − 𝑏1 𝑥̅
= 108 − (4)(7)
= 80.

∴ 𝑦̂ = 𝑏0 + 𝑏1 𝑥
= 80 + 4𝑥.

b. For 𝑥 = 9
→ 𝑦̂ = 80 + (4)(9)
= 116.
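The hand calculation above can be checked with a short script (a sketch; the variable names are our own):

```python
x = [1, 3, 4, 4, 6, 8, 10, 10, 11, 13]               # years of experience
y = [80, 97, 92, 102, 103, 111, 119, 123, 117, 136]  # annual sales (N$1000s)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n                # 7 and 108
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)            # 568 / 142 = 4
b0 = y_bar - b1 * x_bar                              # 108 - 4(7) = 80
y_hat_9 = b0 + b1 * 9                                # predicted sales at x = 9
```

This reproduces the estimated equation 𝑦̂ = 80 + 4𝑥 and the prediction of 116 (i.e. N$116 000) for nine years of experience.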

Coefficient of Determination
In the previous example the estimated regression equation 𝑦̂ = 80 + 4𝑥 was used
to estimate the linear relationship between years of experience (𝑥) and annual
sales (𝑦). A question now is: how well does the estimated regression equation fit
the data?

The coefficient of determination provides a measure of the goodness of fit for the
estimated regression equation.

Definition
The difference between the observed value of the dependent variable 𝑦𝑖 and the
estimated value of the dependent variable 𝑦̂𝑖 obtained from the estimated regression
equation is called the ith residual; that is, for the ith observation, the residual is 𝑦𝑖 − 𝑦̂𝑖.

Note:
The ith residual represents the error in using 𝑦̂𝑖 to estimate 𝑦𝑖. The sum of squares
of these residuals, or errors, is known as the sum of squares due to error and is
denoted by 𝑆𝑆𝐸.

Sum of Squares Due to Error: 𝑆𝑆𝐸 = ∑(𝑦𝑖 − 𝑦̂𝑖)² = ∑(𝑦𝑖 − 𝑏0 − 𝑏1𝑥𝑖)²

Table 3 Calculation of 𝑺𝑺𝑬.

Salesperson   Yrs of Exp (𝑥)   Annual Sales (𝑦)   𝑦̂𝑖 = 80 + 4𝑥𝑖   𝑦𝑖 − 𝑦̂𝑖   (𝑦𝑖 − 𝑦̂𝑖)²
1             1                80                 84               -4          16
2             3                97                 92               5           25
3             4                92                 96               -4          16
4             4                102                96               6           36
5             6                103                104              -1          1
6             8                111                112              -1          1
7             10               119                120              -1          1
8             10               123                120              3           9
9             11               117                124              -7          49
10            13               136                132              4           16
                                                                   SSE = 170
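The 𝑆𝑆𝐸 column above can be reproduced with a short loop (a sketch; variable names are our own):

```python
x = [1, 3, 4, 4, 6, 8, 10, 10, 11, 13]
y = [80, 97, 92, 102, 103, 111, 119, 123, 117, 136]

b0, b1 = 80, 4                                       # estimates from the example above
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # y_i - y_hat_i
SSE = sum(e ** 2 for e in residuals)                 # sum of squared residuals
```

This confirms 𝑆𝑆𝐸 = 170 for the salesperson data.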

Now suppose we want to estimate annual sales without knowledge of any related
variable (e.g. years of experience); we would then use the sample mean as the
estimate of sales for any given salesperson. For the ith salesperson in the sample, the
difference 𝑦𝑖 − 𝑦̅ provides a measure of the error involved in using 𝑦̅ to estimate
sales. The corresponding sum of squares, called the total sum of squares, is
denoted by 𝑆𝑆𝑇.

Total sum of Squares: 𝑆𝑆𝑇 = ∑(𝑦𝑖 − 𝑦̅)2

Table 4: Calculation of 𝑺𝑺𝑻.

Salesperson   Yrs of Exp (𝑥)   Annual Sales (𝑦)   𝑦𝑖 − 𝑦̅   (𝑦𝑖 − 𝑦̅)²
1             1                80                 -28         784
2             3                97                 -11         121
3             4                92                 -16         256
4             4                102                -6          36
5             6                103                -5          25
6             8                111                3           9
7             10               119                11          121
8             10               123                15          225
9             11               117                9           81
10            13               136                28          784
                                                              SST = 2442

Sum of Squares Due to Regression (Model)

The sum of squares of the deviations 𝑦̂𝑖 − 𝑦̅, called the sum of squares due to
regression, is denoted by 𝑆𝑆𝑅.

𝑆𝑆𝑅 = ∑(𝑦̂𝑖 − 𝑦̅)²

Relationship Among SST, SSR and SSE

𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸

Where SST = Total sum of squares


SSR = Sum of squares due to regression (Model)
SSE = Sum of squares due to errors.

The estimated regression equation would provide a perfect fit if every value of the
dependent variable 𝑦𝑖 happened to lie on the estimated regression line, resulting in
𝑆𝑆𝐸 = 0.

For a perfect fit, 𝑆𝑆𝑅 must equal 𝑆𝑆𝑇, and the ratio 𝑆𝑆𝑅/𝑆𝑆𝑇 must equal one. The
ratio 𝑆𝑆𝑅/𝑆𝑆𝑇, which will take values between zero and one, is used to evaluate the
goodness of fit for the estimated regression equation. This ratio is called the
𝑐𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑑𝑒𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑡𝑖𝑜𝑛 and is denoted by 𝑟². That is, 𝑟² = 𝑆𝑆𝑅/𝑆𝑆𝑇.

When the coefficient of determination is expressed as a percentage, 𝑟² can be
interpreted as the percentage of the total sum of squares that can be explained by
using the estimated regression equation.

Note: The total sum of squares can be computed by using the following alternative
formula:

𝑆𝑆𝑇 = ∑𝑦𝑖² − (∑𝑦𝑖)²/𝑛

The sum of squares due to regression can be computed by using the following
alternative formula:

𝑆𝑆𝑅 = [∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖 ∑𝑦𝑖)/𝑛]² / [∑𝑥𝑖² − (∑𝑥𝑖)²/𝑛]
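Both shortcut formulas, together with the identity 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸, can be verified on the salesperson data (a sketch; variable names are our own):

```python
x = [1, 3, 4, 4, 6, 8, 10, 10, 11, 13]
y = [80, 97, 92, 102, 103, 111, 119, 123, 117, 136]
n = len(x)

# SST via the shortcut formula
SST = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
# SSR via the shortcut formula
num = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
den = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
SSR = num ** 2 / den
SSE = SST - SSR              # by the identity SST = SSR + SSE
r_squared = SSR / SST        # coefficient of determination
```

This gives 𝑆𝑆𝑇 = 2442, 𝑆𝑆𝑅 = 2272, 𝑆𝑆𝐸 = 170 and 𝑟² ≈ 0.930, so about 93% of the variability in sales is explained by years of experience.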

The Correlation Coefficient


The correlation coefficient is a descriptive measure of the strength of linear
association between two variables, x and y. Values of the correlation coefficient are
always between -1 and +1. A value of +1 indicates that the two variables 𝑥 and 𝑦 are
perfectly related in a positive linear sense. That is, all data points are on a straight
line that has a positive slope. A value of -1 indicates that 𝑥 and 𝑦 are perfectly
related in a negative linear sense; with all data points on a straight line that has a
negative slope. Values of the correlation coefficient close to zero indicate that 𝑥 and
𝑦 are not linearly related.

If a regression analysis has already been performed and the coefficient of
determination 𝑟² has been computed, the sample correlation coefficient can be
computed as follows.

Sample correlation coefficient (𝑟) = (𝑠𝑖𝑔𝑛 𝑜𝑓 𝑏1 )√𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑑𝑒𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑡𝑖𝑜𝑛


= (𝑠𝑖𝑔𝑛 𝑜𝑓 𝑏1 )√𝑟 2 .

Where 𝑏1 = the slope of the estimated regression equation 𝑦̂ = 𝑏0 + 𝑏1𝑥.
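This relationship is one line of code (a sketch; the function name is our own):

```python
import math

def sample_correlation(b1, r_squared):
    """r = (sign of b1) * sqrt(coefficient of determination)."""
    return math.copysign(math.sqrt(r_squared), b1)
```

For the salesperson data (𝑏1 = 4, 𝑟² ≈ 0.930) this gives 𝑟 ≈ 0.965, a strong positive linear association.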

Example
Consider the following data.

𝑥𝑖 2 3 5 1 8
𝑦𝑖 25 25 20 30 16

The estimated regression equation for these data is 𝑦̂ = 30.33 − 1.88𝑥.
a. Compute 𝑆𝑆𝐸, 𝑆𝑆𝑇, and 𝑆𝑆𝑅.
b. Compute the coefficient of determination 𝑟 2 .
c. Compute the sample correlation coefficient.

Solution

𝑥𝑖   𝑦𝑖    𝑦̂𝑖     (𝑦𝑖 − 𝑦̂𝑖)²   (𝑦𝑖 − 𝑦̅)²   (𝑦̂𝑖 − 𝑦̅)²   𝑥𝑖²   𝑦𝑖²    𝑥𝑖𝑦𝑖
2    25     26.57    2.46           3.24          11.36          4      625     50
3    25     24.69    0.10           3.24          2.22           9      625     75
5    20     20.93    0.86           10.24         5.15           25     400     100
1    30     28.45    2.40           46.24         27.56          1      900     30
8    16     15.29    0.50           51.84         62.57          64     256     128
19   116             6.33           114.8         108.86         103    2806    383
(sums)

𝑥̅ = 19/5 = 3.8, 𝑦̅ = 116/5 = 23.2

a. 𝑆𝑆𝐸 = ∑(𝑦𝑖 − 𝑦̂𝑖 )2 = 6.33

𝑆𝑆𝑇 = ∑(𝑦𝑖 − 𝑦̅)2 = 114.8

𝑆𝑆𝑅 = ∑(𝑦̂𝑖 − 𝑦̅)2 = 108.86


b. 𝑟² = 𝑆𝑆𝑅/𝑆𝑆𝑇 = 108.86/114.8 = 0.948

c. 𝑟 = (𝑠𝑖𝑔𝑛 𝑜𝑓 𝑏1 )√𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑑𝑒𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑡𝑖𝑜𝑛

= (𝑠𝑖𝑔𝑛 𝑜𝑓 𝑏1 )√𝑟 2.
But 𝑏1 = [∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖 ∑𝑦𝑖)/𝑛] / [∑𝑥𝑖² − (∑𝑥𝑖)²/𝑛]
       = [383 − (19)(116)/5] / [103 − (19)²/5] = −1.88

∴ 𝑟 = −√0.948 ≈ −0.974.
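Recomputing this example at full precision shows how the rounded table above drifts slightly (a sketch; carrying 𝑦̂ to two decimals gives 𝑆𝑆𝑅 ≈ 108.86, whereas the exact value is about 108.47, so 𝑟² ≈ 0.945 and 𝑟 ≈ −0.972):

```python
import math

x = [2, 3, 5, 1, 8]
y = [25, 25, 20, 30, 16]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n            # 3.8 and 23.2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)        # -57.8 / 30.8, about -1.877
b0 = y_bar - b1 * x_bar                          # about 30.33
SST = sum((yi - y_bar) ** 2 for yi in y)         # 114.8
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
SSR = SST - SSE
r_squared = SSR / SST                            # about 0.945
r = math.copysign(math.sqrt(r_squared), b1)      # about -0.972
```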

In the case of a linear relationship between variables, both the coefficient of
determination and the sample correlation coefficient provide measures of the
strength of the relationship. The coefficient of determination provides a measure
between zero and one, whereas the sample correlation coefficient provides a
measure between -1 and +1. Although the sample correlation coefficient is
restricted to a linear relationship between two variables, the coefficient of
determination can be used for nonlinear relationships and for relationships that
have two or more independent variables. In this sense the coefficient of
determination has a wider range of applicability.

TESTING FOR SIGNIFICANCE


Recall that 𝑆𝑆𝐸 = ∑(𝑦𝑖 − 𝑦̂𝑖)² = ∑(𝑦𝑖 − 𝑏0 − 𝑏1𝑥𝑖)².
It has 𝑛 − 2 degrees of freedom, since two parameters (𝛽0 and 𝛽1) must be estimated
to compute 𝑆𝑆𝐸. Thus, the mean square error is calculated by dividing 𝑆𝑆𝐸 by its
degrees of freedom:

Mean Square Error = 𝑠Ɛ² = 𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 2).

T-Test: Testing the slope


The simple linear regression model is 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜀. If 𝑥 and 𝑦 are linearly
related, then 𝛽1 ≠ 0; in other words, 𝛽1 is statistically significant. We will use the
sample data to test the following hypotheses about the population parameter 𝛽1.

𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0

If 𝐻0 is rejected, we will conclude that 𝛽1 ≠ 0 and that there is a statistically
significant relationship between 𝑥 and 𝑦.

Test statistic: 𝑡 = (𝑏1 − 𝛽1)/𝑠𝑏1 = 𝑏1/𝑠𝑏1 (under 𝐻0: 𝛽1 = 0)

Where 𝑠𝑏1 is the estimated standard deviation of 𝑏1, defined as: 𝑠𝑏1 = 𝑠Ɛ/√((𝑛 − 1)𝑠𝑥²)

If the error term (Ɛ) is normally distributed, the test statistic has a Student t-
distribution with 𝑛 − 2 degrees of freedom. As a rule of thumb, 𝐻0 is rejected
when |𝑡| is greater than 2.

The t-test steps for a significant relationship are as follows:

I. Specify your hypotheses.

II. Compute the test statistic:
𝑡-stat = 𝑏1/𝑠𝑏1, to be compared with 𝑡-critical(𝛼/2; 𝑑𝑓 = 𝑛 − 2).

III. Rejection rule: reject 𝐻0 if |𝑡-stat| > 𝑡-critical, where 𝑡-critical is based on the
Student t-distribution with 𝑛 − 2 degrees of freedom. You can also apply the rule of
thumb, which states: reject 𝐻0 if |𝑡-stat| is greater than 2 and 𝑛 ≥ 30.

NB: You can also apply the p-value approach: reject 𝐻0 if p-value ≤ 𝛼.
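The 𝑡 statistic can be computed directly from the sample data (a sketch; the function name is our own, and the standard error uses the equivalent form √(𝑀𝑆𝐸/∑(𝑥𝑖 − 𝑥̅)²), since (𝑛 − 1)𝑠𝑥² = ∑(𝑥𝑖 − 𝑥̅)²):

```python
import math

def slope_t_stat(x, y):
    """t statistic for H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)       # equals (n - 1) * s_x^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                            # s_eps^2, n - 2 degrees of freedom
    s_b1 = math.sqrt(mse / sxx)                    # standard error of b1
    return b1 / s_b1
```

For the salesperson data the statistic is about 10.34, far above 2, so 𝐻0: 𝛽1 = 0 is rejected.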

F-test
An F-test, based on the F probability distribution, can also be used to test for
significance in regression. In fact, 𝐹 = 𝑡 2 for a simple regression model. The F-test
provides the same conclusion as the t-test. The F-test can be used to test for an
overall significant relationship of the model.

The test statistic is given as: 𝐹 = 𝑀𝑆𝑅/𝑀𝑆𝐸

Where 𝑀𝑆𝑅 is the mean square due to regression, or simply the mean square
regression, obtained as 𝑀𝑆𝑅 = 𝑆𝑆𝑅/1; and 𝑀𝑆𝐸 is the mean square error, obtained
as 𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 2).

𝐹 is based on an F distribution with 1 degree of freedom in the numerator and
𝑛 − 2 degrees of freedom in the denominator.

We can construct an ANOVA table to summarise the result of the F test for
significance in regression.

General form of the ANOVA TABLE for Simple Linear Regression

Source of Variation      Degrees of Freedom   Sums of Squares   Mean Squares         F-Statistic
Regression               1                    SSR               MSR = SSR/1          F = MSR/MSE
Error                    n − 2                SSE               MSE = SSE/(n − 2)
Total (variation in y)   n − 1                SST
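The table can be filled in programmatically (a sketch; the function name is our own):

```python
def anova_simple_regression(sst, sse, n):
    """Return (SSR, MSR, MSE, F) for a simple linear regression ANOVA table."""
    ssr = sst - sse              # by the identity SST = SSR + SSE
    msr = ssr / 1                # regression degrees of freedom = 1
    mse = sse / (n - 2)          # error degrees of freedom = n - 2
    return ssr, msr, mse, msr / mse
```

For the salesperson data (𝑆𝑆𝑇 = 2442, 𝑆𝑆𝐸 = 170, 𝑛 = 10) this gives 𝐹 = 2272/21.25 ≈ 106.9, which equals 𝑡² for the 𝑡 statistic of about 10.34.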

UNIT SUMMARY
I. The graph of the estimated simple linear regression equation is called the
estimated regression line, 𝑦̂ = 𝑏0 + 𝑏1 𝑥.

Where:
𝑏0 is the 𝑦 intercept
𝑏1 is the slope
𝑦̂ is the estimated value of y for a given value of 𝑥.

𝑏1 = ∑(𝑥𝑖 − 𝑥̅)(𝑦𝑖 − 𝑦̅) / ∑(𝑥𝑖 − 𝑥̅)²   Or   𝑏1 = [∑𝑥𝑖𝑦𝑖 − (∑𝑥𝑖 ∑𝑦𝑖)/𝑛] / [∑𝑥𝑖² − (∑𝑥𝑖)²/𝑛]

𝑏0 = 𝑦̅ − 𝑏1𝑥̅.

II. The coefficient of determination provides a measure of the goodness of fit for
the estimated regression equation

III. The sum of squares of the residuals, or errors, is known as the sum of squares
due to error and is denoted by 𝑆𝑆𝐸.

IV. The corresponding sum of squares, called the total sum of squares, is denoted
by 𝑆𝑆𝑇; (𝑦𝑖 − 𝑦̅) provides a measure of the error involved in using 𝑦̅ to estimate
sales.

V. The sum of squares of the deviations 𝑦̂𝑖 − 𝑦̅, called the sum of squares due to
regression, is denoted by 𝑆𝑆𝑅.

VI. The ratio 𝑆𝑆𝑅/𝑆𝑆𝑇, which will take values between zero and one, is used to
evaluate the goodness of fit for the estimated regression equation. This ratio is
called the 𝒄𝒐𝒆𝒇𝒇𝒊𝒄𝒊𝒆𝒏𝒕 𝒐𝒇 𝒅𝒆𝒕𝒆𝒓𝒎𝒊𝒏𝒂𝒕𝒊𝒐𝒏 and is denoted by 𝑟².

The correlation coefficient is a descriptive measure of the strength of linear
association between two variables, 𝑥 and 𝑦. A value of +1 indicates that the two
variables 𝑥 and 𝑦 are perfectly related in a positive linear sense. A value of -1
indicates that 𝑥 and 𝑦 are perfectly related in a negative linear sense.

SELF-TEST EXERCISES
Question 1
The following sample data contains the number of years spent at university and the
current annual salary for a random sample of Electronic equipment salespeople
employed by HI-Tech Corporation.

Years Spent at University    Annual Income (In Thousands)
2 20
2 23
3 25
4 26
3 28
1 29
4 27
3 30
4 33
4 35
Required:
a. Which variable is the dependent variable? Which is the independent variable?
b. Determine the least squares estimated regression line.
c. Predict the annual income of a salesperson with one year of university.
d. Calculate the coefficient of determination.
e. Calculate the sample correlation coefficient between income and years spent at
university. Interpret the value you obtain.

Question 2
Below you are given information on a woman's age and her annual expenditure on
purchase of books.
Age Annual Expenditure ($)
18 210
22 180
21 220
28 280

Required:
a. Develop the least squares regression equation.
b. Compute SSR, SSE and SST.
c. Compute the coefficient of determination.
d. Compute the sample correlation coefficient.

