
Simple Regression Analysis

1 Introduction
Regression analysis is a method of predicting the value of one variable from other variables by fitting a statistical model to the data in the form of a straight line that best summarizes the pattern of the data. Thus, regression analysis is used to predict or estimate the value of a dependent (also called response or endogenous) variable based on the known values of independent (also called explanatory or regressor) variables. The unknown variable that is to be estimated or predicted is called the dependent variable and is denoted by 'y'. The variable that is used to estimate the value of the dependent variable (y) is called the independent variable and is denoted by 'x'.
When regression analysis is used to measure the strength of the relationship between one dependent variable (y) and one independent variable (x), it is called simple regression analysis.

2 Regression line (Regression Model)


A simple linear regression line between one dependent variable (y) and one independent
variable (x1) is written as
y = β0 + β1·x1 + e        ...(1)

Where
y = dependent variable
x1 = independent variable
β0 = y-intercept for the population
β1 = slope for the population, i.e. the regression coefficient of the dependent variable (y) on the independent variable (x1)
e = error term, the difference between the observed and the estimated value of the dependent variable (y)

To obtain the best fit of the regression model of y on x, we need the values of β0 and β1, which are unknown. By using the principle of least squares, we can obtain the two normal equations of regression model (1).
The two normal equations of regression line (1) are

Σy = n·b0 + b1·Σx1        ...(2)

Σx1y = b0·Σx1 + b1·Σx1²        ...(3)

By solving these two normal equations we get the values of b0 and b1 as


b0 = ȳ - b1·x̄1        ...(4)

b1 = SSxy / SSx = Σ(x1 - x̄1)(y - ȳ) / Σ(x1 - x̄1)²        ...(5)

or

b1 = (Σx1y - n·x̄1·ȳ) / (Σx1² - n·x̄1²)        ...(6)

After finding the values of b0 and b1, we get the required fitted regression model of y on x as

ŷ = b0 + b1·x1
Where

ŷ = estimated value of the dependent variable (y) for a given value of the independent variable (x1)
x1 = independent variable
b0 = estimated value of β0, i.e. the y-intercept
b1 = estimated value of β1, i.e. the regression coefficient of y on x1, or the slope of the regression line
n = number of pairs of data
x̄1 = mean of the independent variable
ȳ = mean of the dependent variable
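
The formulas above can be applied directly by hand or in code. The following Python sketch (not part of the original handout) computes b0 and b1 from equations (4) and (6), using the absence/temperature data of problem Q1 at the end of this unit as an illustration; the variable names are my own.

```python
import numpy as np

# Illustrative data taken from problem Q1 later in this unit:
# mean temperature (x1) and number of students absent (y).
x = np.array([10, 20, 25, 30, 40, 45, 50, 55, 59, 60], dtype=float)
y = np.array([8, 7, 5, 4, 2, 3, 5, 6, 8, 9], dtype=float)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Slope from equation (6): b1 = (Sum(x1*y) - n*x_bar*y_bar) / (Sum(x1^2) - n*x_bar^2)
b1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)

# Intercept from equation (4): b0 = y_bar - b1*x_bar
b0 = y_bar - b1 * x_bar

# Fitted values and residuals (see sections 3 and 4): y_hat = b0 + b1*x1, e = y - y_hat
y_hat = b0 + b1 * x
residuals = y - y_hat

print(f"Fitted model: y_hat = {b0:.3f} + ({b1:.3f})*x1")
```

Later sketches in this unit reuse the names x, y, n, b0, b1, y_hat and residuals defined here.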

3 Interpreting the regression coefficients:


Suppose the following is the fitted simple regression model

ŷ = 15 - 3·x1

a. The coefficient b0 (the estimated value of β0) represents the average value of the
dependent variable (y) when the value of the independent variable (x1) is zero.
For example, in the above model, b0 = 15; this means the average value of the
dependent variable (y) is 15 when x1 = 0.
b. The regression coefficient b1 (the estimated value of β1) measures the average rate of
increase or decrease in the value of the dependent variable (y) when the value of the
independent variable (x1) increases by one unit.
For example, in the above model, b1 = -3; this means the value of the dependent
variable (y) decreases by 3 when the value of the independent variable (x1)
increases by 1.

Note: If in the above model b1 = 3, this means the value of the dependent variable
(y) increases by 3 when the value of the independent variable (x1) increases by 1.
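
As a quick numerical check of this interpretation (an illustrative snippet, not part of the handout):

```python
# Example model from above: y_hat = 15 - 3*x1
def y_hat_example(x1):
    return 15 - 3 * x1

print(y_hat_example(0))  # 15 -> average value of y when x1 = 0 (the intercept b0)
print(y_hat_example(1))  # 12 -> each one-unit increase in x1 lowers the estimate by 3 (b1 = -3)
```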

4 Error term (Residual):


The difference between the observed and estimated values of the dependent variable (y) is
called the error or residual, and it is denoted by 'e':

e = y - ŷ

Where
e = error term
y = observed value of the dependent variable
ŷ = estimated value of the dependent variable for a given value of the independent variable

5 Measures of Variation:
To examine the ability of the independent variable to predict the dependent variable (y) in
the regression model, several measures of variation need to be developed. In a regression
analysis, the total variation or total sum of squares (SST) is further divided into explained
variation or regression sum of squares (SSR) and unexplained variation or error sum of
squares (SSE). These different measures of variation are shown in the following figure.

[Figure: scatter of observed values of y around the fitted regression line ŷ = b0 + b1·x1, illustrating how the total variation (SST) of a point about ȳ splits into the regression component (SSR) and the error component (SSE).]

From the figure, mathematically,

Total sum of squares (SST) = Regression sum of squares (SSR) + Error sum of squares (SSE), i.e.

SST = SSR + SSE        ...(7)

Where,
SST = Σ(y - ȳ)² = Σy² - n·ȳ²
SSR = Σ(ŷ - ȳ)² = b0·Σy + b1·Σx1y - n·ȳ²
SSE = Σ(y - ŷ)² = Σy² - b0·Σy - b1·Σx1y
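
Continuing the earlier Python sketch (which defined x, y and y_hat), the decomposition in equation (7) can be verified numerically; again, this is only an illustration.

```python
import numpy as np

# Requires y and y_hat from the fitting sketch in section 2.
SST = np.sum((y - y.mean()) ** 2)      # total variation, Sum(y - y_bar)^2
SSR = np.sum((y_hat - y.mean()) ** 2)  # explained variation, Sum(y_hat - y_bar)^2
SSE = np.sum((y - y_hat) ** 2)         # unexplained variation, Sum(y - y_hat)^2

# Equation (7): SST = SSR + SSE (up to floating-point rounding).
assert np.isclose(SST, SSR + SSE)
```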

6 Standard error of the estimate (Se or Sy.x)


The standard error of the estimate measures the average variation of the observed data points
around the regression line. It is used to assess or measure the reliability of the regression
equation; it is denoted by Se or Sy.x and is calculated using the following equation.

Se = √( SSE / (n - 2) ) = √( Σ(y - ŷ)² / (n - 2) )        ...(8)

Interpreting the standard error of the estimate:


The regression line having the smaller value of the standard error of the estimate is more
reliable than the regression line having the higher value; that is, the smaller the standard
error of the estimate, the more reliable the fitted regression line.
a. If Se = 0, there is no variation of the observed data around the regression line, i.e.
all the observed data lie on the regression line. So we expect the regression line to be
perfect for predicting the dependent variable.
b. If the value of Se is large, the fitted regression line is poor for predicting the
dependent variable, since there is greater variation of the observed data around the
regression line.

c. If the value of Se is small, there is less variation of the observed data around the
regression line, so the regression line will be better for predicting the dependent
variable.
If Se = 2.5, this means the average variation of the observed data around the regression line
is 2.5.
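
A one-line continuation of the same illustrative sketch computes Se from equation (8):

```python
import numpy as np

# Requires SSE from the previous sketch and n = number of pairs of data.
Se = np.sqrt(SSE / (n - 2))   # standard error of the estimate, equation (8)
print(f"Se = {Se:.3f}")
```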

7 Confidence Interval Estimate


a. Confidence interval for the y-intercept (β0)

b0 - t(n-2, α)·Sb0 ≤ β0 ≤ b0 + t(n-2, α)·Sb0

b. Confidence interval for the regression coefficient or slope (β1)

b1 - t(n-2, α)·Sb1 ≤ β1 ≤ b1 + t(n-2, α)·Sb1

c. Confidence interval estimate for the mean of the dependent variable (y)

ŷ - t(n-2, α)·Se·√h ≤ μ(Y|X=x) ≤ ŷ + t(n-2, α)·Se·√h

d. Prediction interval for an individual response of the dependent variable (y)

ŷ - t(n-2, α)·Se·√(1 + h) ≤ Y|X=x ≤ ŷ + t(n-2, α)·Se·√(1 + h)

e. Approximate prediction interval: this interval gives the range within which the actual
value of the dependent variable (Y) lies for a given value of the independent variable.

ŷ - t(n-2, α)·Se ≤ Y|X=x ≤ ŷ + t(n-2, α)·Se

Where
ŷ = estimated value of the dependent variable for a given value of the independent variable
Sb0 = standard error of the y-intercept (b0)
Sb1 = standard error of the regression coefficient (b1)

Sb0 = Se·√( Σx1² / (n·(Σx1² - n·x̄1²)) )        ...(9)

Sb1 = Se / √( Σx1² - n·x̄1² )        ...(10)

h = 1/n + (x - x̄1)² / (Σx1² - n·x̄1²)        ...(11)

Se = standard error of the estimate
t(n-2, α) = tabulated value of 't' obtained from the two-tailed Student's t-table at (n - 2) degrees of freedom and 'α' level of significance
n = number of pairs of observations
Other notations have their usual meanings.
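
The interval formulas above can be evaluated as in the following hedged sketch, which continues the same running example (x, y, n, b0, b1 and Se already computed); scipy is used only for the two-tailed t quantile, and the choice x0 = 30 is arbitrary and for illustration only.

```python
import numpy as np
from scipy import stats

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # two-tailed t value at (n - 2) d.f.

Sxx = np.sum(x ** 2) - n * x.mean() ** 2        # Sum(x1^2) - n*x_bar^2
Sb1 = Se / np.sqrt(Sxx)                         # equation (10)
Sb0 = Se * np.sqrt(np.sum(x ** 2) / (n * Sxx))  # equation (9)

# b. confidence interval for the slope beta1
ci_slope = (b1 - t_crit * Sb1, b1 + t_crit * Sb1)

# c. and d. intervals at a chosen value of the independent variable (x0 = 30 here)
x0 = 30
h = 1 / n + (x0 - x.mean()) ** 2 / Sxx          # equation (11)
y0 = b0 + b1 * x0
ci_mean = (y0 - t_crit * Se * np.sqrt(h), y0 + t_crit * Se * np.sqrt(h))
pi_individual = (y0 - t_crit * Se * np.sqrt(1 + h),
                 y0 + t_crit * Se * np.sqrt(1 + h))
```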

8 Test of significance for the regression coefficient (β1):


To determine whether a significant linear relationship exists between the dependent
variable (y) and the independent variable (x1), a hypothesis test concerning the population slope
(β1, i.e. the regression coefficient) is carried out by setting the null and alternative hypotheses as
stated below.

Null hypothesis (H0): β1 = 0 (there is no linear relationship between the dependent
and independent variables)

Alternative hypothesis (H1): β1 ≠ 0 (there is a significant linear relationship
between the dependent and independent variables) (two-tailed)

If the null hypothesis is accepted, you can conclude that there is no linear relationship
between the dependent and independent variables. If the alternative hypothesis is accepted,
you can conclude that there is a significant linear relationship between the dependent and
independent variables.

Test Statistics:
t_cal = b1 / Sb1

This test statistic follows the t-distribution with (n - 2) degrees of freedom.

Decision: if the absolute value of the calculated test statistic (t_cal) is less than the tabulated
value (t_tab), the null hypothesis is accepted; otherwise the null hypothesis is rejected. i.e.
If |t_cal| < t(α, n-2), the null hypothesis is accepted; otherwise the alternative hypothesis is accepted.

Where
t(α, n-2) = tabulated value of 't' at (n - 2) degrees of freedom and 'α' level of significance,
obtained from the two-tailed t-table
n = number of pairs of data
α = level of significance
b1 = regression coefficient of y on x1
Sb1 = standard error of the regression coefficient (b1)

Sb1 = Se / √SSx
Other notations have their usual meanings.
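
The test can be carried out numerically as follows (again, a sketch continuing the same illustrative example; scipy supplies the t distribution):

```python
from scipy import stats

alpha = 0.05
t_cal = b1 / Sb1                                       # test statistic
t_tab = stats.t.ppf(1 - alpha / 2, df=n - 2)           # two-tailed tabulated value
p_value = 2 * (1 - stats.t.cdf(abs(t_cal), df=n - 2))  # two-tailed p-value

# Reject H0 (no linear relationship) when |t_cal| exceeds the tabulated value.
reject_H0 = abs(t_cal) > t_tab
```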

9 Coefficient of determination (r2):


The coefficient of determination measures the strength or extent of the association that
exists between the dependent variable (y) and the independent variable (x1). It measures the
proportion of the variation in the dependent variable (y) that is explained by the regression line.
In other words, the coefficient of determination measures the share of the total variation in the
dependent variable that is due to the variation in the independent variable, and it is denoted by
'r2'. The following relation is used to obtain the value of the coefficient of determination.

r2 = SSR / SST        ...(12)

Note: Since the coefficient of determination is the square of the correlation coefficient, the
correlation coefficient is the square root of the coefficient of determination and can be
obtained from it by the following relation.

r = ±√r2        ...(13)

If the regression coefficient (b1) is negative, take the negative sign.
If the regression coefficient (b1) is positive, take the positive sign.

Adjusted coefficient of determination (r2adj): The adjusted coefficient of determination is
calculated using the following relation.

r2adj = 1 - (1 - r2)·(n - 1)/(n - 2)        ...(14)

Interpretation of the coefficient of determination (r2): The regression model having the
higher value of the coefficient of determination is better and more reliable than the regression
model having the smaller value; that is, a higher value of r2 is better than a lower value. It is
an indication of how well the model fits the data.
For example, if r2 = 0.91, this means 91% of the total variation in the dependent
variable (y) is due to the variation in the independent variable (x1), and the remaining 9% of the
variation in the dependent variable is due to other factors not accounted for by the
independent variable.
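
For completeness, equations (12) to (14) in the same running example (SSR, SST, b1 and n as computed earlier; an illustrative sketch only):

```python
import numpy as np

r2 = SSR / SST                               # coefficient of determination, equation (12)
r = np.sign(b1) * np.sqrt(r2)                # correlation coefficient, equation (13); sign follows b1
r2_adj = 1 - (1 - r2) * (n - 1) / (n - 2)    # adjusted r-squared, equation (14)
```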

10 Assumptions of Regression Analysis:

The following four assumptions are necessary for regression analysis:
i) Linearity
ii) Normality of errors
iii) Homoscedasticity
iv) Independence of errors

i) Linearity: The outcome or dependent variable should, in reality, be linearly related
to the predictor or independent variable(s).
ii) Normality of errors: This assumption requires that the errors around the
regression line be normally distributed for each value of X (the independent variable).
As long as the distribution of the errors around the regression line for each value of the
independent variable is not extremely different from a normal distribution, inference
about the regression line and regression coefficients will not be seriously affected.
iii) Homoscedasticity: This assumption requires that the variation around the line of
regression be constant for all values of the independent variable (X). This means the
errors have the same spread regardless of the value of the independent variable.
The homoscedasticity assumption is important for using the least squares method to
fit the regression line. If there are serious departures from this assumption, the data
need to be transformed or a weighted least squares method needs to be applied.
iv) Independence of errors: This assumption requires that the errors around the
regression line be independent for each value of the explanatory variable. This is
particularly important when data are collected over a period of time. In such
situations, errors for a specific time period are often correlated with those of the
previous time period.

Sample size in Regression


You will find a lot of rules of thumb floating around about the sample size in regression analysis,
the two most common being that you should have 10 cases of data for each predictor in the
model, or 15 cases of data per predictor. Keep in mind that the biggest rule of thumb is that
the bigger the sample size, the better the results.

Methods of testing the assumptions of Regression Analysis using SPSS:

Durbin-Watson test:


This test is used to test the assumption of independence of errors. Unfortunately, SPSS does not
provide a significance value for this test, so you must decide whether the value is different
enough from 2 to be cause for concern. The assumption that errors are independent is likely to
be met if the Durbin-Watson statistic is close to 2 (and between 1 and 3).
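
The handout describes the test in SPSS; the same statistic can also be obtained in Python with statsmodels (an illustrative equivalent, not part of the original text), using the residuals from the earlier sketch.

```python
from statsmodels.stats.stattools import durbin_watson

# residuals = y - y_hat from the fitted regression (see the earlier sketch).
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.2f}")  # values close to 2 suggest independent errors
```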

Graph between ZPRED and ZRESID:


The graph with ZRESID along the y-axis and ZPRED along the x-axis is useful for testing the
assumptions of independence of errors, homoscedasticity, and linearity.
Look at the graph of ZRESID plotted against ZPRED.
• If it looks like a random array of dots, this is good.
• If the dots seem to get more (or less) spread out across the graph (i.e. they look like a funnel),
this is probably a violation of the assumption of homogeneity of variance.
• If the dots have a pattern to them (i.e. a curved shape), this is probably a violation
of the assumption of linearity.
• If the dots seem to have a pattern and are more spread out at some points on the plot than
at others, this probably reflects violations of both homogeneity of variance and linearity.
• Any of these scenarios puts the validity of your model into question.

ZPRED: the standardized predicted values of the dependent variable based on the model,
i.e. standardized forms of the values predicted by the model.
ZRESID: the standardized residuals or errors, i.e. the standardized differences between the
observed data and the values that the model predicts.
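
An equivalent diagnostic plot can be drawn outside SPSS. The sketch below standardizes the predicted values and residuals from the earlier example by their own sample standard deviations, which approximates SPSS's ZPRED/ZRESID (the exact SPSS definitions may differ slightly).

```python
import matplotlib.pyplot as plt

# y_hat and residuals come from the fitted model in the earlier sketch.
zpred = (y_hat - y_hat.mean()) / y_hat.std()
zresid = (residuals - residuals.mean()) / residuals.std()

plt.scatter(zpred, zresid)
plt.axhline(0, linestyle="--")
plt.xlabel("ZPRED (standardized predicted values)")
plt.ylabel("ZRESID (standardized residuals)")
plt.title("Residuals versus predicted values")
plt.show()
```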

Histogram of ZRESID: This is useful for checking the assumption of normality of errors.

Normal Probability Plot (P-P plot): This also provides information about whether the
residuals in the model are normally distributed.

Look at the histogram and P-P plot.

If the histogram looks like a normal distribution (and the P-P plot looks like a diagonal line), then all
is well. If the histogram looks non-normal and the P-P plot looks like a wiggly snake curving
around a diagonal line, then things are less good.

Detecting outliers which may influence the result of the regression:


Look at the standardized residuals (ZRESID) and check that no more than 5% of cases have
absolute values greater than 2, and that no more than about 1% have absolute values above 2.5.
Any case with a value above about 3 could be an outlier.
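
These rules of thumb can be checked directly on the standardized residuals; a small illustrative sketch, reusing zresid from the diagnostic plot above:

```python
import numpy as np

share_above_2 = np.mean(np.abs(zresid) > 2)          # should not exceed roughly 5%
share_above_2_5 = np.mean(np.abs(zresid) > 2.5)      # should not exceed roughly 1%
possible_outliers = np.where(np.abs(zresid) > 3)[0]  # cases worth a closer look
```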

Problems on Simple Regression Analysis:


Q1. An instructor is interested in finding out how the number of students absent on a given
day is related to the mean temperature that day. A random sample of 10 days was used for
the study. The following data indicate the number of students absent (ABS) and the mean
temperature (TEMP) for each day.
ABS: 8 7 5 4 2 3 5 6 8 9
TEMP: 10 20 25 30 40 45 50 55 59 60
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Compute the residual when X= 25
Q2. Cost accountants often estimate overhead based on the level of production. At the
Standard Knitting Co., they have collected information on overhead expenses and units
produced at different plants, and want to estimate a regression equation to predict future
overhead.
Overhead: 191 170 272 155 280 173 234 116 153 178
Units: 40 42 53 35 56 39 48 30 37 40
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Test the significance of the slope at 5% level of significance.
f. Assess the model using standard error of the estimate and coefficient of
determination.
g. Predict overhead when 50 units are produced.

Q3. A consultant is interested in seeing how accurately a new job performance index
measures what is important for a corporation. One way to check is to look at the
relationship between the job evaluation index and an employee's salary. A sample of eight
employees was taken and information about salary (in thousands of Rs.) and job
performance index (1-10; 10 is best) was collected.
Job performance index: 9 7 8 4 7 5 5 6
Salary index: 12 7 8 3 6 2 4 6
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Estimate the salary index when job performance index is 8
g. Test the significance of the regression coefficient at 5% level of significance.
h. Obtain the 95% confidence interval estimate of the slope.
i. Obtain the confidence interval estimate for the mean index of salary for x=6.
j. Obtain the 95% approximate prediction interval of Y for the value of x=6.

Q4. Sales of major appliances vary with the new housing market: when new home sales are
good, so are the sales of dishwashers, washing machines, driers, and refrigerators. A trade
association compiled the following historical data (in thousands of units) on major
appliance sales and housing starts:
Housing starts (thousands): 2.0 2.5 3.2 3.6 3.3 4.0 4.2 4.6 4.8
Appliance sales (thousands): 5 5.5 6 7 7.2 7.7 8.4 9 9.7
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Test the significance of the research question at 5% level of significance.
g. Compute the 90% prediction interval for the appliance sales when housing is 8.0
h. Compute the coefficient of determination and coefficient of correlation and interpret
the value.
Q5. A study by the department of transportation on the effect of bus ticket price upon the
number of passengers produced the following results
Ticket price (Rs.): 25 30 35 40 45 50 55 60
Passenger per 100 miles: 800 780 780 660 640 600 620 620
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Predict the number of passengers per 100 miles if the ticket price were Rs. 50, and
also obtain the 95% approximate prediction interval for a ticket price of Rs. 50.
Q6. Campus stores have been selling the 'Believe It or Not: Wonders of Statistics' study
guide for 10 semesters and would like to estimate the relationship between sales and the
number of sections of elementary statistics taught in each semester. The following data
have been collected.
Sales (units): 33 38 24 61 52 45 65 82 29 63
Number of sections: 3 7 6 6 10 12 12 13 12 13
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Test the significance of the research model at 5% level of significance.
