325unit 1 Simple Regression Analysis
1 Introduction
Regression analysis is a method of predicting the value of one variable from other variables
by fitting a statistical model to the data, in the form of a straight line that best summarizes
the pattern of the data. Thus, regression analysis is used to predict or estimate the value of a
dependent (response or endogenous) variable based on the known values of independent
(explanatory or regressor) variables. The unknown variable that needs to be estimated or
predicted is called the dependent variable and is denoted by ‘y’. The variable that is used to
estimate the value of the dependent variable (y) is called the independent variable and is
denoted by ‘x’.
When regression analysis is used to measure the relationship between one dependent
variable (y) and one independent variable (x), it is called simple regression analysis.
The simple linear regression model is
$y = \beta_0 + \beta_1 x_1 + e$ (1)
Where
y = dependent variable.
x1 = independent variable.
β0 = y-intercept for the population.
β1 = slope for the population, i.e. the regression coefficient of the dependent variable (y) on
the independent variable (x1).
e = error term, the difference between the observed and estimated values of the dependent
variable (y).
To obtain the best fit of the regression model of y on x, we need the values of β0 and β1,
which are unknown. By using the principle of least squares, we get two normal equations
for regression model (1).
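The two normal equations of regression line (1) are
$\sum y = n b_0 + b_1 \sum x_1$ (2)
$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2$ (3)
Solving these two equations gives
$b_0 = \bar{y} - b_1 \bar{x}_1$ (4)
$b_1 = \frac{SS_{xy}}{SS_x} = \frac{\sum (x_1 - \bar{x}_1)(y - \bar{y})}{\sum (x_1 - \bar{x}_1)^2}$ (5)
Or
$b_1 = \frac{\sum x_1 y - n \bar{x}_1 \bar{y}}{\sum x_1^2 - n \bar{x}_1^2}$ (6)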
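The notes state normal equations (2) and (3) without derivation. As a sketch of where they come from, one can minimize the sum of squared errors with respect to b0 and b1. The symbol S for that sum is introduced here only for illustration.

```latex
\begin{aligned}
S(b_0, b_1) &= \sum (y - b_0 - b_1 x_1)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum (y - b_0 - b_1 x_1) = 0
  \;\Rightarrow\; \sum y = n b_0 + b_1 \sum x_1 \\
\frac{\partial S}{\partial b_1} &= -2 \sum x_1 (y - b_0 - b_1 x_1) = 0
  \;\Rightarrow\; \sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2
\end{aligned}
```

Dividing the first resulting equation by n gives $b_0 = \bar{y} - b_1 \bar{x}_1$, and substituting this into the second gives the slope formula (6).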
After finding the values of b0 and b1, we get the required fitted regression model of y on x as
$\hat{y} = b_0 + b_1 x_1$
Where
ŷ = estimated value of the dependent variable (y) for a given value of the independent
variable (x1).
x1 = independent variable.
b0 = estimated value of β0, i.e. the y-intercept.
b1 = estimated value of β1, i.e. the regression coefficient of y on x1, or the slope of the
regression line.
n = number of pairs of data.
x̄1 = mean of the independent variable.
ȳ = mean of the dependent variable.
Note: If b1 = 3 in the above model, the estimated value of the dependent variable (y)
increases by 3 units when the value of the independent variable (x1) increases by 1 unit.
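The fitting steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the data are made up, the function name fit_simple_regression is hypothetical, and the intercept is taken as b0 = ȳ − b1·x̄1, which follows from dividing normal equation (2) by n.

```python
# Least-squares fit of y on x1 using the computational slope formula (6).
# The data are made up purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def fit_simple_regression(x, y):
    """Return (b0, b1) for the fitted line y_hat = b0 + b1 * x1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SSxy = sum(x*y) - n * x_bar * y_bar
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    # SSx = sum(x^2) - n * x_bar^2
    ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_regression(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} * x1")

# Estimated y for a chosen value of x1 (the value 6 is arbitrary).
y_hat_at_6 = b0 + b1 * 6
print(f"estimate at x1 = 6: {y_hat_at_6:.2f}")
```

For this toy data the slope is 0.6, so each one-unit increase in x1 raises the estimated y by 0.6 units.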
5 Measures of Variation:
To examine the ability of the independent variable to predict the dependent variable (y) in
the regression model, several measures of variation need to be developed. In a regression
analysis, the total variation or total sum of squares (SST) is further divided into explained
variation or regression sum of squares (SSR) and unexplained variation or error sum of
squares (SSE). These different measures of variation are shown in the following figure.
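[Figure: observed values plotted with X on the horizontal axis and Y on the vertical axis, together with the fitted line ŷ = b0 + b1 x1, showing SST split into SSR and SSE.]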
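The split of total variation into explained and unexplained parts can be checked numerically. The sketch below assumes the usual definitions SST = Σ(y − ȳ)², SSR = Σ(ŷ − ȳ)² and SSE = Σ(y − ŷ)², and uses made-up data.

```python
# Illustrative (made-up) data; not taken from the exercises.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Least-squares estimates of the intercept and slope.
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
```

For a least-squares line with an intercept, SSR and SSE add up to SST, apart from floating-point rounding.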
$S_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$ (8)
c. If the value of Se is small, the observed data vary little around the regression line,
so the regression line is better for predicting the dependent variable.
If Se = 2.5, this means the typical deviation of the observed data from the regression line
is about 2.5 units.
Where
ŷ = Estimated value of the dependent variable for a given value of independent
variable.
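As a rough illustration of formula (8), the standard error of the estimate can be computed from the residuals of a fitted line. The data are made up and the variable names are chosen only for this sketch.

```python
import math

# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Fit the line first (least-squares estimates).
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

# SSE is the sum of squared differences between observed and estimated y.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Standard error of the estimate, with n - 2 degrees of freedom.
se = math.sqrt(sse / (n - 2))
print(f"Se = {se:.4f}")
```

Here Se is about 0.89, meaning the observed y values typically lie a little under one unit away from the fitted line.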
$S_{b_1} = \frac{S_e}{\sqrt{\sum x_1^2 - n \bar{x}_1^2}}$ (10)
$h = \frac{1}{n} + \frac{(x_1 - \bar{x}_1)^2}{\sum x_1^2 - n \bar{x}_1^2}$ (11)
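Formulas (10) and (11) share the denominator Σx² − n·x̄², which is SSx. The sketch below evaluates both on made-up data. The name x_new for the value at which h is evaluated is an assumption introduced here. A term of this form commonly appears in interval estimates for a given value of the independent variable.

```python
import math

# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2  # shared denominator
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))

# Formula (10): standard error of the slope.
s_b1 = se / math.sqrt(ss_x)

# Formula (11): h evaluated at a chosen value x_new (hypothetical name).
x_new = 4
h = 1 / n + (x_new - x_bar) ** 2 / ss_x

print(f"S_b1 = {s_b1:.4f}, h at x = {x_new}: {h:.2f}")
```

Note that h is smallest when x_new equals the mean of x and grows as x_new moves away from it.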
Null hypothesis (H0): β1 = 0 (This means there is no linear relationship between the
dependent and independent variables.)
Alternative hypothesis (H1): β1 ≠ 0 (This means there is a linear relationship between the
dependent and independent variables.)
If the null hypothesis is accepted, you can conclude that there is no significant linear
relationship between the dependent and independent variables. But if the alternative
hypothesis is accepted, you can conclude that there is a significant linear relationship
between the dependent and independent variables.
Test statistic:
$t_{cal} = \frac{b_1}{S_{b_1}}$
This test statistic follows the t-distribution with (n − 2) degrees of freedom.
Decision: if the absolute calculated value of the test statistic (|t cal|) is less than the
tabulated value (t tab), the null hypothesis is accepted; otherwise the null hypothesis is
rejected, i.e.
If $|t_{cal}| < t_{\alpha, n-2}$, the null hypothesis is accepted. Otherwise the alternative hypothesis is accepted.
Where
tα, n-2 = tabulated value of ‘t’ at (n − 2) degrees of freedom and ‘α’ level of significance,
obtained from the two-tailed t-table.
n = number of pairs of data.
α = level of significance.
b1 = regression coefficient of y on x1.
S_b1 = standard error of the regression coefficient (b1):
$S_{b_1} = \frac{S_e}{\sqrt{SS_x}}$
Other notations have their usual meanings.
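The test procedure above can be sketched as follows, using made-up data. The critical value 3.182 is the two-tailed 5% value for 3 degrees of freedom, read from a t-table. It is hard-coded here because this sketch uses only the standard library, so for other sample sizes it would need to be looked up again.

```python
import math

# Illustrative (made-up) data with n = 5, so n - 2 = 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))   # standard error of the estimate
s_b1 = se / math.sqrt(ss_x)     # standard error of the slope

t_cal = b1 / s_b1
t_tab = 3.182  # two-tailed t-table value, alpha = 0.05, df = 3

# Compare the absolute value because the test is two-tailed.
reject_h0 = abs(t_cal) >= t_tab
print(f"t_cal = {t_cal:.3f}, t_tab = {t_tab}, reject H0: {reject_h0}")
```

With this small sample, t_cal is about 2.12, which is below 3.182, so H0 is not rejected even though the fitted slope is positive.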
Interpretation of the coefficient of determination (r2): A regression model with a higher
coefficient of determination is better and more reliable than a regression model with a
smaller coefficient of determination, i.e. a higher value of r2 is better than a lower value
of r2. It indicates how well the model fits the data.
For example, if r2 = 0.91, this means 91% of the total variation in the dependent
variable (y) is explained by the variation in the independent variable (x1), and the remaining
9% of the variation in the dependent variable is due to other factors that are not accounted
for by the independent variable.
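A short numerical sketch of this interpretation, assuming the common definition r2 = SSR / SST (equivalently 1 − SSE / SST) and made-up data:

```python
import math

# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Share of the total variation explained by the regression.
r2 = 1 - sse / sst

# In simple regression the correlation coefficient takes the sign of the slope.
r = math.copysign(math.sqrt(r2), b1)

print(f"r2 = {r2:.2f} ({r2:.0%} of the variation in y explained), r = {r:.3f}")
```

Here r2 = 0.60, so 60% of the variation in y is explained by x1 and the remaining 40% is left to other factors.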
ZPRED: These are the standardized predicted values of the dependent variable based on the
model, i.e. the standardized forms of the values predicted by the model.
ZRESID: These are the standardized residuals or errors, i.e. the standardized differences
between the observed data and the values that the model predicts.
Histogram of ZRESID: This is useful for checking the assumption of normality of errors.
Normal Probability Plot (P-P plot): This also provides information about whether the
residuals in the model are normally distributed.
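To make ZPRED and ZRESID concrete, the sketch below standardizes predicted values and residuals by hand on made-up data. It assumes one common convention: predicted values are centred on their mean and divided by their sample standard deviation, and residuals are divided by the standard error of the estimate. Statistical packages may differ in these details.

```python
import math
import statistics

# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, y_hat)]
se = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

# Standardized predicted values (assumed convention: sample standard deviation).
mean_pred = statistics.mean(y_hat)
sd_pred = statistics.stdev(y_hat)
zpred = [(yh - mean_pred) / sd_pred for yh in y_hat]

# Standardized residuals (assumed convention: residual divided by Se).
zresid = [e / se for e in resid]

for zp, zr in zip(zpred, zresid):
    print(f"ZPRED = {zp:6.3f}   ZRESID = {zr:6.3f}")
```

A histogram or normal probability plot of the zresid values is then what would be inspected to judge whether the errors look roughly normal.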
Q3. A consultant is interested in seeing how accurately a new job performance index
measures what is important for a corporation. One way to check is to look at the
relationship between the job evaluation index and an employee’s salary. A sample of eight
employees was taken, and information about salary (in thousands of Rs.) and job
performance index (1-10; 10 is best) was collected.
Job performance index: 9 7 8 4 7 5 5 6
Salary index : 12 7 8 3 6 2 4 6
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Estimate the salary index when job performance index is 8
g. Test the significance of the regression coefficient at 5% level of significance.
h. Obtain the 95% confidence interval estimate of the slope.
i. Obtain the confidence interval estimate for the mean index of salary for x=6.
j. Obtain the 95% approximate prediction interval of Y for the value of x=6.
Q4. Sales of major appliances vary with the new housing market: when new home sales are
good, so are the sales of dishwashers, washing machines, driers, and refrigerators. A trade
association compiled the following historical data (in thousands of units) on major
appliance sales and housing starts:
Housing starts (thousands): 2.0 2.5 3.2 3.6 3.3 4.0 4.2 4.6 4.8
Appliance sales (thousands): 5 5.5 6 7 7.2 7.7 8.4 9 9.7
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Test the significance of the research question at 5% level of significance.
g. Compute the 90% prediction interval for appliance sales when housing starts are 8.0 (thousands).
h. Compute the coefficient of determination and coefficient of correlation and interpret
the value.
Q5. A study by the department of transportation on the effect of bus ticket price upon the
number of passengers produced the following results:
Ticket price (Rs.): 25 30 35 40 45 50 55 60
Passengers per 100 miles: 800 780 780 660 640 600 620 620
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Predict the number of passengers per 100 miles if the ticket price were Rs. 50, and
also obtain the 95% approximate prediction interval for a ticket price of Rs. 50.
Q6. Campus stores have been selling the Believe It or Not: Wonders of Statistics Study
Guide for 10 semesters and would like to estimate the relationship between sales and the
number of sections of elementary statistics taught in each semester. The following data
have been collected:
Sales (units): 33 38 24 61 52 45 65 82 29 63
Number of sections: 3 7 6 6 10 12 12 13 12 13
a. Determine the independent and dependent variable.
b. Draw the research model and write the research hypothesis
c. Fit the model using above data.
d. Interpret the findings of the model.
e. Assess the model using standard error of the estimate and coefficient of
determination.
f. Test the significance of the research model at 5% level of significance.