DISC 203 – PROBABILITY & STATISTICS
SIMPLE LINEAR REGRESSION
Lecturer: Muhammad Asim
SIMPLE LINEAR REGRESSION MODEL

Yi = β0 + β1Xi + εi

where
  Yi = dependent variable
  Xi = independent variable
  β0 = intercept
  β1 = slope coefficient
  εi = random error term

Deterministic component: β0 + β1Xi
Random error component: εi
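As an illustration, here is a minimal R sketch that simulates data from this model; the parameter values and variable names are illustrative assumptions, not from the lecture.

    set.seed(1)                          # for reproducibility
    x   <- 1:50                          # hypothetical independent variable
    eps <- rnorm(50, mean = 0, sd = 0.6) # random error component
    y   <- -0.1 + 0.7 * x + eps          # deterministic component plus random error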
EXAMPLE
You are a marketing analyst for Teddy Bears. You
gather the following data and want to find a
simple relationship between advertising and
sales.
Advertising – Sales Data

Month   Advertising Expenditure x ($100)   Sales Revenue y ($1,000)
1       1                                  1
2       2                                  1
3       3                                  2
4       4                                  2
5       5                                  4
SCATTERGRAM: SALES VS. ADVERTISING

[Scatterplot of Sales Revenue y ($1,000) against Advertising Expenditure x ($100) for the five sampled months.]
LEAST SQUARES ESTIMATORS
Prediction equation: ŷi = β̂0 + β̂1xi

Sample slope: β̂1 = SSxy / SSxx

Sample y-intercept: β̂0 = ȳ − β̂1x̄

where
  SSxy = Σ(xi − x̄)(yi − ȳ)
  SSxx = Σ(xi − x̄)²
  SSyy = Σ(yi − ȳ)²
COMPUTATIONS – LEAST SQUARES LINE
xi (adv)    yi (sales)    (xi − 3)²    (xi − 3)(yi − 2)
1           1             4            2
2           1             1            1
3           2             0            0
4           2             1            0
5           4             4            4
∑xi = 15    ∑yi = 10      SSxx = 10    SSxy = 7
x̄ = 3       ȳ = 2

So β̂1 = SSxy/SSxx = 7/10 = .7 and β̂0 = ȳ − β̂1x̄ = 2 − (.7)(3) = −.1, giving the least squares line ŷ = −.1 + .7x.
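The same computations in base R, as a minimal sketch (the variable names are my own):

    adv   <- c(1, 2, 3, 4, 5)      # advertising expenditure ($100s)
    sales <- c(1, 1, 2, 2, 4)      # sales revenue ($1,000s)

    SSxx  <- sum((adv - mean(adv))^2)                       # 10
    SSxy  <- sum((adv - mean(adv)) * (sales - mean(sales))) # 7

    b1 <- SSxy / SSxx                  # sample slope: 0.7
    b0 <- mean(sales) - b1 * mean(adv) # sample y-intercept: -0.1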
COEFFICIENT INTERPRETATIONS
1. Slope (β̂1 = .7)
• Sales revenue (y) is expected to increase by $700 for each $100 increase in advertising (x), over the sampled range of advertising expenditures from $100 to $500.

2. y-Intercept (β̂0 = −.1)
• Since x = 0 is outside the range of the sampled values of x, the y-intercept has no meaningful interpretation.
MEASURES OF VARIATION
SST = total sum of squares
• Measures the variation of the yi values around their mean ȳ

SSR = regression sum of squares
• Explained variation attributable to the linear relationship between x and y

SSE = error sum of squares
• Variation attributable to factors other than the linear relationship between x and y
MEASURES OF VARIATION
Total variation is made up of two parts:

SST = SSR + SSE

SST = Σ(yi − ȳ)²  (total sum of squares)
SSR = Σ(ŷi − ȳ)²  (regression sum of squares)
SSE = Σ(yi − ŷi)² (error sum of squares)

where:
  ȳ  = average value of the dependent variable
  yi = observed values of the dependent variable
  ŷi = predicted value of y for the given value of xi
COMPUTATIONS – SSE

x (adv)    y (sales)    ŷ = −.1 + .7x    (y − ŷ)    (y − ŷ)²
1          1            .6               .4         .16
2          1            1.3              −.3        .09
3          2            2.0              .0         .00
4          2            2.7              −.7        .49
5          4            3.4              .6         .36
n = 5                                    SSE = 1.10
STANDARD ERROR OF THE REGRESSION MODEL

s² = SSE / (n − 2) = SSE / (degrees of freedom for error)
We refer to s as the standard error of the
regression model
s measures the spread of the distribution of y
values about the least squares line
We expect most of the observed y-values to
lie within 2s of their respective least squares
predicted values
CALCULATING S2 AND S
s² = SSE / (n − 2) = 1.1 / (5 − 2) = .36667

s = √.36667 = .6055

We would expect most of the observed revenues to fall within 2s, or about $1,220, of the least squares line.
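Continuing the R sketch from above:

    yhat <- b0 + b1 * adv              # least squares predictions
    SSE  <- sum((sales - yhat)^2)      # 1.1
    s2   <- SSE / (length(sales) - 2)  # 0.36667
    s    <- sqrt(s2)                   # 0.6055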
COEFFICIENT OF DETERMINATION, R2
The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by
variation in the independent variable
The coefficient of determination is also
called R-squared and is denoted as R2
R² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ R² ≤ 1
R2 INTERPRETATION
R² = SSR/SST = 4.9/6 ≈ 0.82

Interpretation: About 82% of the sample variation in sales can be explained by advertising expenditures, using the linear regression model.
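In the running R sketch:

    SST <- sum((sales - mean(sales))^2) # total sum of squares: 6
    SSR <- sum((yhat - mean(sales))^2)  # regression sum of squares: 4.9
    R2  <- SSR / SST                    # 0.8167, about 82%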
RECAP
Yi = β0 + β1Xi + εi
MAKING INFERENCES ABOUT SLOPE
E(y) = β0 + β1x

H0: β1 = 0
Ha: β1 ≠ 0

If β1 = 0, then x has no influence on y. If we reject H0, we say that x has a statistically significant effect on y.

To test the null, we need to know the sampling distribution of β̂1.
MAKING INFERENCES ABOUT THE SLOPE β1

Sampling distribution of β̂1 for large n:

β̂1 ~ N(β1, σβ̂1²),  where σβ̂1 = σ / √SSxx

We typically approximate σβ̂1 by sβ̂1 = s / √SSxx.

So, when n is large, we use a z-statistic ~ N(0, 1).
When n is small, we typically use a t-statistic ~ t(n − 2).
For large n, the distributions of the z and t statistics are almost the same.
MAKING INFERENCES ABOUT THE SLOPE β1

A Test of Model Utility: Simple Linear Regression

One-Tailed Test                        Two-Tailed Test
H0: β1 = 0                             H0: β1 = 0
Ha: β1 < 0 (or Ha: β1 > 0)             Ha: β1 ≠ 0

Test statistic: t = β̂1 / sβ̂1 = β̂1 / (s / √SSxx),  where s² = SSE / (n − 2)

Rejection region: t < −tα              Rejection region: |t| > tα/2
(or t > tα when Ha: β1 > 0)

where tα and tα/2 are based on (n − 2) degrees of freedom
EXAMPLE
We estimated a simple relationship between
advertising and sales based on a sample of 5
observations. Is the true relationship
statistically significant at the .05 level of
significance?
TEST OF SLOPE COEFFICIENT SOLUTION

H0: β1 = 0
Ha: β1 ≠ 0
α = .05
df = 5 − 2 = 3
Critical values: ±tα/2 = ±3.182, so reject H0 if |t| > 3.182

Test statistic: t = β̂1 / (s / √SSxx) = .7 / (.6055 / √10) ≈ 3.66

Decision: Reject H0 at α = .05, since 3.66 > 3.182.

Conclusion: There is evidence of a linear relationship between advertising and sales.
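The same test in the R sketch (qt gives the critical value of the t distribution):

    t_stat <- b1 / (s / sqrt(SSxx)) # 3.66
    t_crit <- qt(0.975, df = 3)     # 3.182
    t_stat > t_crit                 # TRUE: reject H0 at alpha = .05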
MAKING INFERENCES ABOUT THE SLOPE β1

Confidence interval for β1: β̂1 ± tα/2 · sβ̂1 = .7 ± (3.182)(.1915) = [0.090, 1.309]
We can be 95% confident that the true mean
increase in monthly sales revenue per
additional $100 of advertising expenditure is
between $90 and $1,310.
REGRESSION RESULTS IN R
[Slide shows R regression output for the advertising–sales model.]
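A minimal sketch of how such output could be produced in R, continuing the variables defined above (the slide's exact output is not reproduced here):

    fit <- lm(sales ~ adv)       # fit the least squares line
    summary(fit)                 # coefficients, standard errors, t tests, R-squared
    confint(fit, level = 0.95)   # 95% confidence intervals for beta0 and beta1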
PREDICTION WITH REGRESSION MODELS
Types of predictions
Point estimates
Interval estimates
What is predicted
Population mean response E(y) for given x
Point on population regression line
Individual response (yi) for given x
WHAT IS PREDICTED

[Figure: the fitted line ŷi = β̂0 + β̂1x and the population line E(y) = β0 + β1x; at x = xp, the prediction ŷ estimates both the mean response E(y) and an individual response y.]
USING THE MODEL FOR ESTIMATION AND PREDICTION

100(1 − α)% confidence interval for the mean value of y at x = xp:

ŷ ± tα/2 · s · √(1/n + (xp − x̄)² / SSxx)

100(1 − α)% prediction interval for an individual new value of y at x = xp:

ŷ ± tα/2 · s · √(1 + 1/n + (xp − x̄)² / SSxx)

where tα/2 is based on (n − 2) degrees of freedom
EXAMPLE
Find a 95% confidence interval for the mean
monthly sales when the store spends $400
on advertising.
EXAMPLE
Predict the monthly sales for next month if
$400 is spent on advertising. Use a 95%
prediction interval.
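Both intervals can be obtained with R's predict function; a sketch using the fit object from above, with x = 4 corresponding to $400 of advertising:

    new_x <- data.frame(adv = 4)
    predict(fit, new_x, interval = "confidence", level = 0.95) # CI for mean sales at x = 4
    predict(fit, new_x, interval = "prediction", level = 0.95) # PI for an individual month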
CONFIDENCE INTERVALS VS. PREDICTION INTERVALS

[Figure: the fitted line ŷi = β̂0 + β̂1xi with a confidence band for the mean of y and a wider prediction band for individual y values; both bands are narrowest at x = x̄.]
MODEL ASSUMPTIONS
So far we have only estimated the deterministic component. Now we turn our attention to the random error ε. First we need some modeling assumptions…

Assumption 1: E(ε|x) = E(ε) = 0

The mean of the probability distribution of ε is 0. This implies that the mean value of y for a given value of x is β0 + β1x:

y = β0 + β1x + ε

Since E(ε|x) = E(ε) = 0,

E(y|x) = β0 + β1x

Sometimes this is just written as E(y) = β0 + β1x.
MODEL ASSUMPTIONS
Assumption 2: Homoskedasticity

• The variance of the probability distribution of ε is constant for all settings of the independent variable x. For our straight-line model, this assumption means that the variance of ε is equal to a constant, say σ², for all values of x.
• When this assumption does not hold, we say we have a problem of heteroskedasticity.
MODEL ASSUMPTIONS
Assumption 3: Normality

The probability distribution of ε is normal.

Assumption 4: No Autocorrelation

The values of ε associated with any two observed values of y are independent; that is, the value of ε associated with one value of y has no effect on the values of ε associated with other y values.
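These assumptions are often checked informally with residual plots. A minimal sketch using the fit object from the earlier example; this diagnostic step is an illustration added here, not part of the lecture:

    res <- residuals(fit)      # residuals estimate the errors
    plot(fitted(fit), res)     # look for constant spread around zero (homoskedasticity)
    abline(h = 0, lty = 2)     # reference line at zero
    qqnorm(res); qqline(res)   # rough check of the normality assumption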