Lecture 2 Linear Regression: A Model for the Mean
Sharyn O'Halloran
A closer look at:
- The linear regression model
- Least squares procedure
- Inferential tools
- Confidence and prediction intervals
- Assumptions
- Robustness
- Model checking
- Log transformations (of Y, X, or both)
Linear Regression: Introduction
Data: $(Y_i, X_i)$ for $i = 1, \ldots, n$
Interest is in the probability distribution of Y as a function of X
Linear regression model:
The mean of Y is a straight-line function of X, plus an error term or residual.
The goal is to find the best-fitting line: the one that minimizes the sum of squared errors.
Estimated regression line
Steer example (see Display 7.3, p. 177)
Equation for the estimated regression line:
$\hat{Y} = 6.98 - 0.73 \cdot \text{ltime}$
Intercept = 6.98; slope = -0.73.
[Figure: scatterplot of PH versus ltime with the fitted line; the error term for one observation is shown as the vertical distance from the point to the line.]
In Stata: create a new variable ltime = log(time), then run the regression analysis.
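A minimal Stata sketch of these steps, assuming the Case 7.1 data are loaded with variables time and ph (variable names assumed, not confirmed by the slide):

    * Create the log-transformed explanatory variable
    generate ltime = log(time)

    * Regress pH on log(time), then overlay the fitted line on the scatterplot
    regress ph ltime
    twoway (lfit ph ltime) (scatter ph ltime)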
Regression Terminology
Regression: the mean of a response variable as a function of one or more explanatory variables:
$\mu\{Y \mid X\}$
Regression model: an ideal formula to approximate the regression.
Simple linear regression model:
$\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
Here $\mu\{Y \mid X\}$ is the mean of Y given X (the regression of Y on X); $\beta_0$ (the intercept) and $\beta_1$ (the slope) are unknown parameters.
Regression Terminology
Y: dependent variable, explained variable, response variable
X: independent variable, explanatory variable, control variable
Y's probability distribution is to be explained by X.
$\beta_0$ and $\beta_1$ are the regression coefficients.
(See Display 7.5, p. 180)
Note: $Y = \beta_0 + \beta_1 X$ is NOT simple regression; the model describes the mean of Y given X, not Y itself.
Regression Terminology: Estimated coefficients
The unknown parameters $\beta_0$ and $\beta_1$ have estimates $\hat\beta_0$ and $\hat\beta_1$, giving the estimated line $\hat\beta_0 + \hat\beta_1 X$.
Choose $\hat\beta_0$ and $\hat\beta_1$ to make the residuals small.
Regression Terminology
Fitted value for observation i is its estimated mean:
$\hat{Y}_i = \text{fit}_i = \hat\mu\{Y \mid X_i\} = \hat\beta_0 + \hat\beta_1 X_i$
Residual for observation i:
$\text{res}_i = e_i = Y_i - \hat{Y}_i$
Least squares is the statistical estimation method that finds the estimates minimizing the sum of squared residuals:
$\sum_{i=1}^{n} \big( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \big)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Solution (from calculus) on p. 182 of Sleuth
Least Squares Procedure
The least-squares procedure obtains estimates of the linear equation coefficients $\beta_0$ and $\beta_1$ in the model
$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$
by minimizing the sum of the squared residuals (errors $e_i$):
$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum \big( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \big)^2$
Choose $\hat\beta_0$ and $\hat\beta_1$ so that this quantity is minimized.
Least Squares Procedure
The slope coefficient estimator is
$\hat\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{X})^2} = r_{xy} \dfrac{s_Y}{s_X}$
i.e., the correlation between X and Y times the standard deviation of Y over the standard deviation of X.
And the intercept estimator is
$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$
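A quick Stata sketch verifying the identity $\hat\beta_1 = r_{xy}\, s_Y / s_X$ (y and x are placeholder variable names):

    * Least-squares slope from regress
    regress y x
    display _b[x]

    * The same slope rebuilt from the correlation and standard deviations
    correlate y x
    local r = r(rho)
    summarize y
    local sy = r(sd)
    summarize x
    local sx = r(sd)
    display `r' * `sy' / `sx'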
Least Squares Procedure (cont.)
Note that the regression line always goes through the point $(\bar{X}, \bar{Y})$.
Think of the regression line as the expected value of Y for a given value of X: for any value of the independent variable, there is a single most likely value for the dependent variable.
[Figure: "Relation Between Yield and Fertilizer": scatterplot of Yield (bushel/acre) versus Fertilizer (lb/acre) with the fitted trend line.]
Tests and Confidence Intervals for $\beta_0$, $\beta_1$
Degrees of freedom: (n - 2) = sample size - number of coefficients
Estimate of the variance $\sigma^2 = \text{var}\{Y \mid X\}$:
$\hat\sigma^2 = (\text{sum of squared residuals}) / (n - 2)$
Standard errors (p. 184).
Ideal normal model: the sampling distributions of $\hat\beta_0$ and $\hat\beta_1$ have the shape of a t-distribution on (n - 2) d.f.
Do t-tests and CIs as usual (df = n - 2).
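For reference, a sketch of the usual test statistic and interval for the slope (standard formulas, written out):

    t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} \sim t_{n-2} \quad \text{under } H_0: \beta_1 = 0,
    \qquad
    \text{95\% CI: } \hat\beta_1 \pm t_{n-2}(0.975)\, SE(\hat\beta_1)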
[Regression output: the coefficient table reports p-values for the test of $H_0: \beta = 0$ and 95% confidence intervals for each coefficient.]
Inference Tools
Hypothesis test and confidence interval for the mean of Y at some X:
Estimate the mean of Y at $X = X_0$ by
$\hat\mu\{Y \mid X_0\} = \hat\beta_0 + \hat\beta_1 X_0$
Standard error of $\hat\mu\{Y \mid X_0\}$:
$SE[\hat\mu\{Y \mid X_0\}] = \hat\sigma \sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{(n-1) s_x^2}}$
Conduct the t-test and confidence interval in the usual way (df = n - 2).
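After a regression, Stata's lincom command gives this estimate, its standard error, the t-test, and the confidence interval in one step. A sketch, with x and the value 30 as placeholders:

    * Mean of Y at X = 30: estimate, SE, t-test, and 95% CI
    regress y x
    lincom _b[_cons] + 30*_b[x]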
Confidence bands for conditional means
Confidence bands in simple regression have an hourglass shape, narrowest at the mean of X.
The lfitci command automatically calculates and graphs the confidence bands.
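A minimal sketch (placeholder variable names):

    * Fitted line with 95% confidence band for the conditional mean,
    * overlaid on the scatterplot
    twoway (lfitci y x) (scatter y x)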
Prediction
Prediction of a future Y at $X = X_0$:
$\text{Pred}(Y \mid X_0) = \hat\mu\{Y \mid X_0\}$
Standard error of prediction:
$SE[\text{Pred}(Y \mid X_0)] = \sqrt{\hat\sigma^2 + \big(SE[\hat\mu\{Y \mid X_0\}]\big)^2}$
(the first term is the variability of Y about its mean; the second is the uncertainty in the estimated mean).
95% prediction interval:
$\text{Pred}(Y \mid X_0) \pm t_{df}(0.975) \cdot SE[\text{Pred}(Y \mid X_0)]$
Residuals vs. predicted values plot
After any regression analysis we can draw a residual-versus-fitted plot automatically, just by typing the rvfplot command.
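For example:

    * Residual-versus-fitted plot with a horizontal reference line at zero
    rvfplot, yline(0)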
Predicted values (yhat)
After any regression, the predict command can create a new variable, yhat, containing the predicted (fitted) Y values.
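A sketch:

    * Fitted values from the most recent regression (xb is the default)
    predict yhat, xb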
Residuals (e)
The resid option of the predict command can create a new variable, e, containing the residuals.
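A sketch:

    * Residuals from the most recent regression
    predict e, resid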
The residual-versus-predicted-values plot could also be drawn by hand, using commands like those sketched below.
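A plausible reconstruction, using the yhat and e variables created above:

    * Scatterplot of residuals against fitted values, drawn by hand
    twoway scatter e yhat, yline(0)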
Second type of confidence interval for regression prediction: the prediction band.
This expresses our uncertainty in estimating the unknown value of Y for an individual observation with a known X value.
Command: lfitci with the stdf option.
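A sketch (placeholder variable names):

    * Fitted line with 95% prediction band for individual observations
    twoway (lfitci y x, stdf) (scatter y x)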
Additional note: predict can generate two kinds of standard errors for the predicted y value, which have two different applications:
- stdp: standard error of the predicted mean, used for confidence bands for conditional means.
- stdf: standard error of the forecast, used for confidence bands for individual-case predictions.
[Figure: two plots of Distance versus VELOCITY. Left, confidence bands for conditional means (stdp), showing the 95% confidence interval for $\mu\{Y \mid 1000\}$; a confidence band is a set of confidence intervals for $\mu\{Y \mid X_0\}$. Right, confidence bands for individual-case predictions (stdf), showing the 95% prediction interval for Y at X = 1000.]
Calibration interval: the values of X for which $Y_0$ is in a prediction interval.
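A hedged sketch of how both kinds of bands could be built by hand after a regression (variable names assumed):

    * After: regress distance velocity
    predict yhat, xb                    // fitted values
    predict se_mean, stdp               // SE of the estimated mean
    predict se_pred, stdf               // SE of an individual forecast

    * 95% bands, using the t multiplier on the residual d.f.
    local t = invttail(e(df_r), 0.025)
    generate lo_mean = yhat - `t'*se_mean
    generate hi_mean = yhat + `t'*se_mean
    generate lo_pred = yhat - `t'*se_pred
    generate hi_pred = yhat + `t'*se_pred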
Notes about confidence and prediction bands
Both are narrowest at the mean of X.
Beware of extrapolation.
The width of the confidence interval shrinks toward zero as n gets large; this is not true of the prediction interval.
Review of simple linear regression
1. Model with constant variance:
$\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
$\text{var}\{Y \mid X\} = \sigma^2$
2. Least squares: choose estimators $\hat\beta_0$ and $\hat\beta_1$ to minimize the sum of squared residuals:
$\hat\beta_1 = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \Big/ \sum_{i=1}^{n} (X_i - \bar{X})^2$
$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$
$\text{res}_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i \quad (i = 1, \ldots, n)$
$\hat\sigma^2 = \sum_{i=1}^{n} \text{res}_i^2 \big/ (n - 2)$
3. Properties of estimators:
$SE(\hat\beta_1) = \hat\sigma \big/ \sqrt{(n-1) s_x^2}$
$SE(\hat\beta_0) = \hat\sigma \sqrt{(1/n) + \bar{X}^2 / \big((n-1) s_x^2\big)}$
Assumptions of Linear Regression
A linear regression model assumes:
- Linearity: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
- Constant variance: $\text{var}\{Y \mid X\} = \sigma^2$
- Normality: the distribution of Y at any X is normal
- Independence: given the $X_i$'s, the $Y_i$'s are independent
Examples of Violations
Non-linearity
The true relation between the independent and dependent variables may not be linear.
For example, consider campaign fundraising and the probability of winning an election: the probability of winning increases with each additional dollar spent and then levels off after $50,000.
[Figure: "Probability of Winning an Election": P(w) versus spending, rising and then leveling off at $50,000.]
Consequences of violation of linearity
If linearity is violated, misleading conclusions may occur (however, the degree of the problem depends on the degree of non-linearity).
Examples of Violations: Constant Variance
Constant variance or homoskedasticity
The homoskedasticity assumption implies that, on average, we do not expect to get larger errors in some cases than in others.
Of course, due to the luck of the draw, some errors will turn out to be larger than others. But homoskedasticity is violated only when this happens in a predictable manner.
Example: income and spending on certain goods.
People with higher incomes have more choices about what to buy. We would expect that their consumption of certain goods is more variable than for families with lower incomes.
Violation of constant variance
[Figure: "Relation between Income and Spending violates homoskedasticity": scatterplot of Spending versus Income with the fitted line. As income increases, so do the errors (the vertical distances from the predicted line), e.g. $\varepsilon_6 = Y_6 - (a + bX_6)$ and $\varepsilon_9 = Y_9 - (a + bX_9)$.]
Consequences of non-constant variance
If constant variance is violated, LS estimates are still unbiased, but SEs, tests, confidence intervals, and prediction intervals are incorrect.
However, the degree of the problem depends on the degree to which constant variance is violated.
Violation of Normality
Non-normality
Example: frequency of nicotine use. Nicotine use is characterized by a large number of people not smoking at all and another large number of people who smoke every day: an example of a bimodal distribution.
Consequence of non-Normality
If normality is violated:
- LS estimates are still unbiased
- tests and CIs are quite robust
- PIs are not
Of all the assumptions, this is the one we need to be least worried about violating. Why?
Violation of Independence
Non-independence
The independence assumption means that the error terms of different observations do not influence one another.
Technically, the residuals (error terms) must be uncorrelated.
The most common violation occurs with data collected over time, i.e., time-series analysis.
Example: high tariff rates in one period are often associated with very high tariff rates in the next period.
Example: nominal GNP and consumption.
[Figure: residuals of GNP and Consumption over time, which are highly correlated.]
Consequence of non-independence
If independence is violated:
- LS estimates are still unbiased
- everything else can be misleading
[Figure: log height versus log weight, with plotting code indicating litter (5 mice from each of 5 litters). Note that mice from litters 4 and 5 have higher weight and height.]
Robustness of least squares
The constant variance assumption is important.
Normality is not too important for confidence intervals and p-values, but it is important for prediction intervals.
Long-tailed distributions and/or outliers can heavily influence the results.
Non-independence problems: serial correlation (Ch. 15) and cluster effects (dealt with in Chs. 9-14).
Strategy for dealing with these potential problems:
- Plots; residual plots; consider outliers (more in Ch. 11)
- Log transformations (Display 8.6)
Tools for model checking
Scatterplot of Y vs. X (see Display 8.6, p. 213)*
Scatterplot of residuals vs. fitted values*
*Look for curvature, non-constant variance, and outliers.
Normal probability plot (p. 224): sometimes useful for checking whether the distribution is symmetric or normal (i.e., for PIs).
Lack-of-fit F-test when there are replicates (Section 8.5).
Scatterplot of Y vs. X
Command: graph twoway scatter Y X
Case study: 7.01, page 175
[Figure: scatterplot of Y versus X for the case-study data.]
Scatterplot of residuals vs. fitted values
Command: rvfplot, yline(0)
Case study: 7.01, page 175
[Figure: residuals versus fitted values.]
Normal probability plot (p. 224)
Quantile normal plots compare quantiles of a variable distribution with quantiles of a normal distribution having the same mean and standard deviation. They allow visual inspection for departures from normality in every part of the distribution.
Command: qnorm variable, grid
Case study: 7.01, page 175
[Figure: quantile-normal plot of the variable.]
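To check residual normality specifically, a sketch using the residuals from the previous regression:

    * Quantile-normal plot of the regression residuals
    predict e, resid
    qnorm e, grid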
Diagnostic plots of residuals
Plot residuals versus fitted values almost always:
- For simple regression this is about the same as residuals vs. X.
- Look for outliers, curvature, or increasing spread (a funnel or horn shape); then take appropriate action.
If data were collected over time, plot residuals versus time:
- Check for a time trend and serial correlation.
If normality is important, use a normal probability plot:
- A straight line is expected if the distribution is normal.
Voltage Example (Case Study 8.1.2)
Goal: to describe the distribution of breakdown time of an insulating fluid as a function of the voltage applied to it.
Y = breakdown time; X = voltage
Statistical illustrations:
- Recognizing the need for a log transformation of the response from the scatterplot and the residual plot
- Checking the simple linear regression fit with a lack-of-fit F-test
- Stata commands (follow)
Simple regression
The residuals vs. fitted values plot shows increasing spread with increasing fitted values.
Next step: we try log(Y), i.e., ltime = log(time).
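A hedged sketch of this sequence in Stata, assuming variables time and voltage (names assumed from the case study):

    * Simple regression of breakdown time on voltage, then diagnostics
    regress time voltage
    rvfplot, yline(0)            // spread increases with fitted values

    * Log-transform the response and refit
    generate ltime = log(time)
    regress ltime voltage
    rvfplot, yline(0)            // spread is now roughly constant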
Simple regression with Y logged
The residuals vs. fitted values plot does not show any obvious curvature or trend in spread.
Interpretation after log transformations
Model        Dependent variable   Independent variable   Interpretation of $\beta_1$
Level-level  Y                    X                      $\Delta y = \beta_1 \Delta x$
Level-log    Y                    log(X)                 $\Delta y = (\beta_1 / 100)\, \%\Delta x$
Log-level    log(Y)               X                      $\%\Delta y = (100 \beta_1)\, \Delta x$
Log-log      log(Y)               log(X)                 $\%\Delta y = \beta_1\, \%\Delta x$
Dependent variable logged
$\mu\{\log(Y) \mid X\} = \beta_0 + \beta_1 X$
is the same as (if the distribution of log(Y), given X, is symmetric):
$\text{Median}\{Y \mid X\} = e^{\beta_0 + \beta_1 X}$
As X increases by 1, what happens?
$\dfrac{\text{Median}\{Y \mid X = x + 1\}}{\text{Median}\{Y \mid X = x\}} = \dfrac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$
So $\text{Median}\{Y \mid X = x + 1\} = e^{\beta_1}\, \text{Median}\{Y \mid X = x\}$.
Interpretation of Y logged
As X increases by 1, the median of Y changes by the multiplicative factor $e^{\beta_1}$. Or, better:
If $\beta_1 > 0$: as X increases by 1, the median of Y increases by $(e^{\beta_1} - 1) \times 100\%$.
If $\beta_1 < 0$: as X increases by 1, the median of Y decreases by $(1 - e^{\beta_1}) \times 100\%$.
Example: $\mu\{\log(\text{time}) \mid \text{voltage}\} = \beta_0 + \beta_1 \cdot \text{voltage}$
$\hat\mu\{\log(\text{time}) \mid \text{voltage}\} = 18.96 - 0.507 \cdot \text{voltage}$
Since $1 - e^{-0.507} \approx 0.40$, it is estimated that the median breakdown time decreases by 40% with each 1 kV increase in voltage.
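Checking the arithmetic in Stata:

    * Percent decrease in median breakdown time per 1 kV increase
    display (1 - exp(-0.507)) * 100      // roughly 40 percent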
[Figure: left, breakdown time (minutes) versus VOLTAGE with fitted values; right, logarithm of breakdown time versus VOLTAGE with fitted values.]
If the explanatory variable (X) is logged
If $\mu\{Y \mid \log(X)\} = \beta_0 + \beta_1 \log(X)$, then:
associated with each two-fold increase (i.e., doubling) of X is a $\beta_1 \log(2)$ change in the mean of Y.
An example follows.
Example with X logged
(Display 7.3, Case 7.1)
Y = pH; X = time after slaughter (hrs.)
Estimated model: $\hat\mu\{Y \mid \log(X)\} = 6.98 - 0.73 \log(X)$
Since $-0.73 \log(2) \approx -0.5$: it is estimated that for each doubling of time after slaughter (between 0 and 8 hours), the mean pH decreases by 0.5.
[Figure: pH versus ltime with fitted values (left) and pH versus TIME with fitted values (right).]
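Checking the arithmetic in Stata:

    * Change in mean pH per doubling of time after slaughter
    display -0.73 * log(2)               // roughly -0.5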
Both Y and X logged
$\mu\{\log(Y) \mid \log(X)\} = \beta_0 + \beta_1 \log(X)$
is the same as:
$\text{Median}\{Y \mid X\} = e^{\beta_0} X^{\beta_1}$
As X doubles, what happens?
If $\beta_1 > 0$: as X doubles, the median of Y increases by $(e^{\beta_1 \log(2)} - 1) \times 100\%$.
If $\beta_1 < 0$: as X doubles, the median of Y decreases by $(1 - e^{\beta_1 \log(2)}) \times 100\%$.
Example with Y and X logged
Y = number of species on an island; X = island area
(Display 8.1, page 207)
$\hat\mu\{\log(Y) \mid \log(X)\} = \hat\beta_0 + \hat\beta_1 \log(X)$
Y and X logged
$\hat\mu\{\log(Y) \mid \log(X)\} = 1.94 + 0.25 \log(X)$
Since $e^{0.25 \log(2)} \approx 1.19$: associated with each doubling of island area is a 19% increase in the median number of bird species.
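Checking the arithmetic in Stata:

    * Percent increase in median species per doubling of island area
    display (exp(0.25 * log(2)) - 1) * 100     // roughly 19 percent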
Example: Log-Log
To graph the log-log plot we need to generate two new variables (natural logarithms).
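A sketch, assuming variables species and area (names assumed):

    * Generate natural-log versions of both variables
    generate lspecies = log(species)
    generate larea = log(area)

    * Fit and plot the log-log regression
    regress lspecies larea
    twoway (lfit lspecies larea) (scatter lspecies larea)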