2015 Regression Using Stata and SAS
Hsueh-Sheng Wu
CFDR Workshop Series
October 19, 2015
Outline
• What is regression analysis?
• Why is regression analysis popular?
• A primitive way of conducting regression analysis
• A better way of conducting regression analysis:
Corrections for violations in regression assumptions for
– Linearity
– Mean independence
– Homoscedasticity
– Uncorrelated disturbances
– Normal disturbance
• Conclusions
What Is Regression?
Regression is used to study the relation between a single
dependent variable and one or more independent
variables. In regression, the dependent variable y is a
linear function of the x’s, plus a random disturbance ε.
y = a + b_1 x_1 + b_2 x_2 + ε
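To make the equation concrete, here is a minimal Stata sketch that simulates data from this model and then estimates it. The parameter values (a = 1, b1 = 2, b2 = -3), the seed, and the sample size are illustrative assumptions, not values from the workshop.

```stata
* Simulate y = a + b1*x1 + b2*x2 + e with assumed values a = 1, b1 = 2, b2 = -3
clear
set seed 12345
set obs 1000
generate x1 = rnormal()
generate x2 = rnormal()
generate e  = rnormal()            // random disturbance
generate y  = 1 + 2*x1 - 3*x2 + e
* The estimated constant and slopes should be close to 1, 2, and -3
regress y x1 x2
```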
Five Assumptions of Regression
1. Linearity
– y is a linear function of the x’s
2. Mean independence
– the mean of the disturbance term is always 0 and
does not depend on the value of x’s
3. Homoscedasticity
– The variance of ε does not depend on the x’s
4. Uncorrelated disturbances
– The value of ε for any individual in the sample is not
correlated with the value of ε for any other individual
5. Normal disturbance
– ε has a normal distribution
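The five assumptions can also be written compactly. This is a sketch for the two-predictor model above; the observation index i and the symbol σ² for the disturbance variance are notation assumed here, not taken from the slides.

```latex
\begin{align*}
\text{1. Linearity:} \quad & y_i = a + b_1 x_{i1} + b_2 x_{i2} + \varepsilon_i \\
\text{2. Mean independence:} \quad & E(\varepsilon_i \mid x_{i1}, x_{i2}) = 0 \\
\text{3. Homoscedasticity:} \quad & \operatorname{Var}(\varepsilon_i \mid x_{i1}, x_{i2}) = \sigma^2 \\
\text{4. Uncorrelated disturbances:} \quad & \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j \\
\text{5. Normal disturbance:} \quad & \varepsilon_i \sim N(0, \sigma^2)
\end{align*}
```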
Why Is Regression Analysis Popular?
• Statistical convenience. All statistical software packages provide
regression analysis.
• Intuitive logic. Regression analysis fits our thinking style:
once we observe a phenomenon (i.e., the dependent variable),
we ask what may contribute to it.
• Various types of regression models
– Based on the number of independent variables
• Simple regression
• Multiple Regression
– Based on the type of the dependent variable
• Ordinary least squares regression
• Logistic regression
• Ordered logistic regression
• Multinomial logistic regression
• Poisson regression
– Based on the number of dependent variables
• Structural Equation Modeling
• Hierarchical Linear Regression
A Primitive Way of Conducting Regression Analysis
• Decide a research question
e.g., whether the price of a car is determined by its weight,
length, and repair record
SAS output (table values not recoverable):
Analysis of Variance: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F
Parameter Estimates: Variable, Label, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|
Stata commands:
webuse auto.dta, clear
reg price weight length rep78
Stata Output:
A Better Way of Conducting Regression Analysis
• Decide a research question
Linearity Assumption (Cont.)
How to detect an incorrectly specified model?
•Plot y against x
•Plot residuals against x
•Plot residuals against yhat
SAS commands:
proc reg data=auto;
model price = length;
plot price*length;
plot rstudent.*length;
plot rstudent.*p. / noline;
run;
Linearity Assumption (Cont.)
Stata commands:
webuse auto.dta, clear
reg price length
predict r, rstudent
predict yhat, xb
scatter price length
scatter r length
scatter r yhat
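As an alternative to building the plots by hand, Stata's regress postestimation commands draw similar diagnostics directly. This is a minimal sketch; note that rvfplot and rvpplot plot raw residuals, not the studentized residuals used above.

```stata
webuse auto, clear
regress price length
* Residuals against fitted values, with a reference line at zero
rvfplot, yline(0)
* Residuals against the predictor
rvpplot length, yline(0)
* Augmented component-plus-residual plot; the lowess curve helps reveal curvature
acprplot length, lowess
```

A visible curve in any of these plots suggests that the linear specification may be inadequate.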
Linearity Assumption (Cont.)
[Figure: plot of studentized residuals; axis values not recoverable]
Linearity Assumption (Cont.)
Check for influential observations:
• Outliers:
Observations whose standardized residuals are above +2 or below -2 may
be outliers.
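A minimal Stata sketch of this check, assuming the three-predictor model used elsewhere in the workshop. The variable names rstd and rstu are hypothetical.

```stata
webuse auto, clear
regress price weight length rep78
predict rstd, rstandard      // standardized residuals
predict rstu, rstudent       // studentized residuals
* List cars whose standardized residual is beyond +2 or -2
list make price rstd rstu if abs(rstd) > 2 & !missing(rstd)
```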
Linearity Assumption (Cont.)
SAS commands:
proc reg data = in.auto;
model price = weight length rep78;
Output out=in.outlier(keep = make price weight length rep78 r lever cooked dffit)
rstudent = r h=lever cookd = cooked dffits = dffit;
run;
quit;
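A possible Stata counterpart to the SAS OUTPUT statement above, sketched under two assumptions: the variable names (lever, cooked, dfit) simply mirror the SAS ones, and the leverage and Cook's D cutoffs, (2k+2)/n and 4/n, are conventional rules of thumb rather than values stated in the slides.

```stata
webuse auto, clear
regress price weight length rep78
scalar n = e(N)              // 69 observations with non-missing rep78
scalar k = e(df_m)           // 3 predictors
predict lever, leverage
predict cooked, cooksd
predict dfit, dfits
* Flag potentially influential cars (rule-of-thumb cutoffs)
list make lever  if lever > (2*k + 2)/n & !missing(lever)
list make cooked if cooked > 4/n & !missing(cooked)
list make dfit   if abs(dfit) > 2*sqrt(k/n) & !missing(dfit)
```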
Linearity Assumption (Cont.)
Stata commands:
reg price weight length rep78
dfbeta
list make _dfbeta_1 _dfbeta_2 _dfbeta_3 if abs(_dfbeta_1) > (2/sqrt(69)) & _dfbeta_1 ~=.
Linearity Assumption (Cont.)
. list make dfit if abs(dfit)>2*sqrt(3/69) & dfit ~=.
. list make _dfbeta_1 _dfbeta_2 _dfbeta_3 if abs(_dfbeta_1) > (2/sqrt(69)) & _dfbeta_1 ~=.
[Stata output listings not recoverable]
Linearity Assumption (Cont.)
Solutions:
• Re-specify the model by mathematically transforming the x’s, e.g., for a
curvilinear relation, you can square the x’s.
– Log transformation
– Exponential transformation uses the inverse of a logarithm, as in x’ = e^x
– Polynomial transformation uses powers of the variable, as in x’ = x^2, x’ = x^3, or x’ =
SQRT(x). We use this approach often in multiple regression.
– rescale the x variable into a dummy (dichotomous) variable
• Restrict the range of x
• Identify the influential cases and examine whether they should be
included in the sample
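These remedies might look as follows in Stata. This is a sketch only: the new variable names (lnlength, long_car), the median split, and the cutoff of 220 are arbitrary choices made for illustration.

```stata
webuse auto, clear
* Log transformation of the predictor
generate lnlength = ln(length)
regress price lnlength
* Polynomial transformation: length plus length squared
regress price c.length##c.length
* Rescale x into a dummy: 1 if length is above the sample median
summarize length, detail
generate long_car = length > r(p50) if !missing(length)
regress price long_car
* Restrict the range of x
regress price length if length < 220
```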
Mean Independence
What does it mean?
• The mean of the disturbance term is always 0 and does not depend
on the value of x’s.
• Possible causes of violating this assumption:
– omitted x variables: a problem if any omitted variable that affects y is
associated with the x’s.
– reverse causation: if y influences the x’s, then ε is associated with the
x’s.
– measurement error in x: the measured x contains not only the true x but also
something else (error). This something else ends up in ε.
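The omitted-variable case can be worked out in a few lines. This sketch assumes the two-predictor model from earlier, with x_2 left out of the fitted regression; the symbols c, d, and u are introduced here for illustration.

```latex
\begin{align*}
\text{True model:} \quad & y = a + b_1 x_1 + b_2 x_2 + \varepsilon \\
\text{Relation between the x's:} \quad & x_2 = c + d\,x_1 + u \\
\text{Substituting:} \quad & y = (a + b_2 c) + (b_1 + b_2 d)\,x_1 + (\varepsilon + b_2 u)
\end{align*}
```

Regressing y on x_1 alone therefore estimates b_1 + b_2 d rather than b_1. The bias b_2 d is nonzero whenever the omitted x_2 both affects y (b_2 ≠ 0) and is associated with x_1 (d ≠ 0).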
Mean Independence (Cont.)
How to detect the violation?
Link test: if the current model is a good model, no
additional predictors have significant
associations with the dependent variable.
Mean Independence (Cont.)
SAS commands for Link test:
proc reg data=auto;
model price = length;
output out=auto2 (keep= price length yhat) predicted=yhat;
run;
quit;
data auto3;
set auto2;
yhat2= yhat**2;
run;
proc reg data=auto3;
model price = yhat yhat2;
run;
Stata commands:
webuse auto.dta, clear
reg price length
predict yhat, xb
gen yhat2 = yhat*yhat
reg price yhat yhat2
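Stata also ships a built-in linktest command that automates the manual steps above, and estat ovtest (Ramsey's RESET test) is a related specification check. A minimal sketch:

```stata
webuse auto, clear
regress price length
* Ramsey RESET test for omitted variables (run before linktest,
* because linktest replaces the stored estimation results)
estat ovtest
* Link test: a significant _hatsq suggests the model is misspecified
linktest
```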
Mean Independence (Cont.)
Solutions:
• Use the existing literature to justify your model
• Use an experimental design to collect your data, which not only supports
the mean independence assumption, but also avoids reverse
causation
• If you use survey design and have measures of relevant variables
that have not been included in the model, you can include these
variables in the model to reduce the possibility of violating this
assumption
• Use simultaneous equations to model reciprocal relations between
x’s and y
• Choose measures with high reliability or include measurement
models in regression analysis
Homoscedasticity
What does it mean?
• Homoscedasticity means that the variance of ε is the same across
all levels of x’s.
• Possible causes of violating this assumption:
– Improvement in data collection techniques: over the course of
data collection, interviewers get better and become less likely
to make errors in collecting data.
– Learning: respondents are less likely to make errors in
answering the same questions in the follow-up survey than in
the baseline survey.
– Outliers
What are the consequences?
• Inefficiency: observations with larger disturbance variance contain
less information than observations with smaller disturbance
variance, but OLS weights them equally.
• Biased standard errors can lead to incorrect conclusions.
Homoscedasticity (Cont.)
How to detect the violation?
• Plot residuals against X
• Plot residuals against Yhat
• White test
• Cameron & Trivedi's decomposition of IM-test
• Breusch-Pagan / Cook-Weisberg
SAS commands:
proc reg data=auto;
model price = length weight rep78/ spec;
run; quit;
Stata commands:
webuse auto.dta, clear
reg price length weight rep78
estat imtest, white
estat hettest
Homoscedasticity (Cont.)
Solutions:
• Re-specify the model or transform the dependent variable
• Use robust standard errors
• Use weighted least squares only if you know what weights to use
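The first two solutions might be sketched in Stata as follows. The log transformation of price and the variable name lnprice are illustrative choices, not recommendations from the slides.

```stata
webuse auto, clear
* Robust (heteroscedasticity-consistent) standard errors
regress price length weight rep78, vce(robust)
* Transform the dependent variable, then re-check for heteroscedasticity
generate lnprice = ln(price)
regress lnprice length weight rep78
estat hettest
```

Robust standard errors leave the coefficient estimates unchanged and only adjust the standard errors, so the coefficients in the first model match those from plain OLS.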
Uncorrelated Disturbances
What does it mean?
• The disturbance terms for any two individuals must be
uncorrelated.
• Possible causes of violating this assumption:
– Sample design: simple random sampling is not likely to cause this
problem, but cluster sampling is.
– The choice of unit of analysis, e.g., the couple
– The use of panel data
Solutions:
• Include the cluster variables in the model as controls
• Use the cluster option in the regression analysis
• Use regression that can control for correlations among observations,
for example, Hierarchical Linear Model
Uncorrelated Disturbances (Cont.)
Solutions:
• Incorporate the correlations among respondents into the regression
model
SAS commands:
proc genmod data=auto;
class foreign;
model price = length weight rep78;
repeated subject=foreign / type=ind ;
run;
Stata commands:
reg price length rep78 weight, cluster(foreign)
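Two of the solutions listed earlier, sketched in Stata. The variable foreign takes only two values in the auto data, so it serves here purely to illustrate the syntax; real applications need many more clusters for cluster-robust standard errors to be trustworthy.

```stata
webuse auto, clear
* Include the cluster variable as a control
regress price length rep78 weight i.foreign
* Cluster-robust standard errors; vce(cluster ...) is the current
* spelling of the older cluster() option
regress price length rep78 weight, vce(cluster foreign)
```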
Normality
What does it mean:
• The disturbance term ε needs to be normally distributed, but the x’s and y
do not. Departures from normality include:
– Positive Skewness
– Negative Skewness
– Positive Kurtosis
– Negative Kurtosis
• Possible causes of violation of this assumption
– The true distribution of the variable, e.g., some variables follow a binomial or
poisson distribution.
– Measurement artifacts
– Inadequate sample
Normality (Cont.)
SAS Commands:
proc reg data=auto;
model price= length weight rep78;
output out=auto2 (keep= price length weight rep78 res yhat)
residual=res predicted=yhat;
run;
Stata commands:
reg price length rep78 weight
predict res, residuals
swilk res
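Graphical checks and a skewness/kurtosis test can complement the Shapiro-Wilk test. This is a minimal sketch; the residual variable name ehat is hypothetical.

```stata
webuse auto, clear
regress price length rep78 weight
predict ehat, residuals
* Kernel density of the residuals with a normal curve overlaid
kdensity ehat, normal
* Quantile-normal plot: points far from the line indicate non-normality
qnorm ehat
* Joint test of skewness and kurtosis
sktest ehat
```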
Normality (Cont.)
Solutions:
• Using larger samples
• Using conservative p-values (e.g., using 0.01 rather than
0.05)
Conclusions
• Regression analysis is the most commonly used technique
in the social sciences
• To use regression analysis accurately, you need to check
for possible violations of the regression assumptions
• Other useful resources for learning to conduct regression analysis
– http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm
– http://www.ats.ucla.edu/stat/stata/webbooks/reg/
– http://www.indiana.edu/~statmath/stat/all/panel/
– http://dss.princeton.edu/online_help/analysis/regression_intro.htm