Regression III: Dummy Variable Regression
Tom Ilvento
FREC 408

Linear Regression Assumptions about the error term
Mean of Probability Distribution of the Error term is zero
Probability Distribution of Error Has Constant Variance = σ²
Probability Distribution of Error is Normal
Errors are Independent – they are uncorrelated with each other (page 553)

Regression Applications - we will look at
Dummy variable regression as an alternative to ANOVA
Multiple Regression

Dummy Variable Regression
Regression using a dichotomous independent variable or set of variables
Regression will estimate the same relationship as ANOVA
There will be a few important changes in Excel
The data need to be in columns with matched data for all the variables – no missing values for any variable

For Any Categorical Variable
I can represent any categorical variable with j classes with j-1 dummy variables, coded as 0 and 1
For example, Sex has 2 classes – male and female
Represent as one variable coded 1 if female and 0 if male

Dummy Variables
Example: Religion (Protestant, Catholic, Jewish)
Dummy 1 (X1) = 1 if Protestant, 0 if not
Dummy 2 (X2) = 1 if Catholic, 0 if not
If you are Jewish, then you will have a value of zero on Dummy 1 (X1) and Dummy 2 (X2)
The other class is called the reference category and is captured in the intercept term
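
The j-1 coding rule can be sketched in a few lines of Python. This is only an illustration, not part of the Excel workflow used in these slides; the function name `dummy_code` and the four-person sample list are hypothetical.

```python
def dummy_code(values, reference):
    """Code a categorical variable as j-1 dummy (0/1) columns.

    The reference class gets no column of its own: it is the case where
    every dummy equals 0, so it is captured by the intercept.
    """
    classes = list(dict.fromkeys(values))  # unique classes, first-seen order
    if reference not in classes:
        raise ValueError("reference must be one of the observed classes")
    names = [c for c in classes if c != reference]
    rows = [[1 if v == c else 0 for c in names] for v in values]
    return names, rows


# Hypothetical sample; Jewish is the reference category as on the slide.
religion = ["Protestant", "Catholic", "Jewish", "Catholic"]
names, rows = dummy_code(religion, reference="Jewish")

print(names)
for person, dummies in zip(religion, rows):
    print(person, dummies)
```

With three classes the function returns two columns, and the Jewish respondent is coded zero on both.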

Example Problem
Examines the Sorption Rate of three different hazardous organic solvents
  Aromatics
  Chloroalkanes
  Esters
Asks if there are differences among the three
Sample of 32 sorption rates across the three classes of organic hazardous solvents
The dependent variable is Sorption Rate

ANOVA Results
Anova: Single Factor
SUMMARY
Groups       Count   Sum     Average   Variance
Aromatics      9     8.48    0.942     0.028
Cloro          8     8.05    1.006     0.161
Esters        15     4.95    0.330     0.043

ANOVA
Source of Variation   SS      df   MS      F        P-value   F crit
Between Groups        3.305    2   1.653   24.512   0.000     3.328
Within Groups         1.955   29   0.067
Total                 5.261   31
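
As a rough check, the sums of squares and the F statistic can be rebuilt from the SUMMARY table alone. The sketch below uses only the counts, sums, and (rounded) variances printed above, so the result is close to, but not exactly, the Excel value.

```python
# (count, sum, variance) for each group, copied from the SUMMARY table.
groups = {
    "Aromatics": (9, 8.48, 0.028),
    "Cloro": (8, 8.05, 0.161),
    "Esters": (15, 4.95, 0.043),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(s for _, s, _ in groups.values()) / n_total

# Between-groups SS: squared distance of each group mean from the grand mean,
# weighted by group size.
ss_between = sum(n * (s / n - grand_mean) ** 2 for n, s, _ in groups.values())

# Within-groups SS: each sample variance times its degrees of freedom.
ss_within = sum((n - 1) * var for n, _, var in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(round(ss_between, 3), round(ss_within, 3), round(f_stat, 2))
```

The between-groups sum of squares matches the table to three decimals; the within-groups value and F differ slightly because the variances are rounded to three places.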

ANOVA Hypothesis Test for the Factor
Null hypothesis     H0: μ1 = μ2 = μ3
Alternative         Ha: At least two means differ
Assumptions         Equal variances, normal distribution
Test Statistic      F* = 24.512
Rejection Region    F.05, 2, 29 d.f. = 3.328
Conclusion          F* > F
                    24.512 > 3.328
                    Reject H0: μ1 = μ2 = μ3
There are differences across the organic chemicals

Regression Approach
For Excel, reorganize the data
Dependent variable is in a single column
Classes are coded as 0/1 in contiguous columns
Run Tools, Data Analysis, Regression and pick two of the three classes to be included in the model

Let’s look at Excel

Regression Output with Esters as the reference category
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.7927
R Square             0.6283
Adjusted R Square    0.6027
Standard Error       0.2597
Observations         32

ANOVA
             df   SS       MS       F         Signif. F
Regression    2   3.3054   1.6527   24.5115   0.0000
Residual     29   1.9553   0.0674
Total        31   5.2608

ANOVA Table is the same!
Anova: Single Factor
SUMMARY
Groups       Count   Sum     Average   Variance
Aromatics      9     8.48    0.942     0.028
Cloro          8     8.05    1.006     0.161
Esters        15     4.95    0.330     0.043

ANOVA
Source of Variation   SS      df   MS      F        P-value   F crit
Between Groups        3.305    2   1.653   24.512   0.000     3.328
Within Groups         1.955   29   0.067
Total                 5.261   31

Regression gives us Coefficients
            Coefficients   Std Error   t Stat   P-value
Intercept   0.3300         0.0670      4.9221   0.0000
Aromatics   0.6122         0.1095      5.5919   0.0000
Cloro       0.6763         0.1137      5.9487   0.0000

Ŷ = .3300 + .6122(Aromatics) + .6763(Cloro)

Estimated values from our equation
Since our independent variables are dummy variables, it is easy to solve the equation
Ŷ = .3300 + .6122(Aromatics) + .6763(Cloro)
When Aromatics = 1
  = .33 + .6122(1) + .6763(0) = .9422
When Chloroalkanes = 1
  = .33 + .6122(0) + .6763(1) = 1.006
When Aromatics and Chloroalkanes = 0
  = .33 + .6122(0) + .6763(0) = .3300
  This represents Esters!

The model estimated the mean levels for each solvent
                     Aromatics   Cloro   Esters
Mean                 0.942       1.006   0.330
Standard Error       0.056       0.142   0.054
Median               0.950       1.015   0.340
Mode                 #N/A        #N/A    0.060
Standard Deviation   0.168       0.401   0.208
Sample Variance      0.028       0.161   0.043
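
The link between the coefficients and the group means can be written out directly. The sketch below assumes, as the slides show, that the intercept equals the reference-group mean and each dummy coefficient equals that group's mean minus the reference mean; the `predict` helper is a hypothetical name, and the means are rebuilt from the sums and counts in the ANOVA summary.

```python
# Group means rebuilt from the ANOVA summary (sum / count).
means = {"Aromatics": 8.48 / 9, "Cloro": 8.05 / 8, "Esters": 4.95 / 15}

reference = "Esters"
intercept = means[reference]
# Each dummy coefficient is the gap between a group's mean and the reference mean.
coefs = {g: m - intercept for g, m in means.items() if g != reference}


def predict(aromatics, cloro):
    """Evaluate the fitted equation for one combination of the two dummies."""
    return intercept + coefs["Aromatics"] * aromatics + coefs["Cloro"] * cloro


print(round(intercept, 4), {g: round(c, 4) for g, c in coefs.items()})
print(round(predict(1, 0), 4), round(predict(0, 1), 4), round(predict(0, 0), 4))
```

Setting one dummy to 1 returns that group's mean, and setting both to 0 returns the Esters mean.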

The t-test for the dummy coefficients
            Coefficients   Std Error   t Stat   P-value
Intercept   0.3300         0.0670      4.9221   0.0000
Aromatics   0.6122         0.1095      5.5919   0.0000
Cloro       0.6763         0.1137      5.9487   0.0000

Ŷ = .3300 + .6122(Aromatics) + .6763(Cloro)
The t-tests for Aromatics and Cloro test whether each is significantly different from Esters, i.e., a difference of means test!!

Hypothesis Test for a slope coefficient for Cloro
Null hypothesis     H0: β2 = 0
Alternative         Ha: β2 ≠ 0 two-tailed test
Assumptions         Large sample, normal
Test Statistic      t* = (.6763 - 0)/.1137
Calculation         t* = 5.9487
P-value             P = .000
Conclusion          Reject H0: β2 = 0
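
To see why this is a difference of means test, the standard error of a dummy coefficient can be rebuilt from the residual mean square and the two group sizes. This is a sketch under the usual equal-variance assumption, using the standard pooled formula sqrt(MSE × (1/n_group + 1/n_reference)); the helper name `dummy_se` is hypothetical.

```python
import math

mse = 0.0674  # Residual MS from the regression ANOVA table
n = {"Aromatics": 9, "Cloro": 8, "Esters": 15}


def dummy_se(group, reference="Esters"):
    """Standard error of the difference between a group mean and the reference mean."""
    return math.sqrt(mse * (1 / n[group] + 1 / n[reference]))


se_cloro = dummy_se("Cloro")
se_aromatics = dummy_se("Aromatics")
t_cloro = (0.6763 - 0) / se_cloro  # coefficient taken from the output above

print(round(se_aromatics, 4), round(se_cloro, 4), round(t_cloro, 2))
```

Both standard errors agree with the Excel output to four decimals, and the t statistic for Cloro lands within rounding of 5.9487.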

Regression Output with Aromatics as the reference category
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.7927
R Square             0.6283
Adjusted R Square    0.6027
Standard Error       0.2597
Observations         32

ANOVA
             df   SS       MS       F         Signif F
Regression    2   3.3054   1.6527   24.5115   0.0000
Residual     29   1.9553   0.0674
Total        31   5.2608

            Coefficients   Std Error   t Stat    P-value
Intercept    0.9422        0.0866      10.8858   0.0000
Cloro        0.0640        0.1262       0.5075   0.6157
Esters      -0.6122        0.1095      -5.5919   0.0000

The coefficients reflect the differences of Cloro and Esters from Aromatics
Notice that the t-test for Cloro shows that it is not significantly different from Aromatics
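
Changing the reference category only re-expresses the same three means. The sketch below recomputes the coefficient table for any choice of reference from the group means, group sizes, and the residual mean square reported above; `fit_with_reference` is a hypothetical helper, not an Excel feature.

```python
import math

means = {"Aromatics": 8.48 / 9, "Cloro": 8.05 / 8, "Esters": 4.95 / 15}
n = {"Aromatics": 9, "Cloro": 8, "Esters": 15}
mse = 0.0674  # Residual MS; it does not depend on the reference category


def fit_with_reference(reference):
    """Return the intercept and {group: (coef, std_error, t)} for one reference."""
    intercept = means[reference]
    rows = {}
    for group, mean in means.items():
        if group == reference:
            continue
        coef = mean - intercept
        se = math.sqrt(mse * (1 / n[group] + 1 / n[reference]))
        rows[group] = (coef, se, coef / se)
    return intercept, rows


intercept, rows = fit_with_reference("Aromatics")
print(round(intercept, 4))
for group, (coef, se, t) in rows.items():
    print(group, round(coef, 4), round(se, 4), round(t, 3))
```

With Aromatics as the reference, the Esters coefficient is the Aromatics coefficient from the earlier output with its sign flipped, and the small t for Cloro reflects how close the Cloro and Aromatics means are.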

MN Apartment Sales Example
A real estate agent wants to use regression analysis to explore the relationship between the sale prices of apartment buildings and various characteristics of the apartments, including:
  # of apartments
  age of structure
  lot size
  # of on-site parking spaces
  gross building area
  condition of apartment building

We will focus on Condition
Price      Condition
$79,300    F
$90,300    F
$93,600    F
$108,750   G
$110,000   G
$134,400   E
$155,700   E
$157,500   G
$162,500   G
Condition has 3 levels or categories
  Fair
  Good
  Excellent
This is just part of the data. We need to create the dummy variables

MN Apartment Sales Example - dummy variables
First, look at the relationship between PRICE and Condition of apartment building
There are three categories – Excellent, Good and Fair.
Need (3-1) = 2 dummy variables
X1 = 1, if condition is Excellent
     0, if condition is NOT excellent
X2 = 1, if condition is Good
     0, if condition is NOT good
The last category “Fair” is the reference category
When X1=0 and X2=0, the condition of apartment building is Fair
I will label the two dummy variables (X1 and X2) as Excellent and Good

Regression Analysis
In Excel, we want the regression of Price on Condition by using two dummy variables X1 and X2
I will label them as Excellent and Good
Reorganize the data
  Dependent variable Price is in a single column
  Excellent and Good are coded as 0/1
Run Tools, Data Analysis, Regression and pick Excellent and Good to be included in the model

I create two dummy variables to represent Condition
Price      Condition   Excellent   Good
$79,300    F           0           0
$90,300    F           0           0
$93,600    F           0           0
$108,750   G           0           1
$110,000   G           0           1
$134,400   E           1           0
$155,700   E           1           0
$157,500   G           0           1
$162,500   G           0           1

There do seem to be differences across the groups
[Figure: Mean Apartment Price By Condition. Mean level, on a scale from $0 to $400,000, for Excellent, Good, and Fair]

There also is a lot of spread across the groups
[Figure: Graph of Price by Condition. Prices for Fair, Good, and Excellent on an axis running from 79290 to 979290]

Regression Output with two dummy variables
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.288
R Square             0.083
Adjusted R Square    0.000
Standard Error       211572.694
Observations         25

ANOVA
             df   SS                    MS                 F       Sig F
Regression    2   89083840373.383       44541920186.692    0.995   0.386
Residual     22   984786106940.857      44763004860.948
Total        24   1073869947314.240

            Coeff        Std Error    t Stat   P-value
Intercept   176940.000    94618.185   1.870    0.075
Excellent   173310.000   128113.628   1.353    0.190
Good        128641.286   110226.850   1.167    0.256

R² is only 0.083. Is this number too small?
Is our model a good one?…
… We’re going to do some tests on it later.

Regression gives us Coefficients and Estimated Model between Price and Condition
            Coeff       Std Error    t Stat   P-value
Intercept   176940.00    94618.185   1.870    0.075
Excellent   173310.00   128113.628   1.353    0.190
Good        128641.29   110226.850   1.167    0.256
Estimated equation/model:
Ŷ = 176940 + 173310 * (Excellent) + 128641 * (Good)

Estimated values from our equation
When Excellent = 1
  Price = 176940 + 173310(1) + 128641(0) = $350,250
When Good = 1
  Price = 176940 + 173310(0) + 128641(1) = $305,581
When Excellent and Good = 0, Fair
  Price = 176940 + 173310(0) + 128641(0) = $176,940

Regression/ANOVA Output
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.288
R Square             0.083
Adjusted R Square    0.000
Standard Error       211572.694
Observations         25

ANOVA
             df   SS                    MS                 F       Sig F
Regression    2   89083840373.383       44541920186.692    0.995   0.386
Residual     22   984786106940.857      44763004860.948
Total        24   1073869947314.240

            Coeff        Std Error    t Stat   P-value
Intercept   176940.000    94618.185   1.870    0.075
Excellent   173310.000   128113.628   1.353    0.190
Good        128641.286   110226.850   1.167    0.256

F Test for the estimated model
Null hypothesis     H0: μ1 = μ2 = 0
Alternative         Ha: At least one mean differs from 0
Assumptions         Equal variances, normal distribution
Test Statistic      F* = 0.995
Rejection Region    F.05, 2, 22 d.f. = 3.44
Conclusion          F* < F
                    0.995 < 3.44
                    Can’t reject H0: μ1 = μ2 = 0
It seems the conditions have no statistically significant impact on apartment price.
Is that true in real life?
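
The summary figures in this output all follow from the two sums of squares and the degrees of freedom. The sketch below recomputes R², adjusted R², and F from the ANOVA table values; the variable names are my own.

```python
ss_regression = 89083840373.383
ss_residual = 984786106940.857
n_obs = 25   # observations
k = 2        # dummy variables in the model

ss_total = ss_regression + ss_residual
df_residual = n_obs - k - 1

r_square = ss_regression / ss_total
# Adjusted R-square penalizes R-square for the number of predictors.
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual
f_stat = (ss_regression / k) / (ss_residual / df_residual)

print(round(r_square, 3), round(adj_r_square, 3), round(f_stat, 3))
```

Adjusted R² comes out marginally below zero, which Excel displays as 0.000, and F is just under 1, meaning the two dummies explain no more variation than would be expected by chance.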

The t-test for the dummy coefficients
            Coeff       Std Error    t Stat   P-value
Intercept   176940.00    94618.185   1.870    0.075
Excellent   173310.00   128113.628   1.353    0.190
Good        128641.29   110226.850   1.167    0.256

Ŷ = 176940 + 173310(X1) + 128641(X2)
The t-tests for X1 (Excellent) and X2 (Good) test whether each is significantly different from “Fair”

Hypothesis Test for a slope coefficient for Excellent
Null hypothesis     H0: β1 = 0
Alternative         Ha: β1 ≠ 0 two-tailed test
Assumptions         small sample, normal
Test Statistic      t* = (173310)/128113
Calculation         t* = 1.353
P-value             P = .190
Conclusion          Can’t reject H0: β1 = 0
Based on our model and t-test, the condition of the apartment makes no difference to Price!

A multivariate Example
This focuses on whether female mid-level managers have lower salaries than males.
The data set contains the following variables for 220 mid-level managers of firms (we will only focus on these four variables):
SALARY      Dependent Variable. Base annual salary in $1,000s
SEX         1 = Female; 0 = Male
POSITION    An index of the position of the employee in the firm, based on the number of employees supervised, size of budget and so forth. A higher number means higher level in the company
YEARS EXP   The number of years of experience in the company

Difference of Means Test
t-Test: Two-Sample Assuming Equal Variances
                               Males    Females
Mean                           144.11   140.47
Variance                       153.61   156.14
Observations                   145      75
Pooled Variance                154.47
Hypothesized Mean Difference   0
df                             218
t Stat                         2.061
P(T<=t) one-tail               0.020
t Critical one-tail            1.652
P(T<=t) two-tail               0.040
t Critical two-tail            1.971
Conclusion: We have evidence that males earn more
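
The pooled-variance t statistic can be rebuilt from the three summary rows of the table. This sketch uses the rounded means and variances printed above, so the result differs from Excel's value in the third decimal.

```python
import math

# Summary statistics from the t-test table (salary in $1,000s).
n_m, mean_m, var_m = 145, 144.11, 153.61   # males
n_f, mean_f, var_f = 75, 140.47, 156.14    # females

df = n_m + n_f - 2
# Pooled variance: the two sample variances weighted by their degrees of freedom.
pooled_var = ((n_m - 1) * var_m + (n_f - 1) * var_f) / df
se_diff = math.sqrt(pooled_var * (1 / n_m + 1 / n_f))
t_stat = (mean_m - mean_f) / se_diff

print(df, round(pooled_var, 2), round(t_stat, 2))
```

The degrees of freedom and pooled variance reproduce the table, and the t statistic is close to the reported 2.061.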

Dummy Variable Regression gives the same result
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.1383
R Square             0.0191
Adjusted R Square    0.0146
Standard Error       12.4287
Observations         220

ANOVA
             df    SS           MS         F        Sig F
Regression     1   656.2761     656.2761   4.2485   0.0405
Residual     218   33674.9011   154.4720
Total        219   34331.1773

            Coeff      Std Error   t Stat     P-value
Intercept   144.1103   1.0321      139.6221   0.0000
Sex          -3.6437   1.7678       -2.0612   0.0405

Multiple Regression Output???
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.8437
R Square             0.7118
Adjusted R Square    0.7078
Standard Error       6.7676
Observations         220

ANOVA
             df    SS           MS          F          Signif F
Regression     3   24438.1513   8146.0504   177.8573   0.0000
Residual     216   9893.0260    45.8010
Total        219   34331.1773

             Coeff.     Std Error   t Stat    P-value
Intercept    113.0608   1.6668      67.8307   0.0000
Sex            2.2013   1.0804       2.0375   0.0428
Position       6.7101   0.3126      21.4645   0.0000
YearsExper    -0.4725   0.1133      -4.1691   0.0000
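
The one-dummy regression (Salary on Sex alone) can be tied back to the difference of means test with a little arithmetic. The sketch below uses only the coefficients printed in that output; it relies on the standard fact that, with a single predictor, the regression F equals the square of the slope's t statistic.

```python
# Values from the regression of Salary on Sex alone (Sex: 1 = Female, 0 = Male).
intercept = 144.1103
sex_coef = -3.6437
sex_se = 1.7678

male_mean = intercept               # Sex = 0
female_mean = intercept + sex_coef  # Sex = 1

t_sex = sex_coef / sex_se
f_from_t = t_sex ** 2  # with one predictor, F = t squared

print(round(male_mean, 2), round(female_mean, 2))
print(round(t_sex, 3), round(f_from_t, 2))
```

The implied group means match the Males and Females means in the t-test table, the t statistic has the same magnitude as the two-sample t Stat (the sign flips because females are coded 1), and its square is close to the regression F of 4.2485.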