Multiple Regression
1. Aim of analysis
2. Estimation of coefficients
3. Specification testing
4. Check assumptions
5. Validation
6. Draw conclusions
• Faculty of Economics
• Institute of Economic Theory and Methodology
Data
Observations: Schools in Massachusetts (US)
Number of observations: 220
Source: Stock, James H. and Mark W. Watson (2003) Introduction to Econometrics, Addison-Wesley Educational Publishers
Variables:
• District: Name of district (coded)
• Municipality: name
• Spending per pupil, regular: thousand $
• Spending per pupil, special needs: thousand $
• Spending per pupil, bilingual: thousand $
• Spending per pupil, occupational: thousand $
• Spending per pupil, total: thousand $
• Students per computer
• Share of special education students (%)
• Share of students receiving lunch subsidy (%)
• Students per teacher
• Average district per capita income: thousand $
• 4th grade score (math + English + science)
• 8th grade score (math + English + science)
• Average teacher salary: thousand $
• Share of English learners (%)
SPSS
Exploratory data analysis
Getting to know the variables
Multivariate regression
Population equation:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
Where,
$\beta_0$: constant (intercept). If $X_1 = X_2 = \dots = X_p = 0$, then $Y = \beta_0 + \varepsilon$.
$\beta_j$ ($j = 1, \dots, p$): slope coefficient of $X_j$.
It shows the effect on $Y$ (the expected change in $Y$) of a unit change in $X_j$, if the other independent variables remain constant (ceteris paribus).
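The estimation of such an equation can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are made up and noise-free, generated from Y = 2 + 3·X1 − 1·X2, and the helper names `solve` and `ols` are hypothetical. It solves the normal equations (X'X)b = X'Y, which is the textbook OLS estimator, not a reproduction of what SPSS does internally.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(xs, y):
    """Return [b0, b1, ..., bp], the coefficients minimising the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in xs]  # prepend the constant term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up, noise-free data generated from Y = 2 + 3*X1 - 1*X2
xs = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in xs]
b0, b1, b2 = ols(xs, y)
```

Because the example data contain no error term, the estimates coincide with the generating coefficients up to rounding; with real data they would only approximate the population values.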
1. Aim of analysis
• Measure the effect of the following variables on the 8th grade score (Y = totsc8):
– Students per teacher (X1 = tchratio)
– Spending per pupil, regular (X2 = regday)
– Spending per pupil, total (X3 = totday)
– Average teacher salary (X4 = avgsalary)
– Share of special education students (%) (X5 = speced)
– Average district per capita income (X6 = percap)
• Defining the specification:
Population equation (2):
2. Estimation of coefficients
• For validation, split the full sample into 2 parts before the estimation:
– 70%: training part, on which we estimate the model
– 30%: test part, used to validate the results
• The split is already prepared in this dataset: variable partition
Transform/Compute Variable
RV.BINOM(1,0.7) -> the new variable's value will be 1 with 70% probability and 0 with 30% probability
Value labels:
1 – Training (≈70%)
0 – Test (≈30%)
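The same random split can be sketched outside SPSS. The snippet below is an illustration only: it draws one Bernoulli(0.7) value per case, which is what RV.BINOM(1,0.7) does; the seed is arbitrary and the variable names are hypothetical.

```python
import random

random.seed(42)  # arbitrary seed, used only to make the example reproducible
n_obs = 220      # number of observations stated in the data description

# 1 = training with probability 0.7, 0 = test otherwise (one Bernoulli draw per case)
partition = [1 if random.random() < 0.7 else 0 for _ in range(n_obs)]

train_share = sum(partition) / n_obs  # close to, but usually not exactly, 0.7
```

Since each case is assigned independently, the realised training share only approximates 70%, which is why the value labels carry the "≈" sign.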
Analyze/Regression/Linear
3. Specification testing
Adjusted $R^2$: the independent variables (X) explain 69.2% of the variance of the dependent variable (Y).
$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
$H_1: \exists\, j: \beta_j \neq 0$
Sig < 5%, thus we can reject $H_0$: at least one $\beta_j$ is not equal to 0. Therefore there is a relationship between the dependent variable and the independent variables.
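How the adjusted R² and the F statistic of this global test follow from the ordinary R² can be sketched as below. The input values (R² = 0.70, n = 150 training cases, p = 6 regressors) are hypothetical and are not taken from the SPSS output; only the formulas are standard.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalises R-squared for the number of regressors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """F statistic of H0: beta_1 = ... = beta_p = 0, with (p, n - p - 1) degrees of freedom."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# Hypothetical inputs, chosen only for illustration
r2, n, p = 0.70, 150, 6
adj = adjusted_r2(r2, n, p)   # roughly 0.687
f = f_statistic(r2, n, p)     # roughly 55.6
```

The Sig. value reported by SPSS is the probability of observing an F statistic at least this large under H0.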
4. Check assumptions
• Assumptions of the error term
• Assumptions of the independent variables
Testing autocorrelation
Inspection of homoskedasticity
$E(\varepsilon \mid X = x) = 0$
• The conditional distribution of the error term given X has a mean of zero.
• In most cases you can justify it by logical reasoning.
• If we estimate the coefficients with OLS (with a constant in the model), the average residual will be 0.
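The last point can be checked numerically. The sketch below fits a simple regression with a constant on made-up numbers (they are not the Massachusetts data) and averages the residuals; `simple_ols` is a hypothetical helper name.

```python
def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


# Made-up values: students per teacher and test score
x = [12.0, 14.5, 16.0, 17.5, 19.0, 21.0]
y = [712.0, 705.0, 699.0, 701.0, 690.0, 684.0]

b0, b1 = simple_ols(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)  # zero up to rounding error
```

A zero average residual is therefore a by-product of OLS with a constant and does not by itself prove that the assumption on the error term holds.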
$\mathrm{Var}(\varepsilon) = \sigma^2$
Transform/Compute Variable…
Variable label: Log of Average district per capita income
Analyze/Regression/Linear
$(X_i, Y_i),\ i = 1, \dots, n$, are i.i.d.
• The error term is uncorrelated across observations.
• If the observations are independent and identically distributed, as is typical for cross-sectional data, then this assumption is automatically met.
• With time series data, autocorrelation can cause bigger problems.
• Test:
– Plots
• We plot the residuals against time or the order of observations on a scatter plot.
– Durbin-Watson test
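$H_0: \rho = 0$ (no autocorrelation)
$H_1: \rho \neq 0$ (first-order autocorrelation)
Decision ranges of the Durbin-Watson statistic d:
– 0 ≤ d < dL: positive autocorrelation (assumption violated)
– dL ≤ d ≤ dU: inconclusive
– dU < d < 4 − dU: keep H0
– 4 − dU ≤ d ≤ 4 − dL: inconclusive
– 4 − dL < d ≤ 4: negative autocorrelation (assumption violated)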
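The Durbin-Watson statistic itself is easy to compute from the ordered residuals: d = Σ(e_t − e_{t−1})² / Σ e_t². The sketch below uses two made-up residual series to show the two extremes; the critical values dL and dU still have to be looked up in a table for the given n and number of regressors.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Made-up residual series, in observation order
alternating = [1, -1, 1, -1, 1, -1]   # sign flips every step: negative autocorrelation
trending = [3, 2, 1, -1, -2, -3]      # neighbours are similar: positive autocorrelation

d_alt = durbin_watson(alternating)    # 20 / 6, near the upper end of the 0..4 scale
d_trend = durbin_watson(trending)     # 8 / 28, near the lower end of the 0..4 scale
```

Values close to 2 correspond to uncorrelated residuals, which is the range where H0 is kept.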
$\varepsilon \sim N(0, \sigma^2)$
• The error term follows a normal distribution; we check this on the residuals.
• Check:
– Graphs:
• Histogram
• P-P Plot
– Indicators:
• Skewness
• Kurtosis
– Significance test:
• Kolmogorov-Smirnov test
• Shapiro-Wilk test
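The two indicators can be sketched with the plain moment formulas below. This is an assumption-laden illustration: SPSS reports sample-adjusted versions of skewness and kurtosis, so its numbers differ slightly from these, and the data are made up.

```python
def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both are 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Made-up residuals
symmetric = [-2, -1, 0, 1, 2]
right_skewed = [-1, -1, -1, -1, 4]   # one large positive residual

skew_sym, kurt_sym = skewness_kurtosis(symmetric)
skew_right, kurt_right = skewness_kurtosis(right_skewed)
```

A skewness far from 0 (as in the second series) signals an asymmetric residual distribution, often caused by a few extreme cases.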
Visual check
Significance test
Analyze/Regression/Linear
Analyze/Descriptive Statistics/Explore…
This may be because there are outliers in the dataset.
Detecting outliers
• Graphs
• Mahalanobis distance
– A measure of how much a case's values on the independent variables differ
from the average of all cases. A large Mahalanobis distance identifies a case as
having extreme values on one or more of the independent variables.
• Checking Cook’s distance and leverage value
– Cook’s distance: A measure of how much the residuals of all cases would
change if a particular case were excluded from the calculation of the
regression coefficients. A large Cook's D indicates that excluding a case from
computation of the regression statistics changes the coefficients substantially.
With it we can detect the cases that strongly influence the estimated coefficients.
– Leverage values: They measure the influence of a point on the fit of the regression. A value of 0 means no influence on the fit.
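For a regression with a single independent variable, all three measures have short closed forms, sketched below on made-up data with one extreme x value. The function name `influence` is hypothetical. Note one assumption: the code computes the ordinary hat value h, whereas SPSS is understood to save the centered leverage h − 1/n, which is why 0 can mean "no influence" there.

```python
def influence(x, y):
    """Hat values, Cook's distances and squared Mahalanobis distances for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    k = 2                                         # estimated parameters: constant + slope
    s2 = sum(e * e for e in resid) / (n - k)      # residual variance
    hat = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    cook = [e * e / (k * s2) * h / (1 - h) ** 2 for e, h in zip(resid, hat)]
    mahal = [(n - 1) * (h - 1 / n) for h in hat]  # squared Mahalanobis distance of x
    return hat, cook, mahal


# Made-up data: the last case has an extreme x value
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 25.0]
hat, cook, mahal = influence(x, y)
```

The last case gets by far the largest hat value and Mahalanobis distance because its x is far from the mean; whether its Cook's distance is also large depends on how far its y lies from the fitted line.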
Save distances
Analyze/Regression/Linear
If probMD < 0.001, the case is an outlier.
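The slides do not show how probMD is obtained. A common approach, assumed here, is to take the upper-tail probability of the saved (squared) Mahalanobis distance under a chi-square distribution whose degrees of freedom equal the number of independent variables. The sketch below implements that tail probability with the standard library only; `chi2_sf` is a hypothetical helper name and the distances are made up.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    z = x / 2
    if df % 2:                       # odd df: start from df = 1
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:                            # even df: start from df = 2
        a, q = 1.0, math.exp(-z)
    while a < df / 2:                # each step raises df by 2
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


# Made-up Mahalanobis distances for a model with 5 independent variables (assumption)
distances = [3.2, 7.9, 24.6]
prob_md = [chi2_sf(d, 5) for d in distances]
outlier_flags = [p < 0.001 for p in prob_md]   # only the last case is flagged
```

With 5 independent variables the 0.001 threshold corresponds to a distance of about 20.5, so only cases beyond that value are marked as outliers.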
It significantly influences the fit of the model, but the Cook's distance is small.
It unduly influences the model.
Multicollinearity
• It is an undesirable situation when one independent
variable (X) is a linear function of other independent
variables (X).
• It is a kind of redundancy
• Check:
– Multiple coefficient of determination (R2)
– F-test
– VIF-indicator
• Fixing:
– Principal component analysis
– Removing variable
VIF-indicator
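• Formula: $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination of the regression of $X_j$ on the other independent variables
• Limits: $1 \le VIF_j < \infty$
– If $R_j^2 = 0 \rightarrow VIF_j = 1$
The jth independent variable doesn't correlate with the others.
– If $R_j^2 \rightarrow 1$, then $VIF_j \rightarrow \infty$
The jth independent variable is an exact linear combination of other independent variables.
• Rating:
$1 \le VIF \le 2$ → weak multicollinearity
$2 < VIF \le 5$ → strong, disturbing multicollinearity
$5 < VIF$ → very strong, harmful multicollinearity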
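The formula and the rating scale can be sketched as follows. The thresholds are the ones given on this slide; the R² values are made up and the function names are hypothetical.

```python
import math


def vif(r2_j):
    """Variance inflation factor from the R-squared of X_j regressed on the other X's."""
    if r2_j >= 1:
        return math.inf              # exact linear combination of the other regressors
    return 1 / (1 - r2_j)


def rate(vif_value):
    """Rating according to the thresholds used on this slide."""
    if vif_value <= 2:
        return "weak"
    if vif_value <= 5:
        return "strong, disturbing"
    return "very strong, harmful"


# Made-up auxiliary R-squared values
ratings = {r2: rate(vif(r2)) for r2 in (0.0, 0.5, 0.75, 0.9)}
```

For example, an auxiliary R² of 0.75 gives VIF = 4 (strong, disturbing), while 0.9 gives VIF of about 10 (very strong, harmful).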
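$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \dots + \beta_p X_p + \varepsilon$
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \cdot X_2) + \dots + \varepsilon$
$Y = \beta_0 + \beta_1 \ln(X_1) + \dots + \beta_p X_p + \varepsilon$
– $\beta_1$: a 1% change in $X_1$ is associated with a change of approximately $\frac{\beta_1}{100}$ units in $Y$, if the other independent variables remain constant (ceteris paribus).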
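The "divide by 100" rule for a logged regressor is an approximation of β1·ln(1.01). The sketch below compares the two, using the coefficient 59.55 that this deck reports for the logged per capita income; the variable names are hypothetical.

```python
import math

beta1 = 59.55                        # coefficient of ln(percap) reported in this deck

exact = beta1 * math.log(1.01)       # exact change in Y when X1 rises by 1%
approx = beta1 / 100                 # rule-of-thumb value used in the interpretation
gap = approx - exact                 # small, because ln(1.01) is close to 0.01
```

The approximation is good for small percentage changes and deteriorates for large ones, since ln(1 + r) ≈ r only holds for small r.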
$Y = \beta_0 + \beta_1 D_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
Dummy variable
Transform/Recode into Different Variables…
Share of receiving lunch subsidy (%):
0 → Less than 20%
1 → Higher than 20%
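The recoding can be sketched as a one-line rule. The name lnch20 is the one used for this dummy in the deck's final equation; the function name and the shares are made up, and assigning a share of exactly 20% to group 0 is an assumption, because the slide does not say where the boundary case belongs.

```python
def lunch_dummy(share_pct):
    """1 if the share receiving lunch subsidy is higher than 20%, else 0."""
    return 1 if share_pct > 20 else 0   # exactly 20% goes to group 0 (assumption)


# Made-up shares of students receiving lunch subsidy (%)
shares = [5.3, 19.9, 20.0, 35.0]
lnch20 = [lunch_dummy(s) for s in shares]
```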
Analyze/Regression/Linear
New population equation (6):
$totsc8 = \beta_0 + \beta_1 \cdot tchratio + \beta_2 \cdot regday + \beta_3 \cdot speced + \beta_4 \cdot \ln(percap) + \beta_5 \cdot lnch20 + \varepsilon$
5. Validation
I. Save predicted values (Ŷ)
II. Activate test data
III. Compare the observed (Y) and the estimated (predicted) (Ŷ) values
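Step III can be sketched with two common comparison measures: the root mean squared error and the correlation between observed and predicted values. The slides do not state which measure is used, so both are assumptions, and the test-set values below are made up.

```python
import math


def rmse(observed, predicted):
    """Root mean squared prediction error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def correlation(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


# Made-up test-set values (partition = 0)
observed = [700.0, 690.0, 715.0, 680.0]
predicted = [698.0, 693.0, 710.0, 684.0]

test_rmse = rmse(observed, predicted)
test_corr = correlation(observed, predicted)
```

If the squared correlation on the test part is close to the R² obtained on the training part, the model generalises reasonably well to cases it was not estimated on.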
I. Save predicted values (Ŷ)
Analyze/Regression/Linear
Spending per pupil, regular (in th $)   -0.007   -0.006   -0.009   -0.008   -0.009   -0.007
                                        (0.006)  (0.005)  (0.002)  (0.001)  (0.001)  (0.001)
Outlier   Not filtered   Not filtered   Not filtered   Not filtered   Not filtered   Filtered   Filtered
605.54: If the value of every independent variable were 0, then the 8th grade score (Y) would be 605.54 points on average.
−1.99: If the student per teacher ratio (X1) is higher by 1 student, then the 8th grade score (Y) is 1.99 points lower on average, if every other independent variable remains constant (ceteris paribus).
−0.007: If Spending per pupil, regular (X2) is higher by 1 thousand $, then the 8th grade score (Y) is 0.007 points lower on average, if every other independent variable remains constant (ceteris paribus).
−0.63: If the Share of special education students (X3) is 1 percentage point higher, then the 8th grade score (Y) is 0.63 points lower on average, if every other independent variable remains constant (ceteris paribus).
59.55: If the Average district per capita income (X4) is 1% higher, then the 8th grade score (Y) is ≈0.5955 points higher on average, if every other independent variable remains constant (ceteris paribus).
−10.63: In schools where the Share of students receiving lunch subsidy is higher than 20% (X5 = 1), the 8th grade score (Y) is 10.63 points lower on average, if every other independent variable remains constant (ceteris paribus).
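The interpretations can be reproduced by plugging the reported coefficients into the estimated equation. In the sketch below the coefficients are the ones quoted above, while the district values, the dictionary `COEF` and the function `predict_totsc8` are hypothetical and serve only to illustrate the ceteris paribus comparisons.

```python
import math

# Coefficients as quoted in the interpretation above
COEF = {"const": 605.54, "tchratio": -1.99, "regday": -0.007,
        "speced": -0.63, "ln_percap": 59.55, "lnch20": -10.63}


def predict_totsc8(tchratio, regday, speced, percap, lnch20):
    """Predicted 8th grade score from the estimated equation."""
    return (COEF["const"]
            + COEF["tchratio"] * tchratio
            + COEF["regday"] * regday
            + COEF["speced"] * speced
            + COEF["ln_percap"] * math.log(percap)
            + COEF["lnch20"] * lnch20)


# Hypothetical district; only one variable is changed at a time
base = predict_totsc8(17.0, 4.5, 15.0, 18.0, 0)
larger_classes = predict_totsc8(18.0, 4.5, 15.0, 18.0, 0)    # one more student per teacher
richer = predict_totsc8(17.0, 4.5, 15.0, 18.0 * 1.01, 0)     # income 1% higher
subsidised = predict_totsc8(17.0, 4.5, 15.0, 18.0, 1)        # lunch subsidy share above 20%
```

Each difference from `base` equals the corresponding coefficient effect: −1.99 points for larger classes, about +0.59 points for 1% higher income, and −10.63 points for the subsidy dummy.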