Basic Econometrics: TWO-VARIABLE REGRESSION MODEL
TWO-VARIABLE REGRESSION MODEL:
THE PROBLEM OF ESTIMATION
ûi = Yi − Ŷi = Yi − β̂1 − β̂2Xi      (3.1.2)

which shows that the ûi (the residuals) are simply the differences between the actual and the estimated Y values.
Now, given n pairs of observations on Y and X, we would like to determine the SRF in such a manner that it is as close as possible to the actual Y. To this end, we may adopt the following criterion: choose the SRF in such a way that the sum of the residuals Σûi = Σ(Yi − Ŷi) is as small as possible. Although intuitively appealing, this is not a very good criterion, for residuals of opposite sign can cancel in the sum even when they are individually large.
The method of least squares instead chooses the SRF so as to minimize the residual sum of squares Σûi² = Σ(Yi − β̂1 − β̂2Xi)². For a given sample, the method of least squares provides us with unique estimates of β1 and β2 that give the smallest possible value of Σûi². Differentiating Σûi² with respect to β̂1 and β̂2 and setting the results equal to zero yields

ΣYi = nβ̂1 + β̂2ΣXi      (3.1.4)
ΣYiXi = β̂1ΣXi + β̂2ΣXi²      (3.1.5)

where n is the sample size. These simultaneous equations are known as the normal equations.
Solving the normal equations simultaneously, we obtain

β̂2 = (nΣXiYi − ΣXiΣYi) / (nΣXi² − (ΣXi)²) = Σxiyi / Σxi²      (3.1.6)

β̂1 = Ȳ − β̂2X̄      (3.1.7)

where X̄ and Ȳ are the sample means of X and Y and where xi = Xi − X̄ and yi = Yi − Ȳ.
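As a minimal numerical sketch of these formulas, the following Python fragment computes β̂2 and β̂1 in deviation form; the sample data are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical sample data (for illustration only).
X = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
Y = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)

# Deviations from the sample means: xi = Xi - X-bar, yi = Yi - Y-bar.
x = X - X.mean()
y = Y - Y.mean()

beta2_hat = (x * y).sum() / (x ** 2).sum()   # slope, Eq. (3.1.6)
beta1_hat = Y.mean() - beta2_hat * X.mean()  # intercept, Eq. (3.1.7)

print(beta1_hat, beta2_hat)
```

Note that the slope estimate uses only the deviations xi and yi, anticipating the deviation form discussed below.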
The regression line thus obtained has the following properties:
1. It passes through the sample means of Y and X. This fact is obvious from
(3.1.7), for the latter can be written as Ȳ = β̂1 + β̂2X̄,
which is shown diagrammatically in Figure 3.2.
2. The mean value of the estimated Y = Ŷi is equal to the mean value of the actual Y.

As a result, the SRF Yi = β̂1 + β̂2Xi + ûi (2.6.2) can be expressed in a form in which both Y and X appear as deviations from their mean values. Summing (2.6.2) on both sides and dividing by n (noting that Σûi = 0) gives

Ȳ = β̂1 + β̂2X̄      (3.1.12)

which is the same as (3.1.7). Subtracting Eq. (3.1.12) from (2.6.2), we obtain

yi = β̂2xi + ûi      (3.1.13)
where yi and xi, following our convention, are deviations from their respective
(sample) mean values.
Equation (3.1.13) is known as the deviation form. Notice that the intercept term β̂1
is no longer present in it. But the intercept term can always be estimated by
(3.1.7), that is, from the fact that the sample regression line passes through the
sample means of Y and X.
An advantage of the deviation form is that it often simplifies computing formulas. In passing, note that in the deviation form, the SRF can be written as

ŷi = β̂2xi      (3.1.14)

whereas in the original units of measurement it is Ŷi = β̂1 + β̂2Xi.
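Continuing the illustrative sketch above, these properties of the fitted line can be verified numerically:

```python
# Fitted values and residuals from the estimates computed earlier.
Y_hat = beta1_hat + beta2_hat * X
u_hat = Y - Y_hat

print(np.isclose(Y_hat.mean(), Y.mean()))  # mean of estimated Y equals mean of actual Y
print(np.isclose(u_hat.sum(), 0.0))        # residuals sum (and hence average) to zero
print(np.isclose((u_hat * x).sum(), 0.0))  # residuals are uncorrelated with x

# Deviation-form SRF (3.1.14): y-hat_i = beta2_hat * xi.
print(np.allclose(Y_hat - Y_hat.mean(), beta2_hat * x))
```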
3.2 THE CLASSICAL LINEAR REGRESSION MODEL: THE ASSUMPTIONS UNDERLYING THE METHOD OF LEAST SQUARES
Assumption 3: Zero mean value of disturbance ui. Given the
value of X, the mean, or expected, value of the random
disturbance term ui is zero. Technically, the conditional mean
value of ui is zero. Symbolically, we have

E(ui | Xi) = 0      (3.2.1)

Assumption 4: Homoscedasticity or equal variance of ui. Given the value of X, the variance of ui is the same for all observations; that is, the conditional variances of ui are identical. Symbolically,

var(ui | Xi) = σ²      (3.2.2)

where var stands for variance. Equation (3.2.2) states that the Y populations corresponding to the various X values have the same (constant) variance, a situation known as homoscedasticity, or equal spread.
In contrast, consider Figure 3.5, where the conditional variance of the Y population
varies with X. This situation is known appropriately as heteroscedasticity, or
unequal spread, or variance. Symbolically, in this situation (3.2.2) can be written as

var(ui | Xi) = σi²      (3.2.3)

Notice the subscript on σ² in Eq. (3.2.3), which indicates that the variance of the Y
population is no longer constant.
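To make the contrast concrete, here is a small simulation sketch; the variance function used for the heteroscedastic case is a hypothetical choice, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
Xh = np.linspace(1.0, 10.0, 200)

# Homoscedastic disturbances: var(u | X) = sigma^2 for every X (Assumption 4).
u_homo = rng.normal(0.0, 1.0, Xh.size)

# Heteroscedastic disturbances, as in (3.2.3): the conditional standard
# deviation grows with X (sigma_i = 0.5 * Xh_i, purely illustrative).
u_hetero = rng.normal(0.0, 0.5 * Xh)

# The spread of u_hetero widens with X, unlike that of u_homo.
print(u_homo[:100].std(), u_homo[100:].std())
print(u_hetero[:100].std(), u_hetero[100:].std())
```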
Assumption 5: No autocorrelation between the disturbances. Given any two X values, Xi and Xj (i ≠ j), the correlation between any two ui and uj is zero. Symbolically,

cov(ui, uj | Xi, Xj) = 0      (3.2.5)

where i and j are two different observations and where cov means covariance.
Technically, this is the assumption of no serial correlation, or no autocorrelation.
This means that, given Xi , the deviations of any two Y values from their mean value
do not exhibit patterns such as those shown in Figure 3.6a and b. In Figure 3.6a, we
see that the u’s are positively correlated, a positive u followed by a positive u or a
negative u followed by a negative u. In Figure 3.6b, the u’s are negatively
correlated, a positive u followed by a negative u and vice versa.
If the disturbances (deviations) follow systematic patterns, such as those shown in
Figure 3.6a and b, there is auto- or serial correlation, and what Assumption 5
requires is that such correlations be absent. Figure 3.6c shows that there is no
systematic pattern to the u’s, thus indicating zero correlation.
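A brief simulation sketch makes the distinction visible; the AR(1) scheme below is a standard textbook device for generating positively correlated disturbances, used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Uncorrelated disturbances: what Assumption 5 requires.
u_iid = rng.normal(0.0, 1.0, T)

# Positively autocorrelated disturbances, as in Figure 3.6a:
# u_t = 0.8 * u_{t-1} + e_t (an AR(1) scheme).
u_ar = np.zeros(T)
e = rng.normal(0.0, 1.0, T)
for t in range(1, T):
    u_ar[t] = 0.8 * u_ar[t - 1] + e[t]

def lag1_corr(u):
    """Sample correlation between successive disturbances."""
    return np.corrcoef(u[:-1], u[1:])[0, 1]

print(lag1_corr(u_iid))  # close to 0: no serial correlation
print(lag1_corr(u_ar))   # close to 0.8: positive serial correlation
```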
Assumption 6 states that the disturbance u and the explanatory variable X are uncorrelated; symbolically, cov(ui, Xi) = 0.
Assumption 9: The regression model is correctly specified. Alternatively,
there is no specification bias or error in the model used in empirical analysis.
Some important questions that arise in the specification of the model include
the following:
(1) What variables should be included in the model?
(2) What is the functional form of the model? Is it linear in the parameters, the
variables, or both?
(3) What are the probabilistic assumptions made about the Yi, the Xi, and the ui
entering the model?
These are extremely important questions, for by omitting important variables
from the model, or by choosing the wrong functional form, or by making wrong
stochastic assumptions about the variables of the model, the validity of interpreting
the estimated regression will be highly questionable.
Our discussion of the assumptions underlying the classical linear regression model is now complete. It is important to note that all these assumptions pertain to the PRF only and not to the SRF. But it is interesting to observe that the method of least squares discussed previously has some properties that are similar to the assumptions we have made about the PRF. For example, the finding that Σûi = 0, and therefore that the mean residual is zero, is akin to the assumption that E(ui | Xi) = 0. Likewise, the finding that ΣûiXi = 0 is akin to the assumption that cov(ui, Xi) = 0. The method of least squares thus tries to “duplicate” some of the assumptions imposed on the PRF.
3.3 PRECISION OR STANDARD ERRORS OF LEAST-SQUARES ESTIMATES
Given the Gaussian assumptions, the standard errors of the OLS estimates can be obtained as follows:

var(β̂2) = σ² / Σxi²      (3.3.1)
se(β̂2) = σ / √(Σxi²)      (3.3.2)
var(β̂1) = (ΣXi² / nΣxi²) σ²      (3.3.3)
se(β̂1) = √(ΣXi² / nΣxi²) σ      (3.3.4)

where var = variance and se = standard error and where σ² is the constant or
homoscedastic variance of ui of Assumption 4.
All the quantities entering into the preceding equations except σ² can be estimated from the data. σ² itself is estimated by the following formula:

σ̂² = Σûi² / (n − 2)      (3.3.5)

where σ̂² is the OLS estimator of the true but unknown σ² and where the expression n − 2 is known as the number of degrees of freedom, Σûi² being the residual sum of squares. Once Σûi² is known, σ̂² can be easily computed. Σûi² itself can be computed either from (3.1.2) or from the following expression:

Σûi² = Σyi² − β̂2²Σxi²      (3.3.6)

Compared with Eq. (3.1.2), Eq. (3.3.6) is easy to use, for it does not require computing ûi for each observation, although such a computation will be useful in its own right.
Since β̂2 = Σxiyi / Σxi², an alternative expression for computing Σûi² is

Σûi² = Σyi² − (Σxiyi)² / Σxi²      (3.3.7)

In passing, note that the positive square root of σ̂²,

σ̂ = √(Σûi² / (n − 2))

is known as the standard error of estimate or the standard error of the regression (se). It is simply the standard deviation of the Y values about the estimated regression line and is often used as a summary measure of the “goodness of fit” of that line.
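Continuing the illustrative sketch, σ̂² and the (estimated) standard errors follow directly, with σ replaced by σ̂ in (3.3.2) and (3.3.4):

```python
# Estimate sigma^2 and the standard errors of the OLS coefficients.
n = len(X)
rss = (u_hat ** 2).sum()             # residual sum of squares
sigma2_hat = rss / (n - 2)           # Eq. (3.3.5)
se_reg = np.sqrt(sigma2_hat)         # standard error of the regression

se_beta2 = np.sqrt(sigma2_hat / (x ** 2).sum())                         # Eq. (3.3.2)
se_beta1 = np.sqrt((X ** 2).sum() * sigma2_hat / (n * (x ** 2).sum()))  # Eq. (3.3.4)

# Eq. (3.3.6): the same RSS without computing each residual.
rss_alt = (y ** 2).sum() - beta2_hat ** 2 * (x ** 2).sum()
print(np.isclose(rss, rss_alt))
print(se_reg, se_beta1, se_beta2)
```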
3.4 PROPERTIES OF LEAST-SQUARES ESTIMATORS: THE GAUSS–MARKOV THEOREM
Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of unbiased linear estimators, have minimum variance; that is, they are BLUE (best linear unbiased estimators).
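A formal proof is beyond the scope of this summary, but the unbiasedness part of the theorem is easy to illustrate by simulation; the design and parameter values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
Xm = np.linspace(0.0, 10.0, 50)   # regressors held fixed in repeated samples
xm = Xm - Xm.mean()
beta1_true, beta2_true, sigma = 2.0, 0.5, 1.0

# Draw many samples from the PRF and re-estimate the slope each time.
slopes = []
for _ in range(5000):
    Ym = beta1_true + beta2_true * Xm + rng.normal(0.0, sigma, Xm.size)
    ym = Ym - Ym.mean()
    slopes.append((xm * ym).sum() / (xm ** 2).sum())

print(np.mean(slopes))  # close to 0.5, illustrating E(beta2_hat) = beta2
```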
3.5 THE COEFFICIENT OF DETERMINATION r²: A MEASURE OF “GOODNESS OF FIT”
To compute this r², we proceed as follows: recall that

Yi = Ŷi + ûi

or, in the deviation form,

yi = ŷi + ûi      (3.5.1)

Squaring (3.5.1) on both sides and summing over the sample, we obtain

Σyi² = Σŷi² + Σûi² = β̂2²Σxi² + Σûi²      (3.5.2)

since Σŷiûi = 0 and ŷi = β̂2xi. In words: total sum of squares (TSS = Σyi²) = explained sum of squares (ESS = Σŷi²) + residual sum of squares (RSS = Σûi²), that is,

TSS = ESS + RSS      (3.5.3)

Now dividing (3.5.3) by TSS on both sides, we obtain

1 = ESS/TSS + RSS/TSS      (3.5.4)

We now define r² as

r² = ESS/TSS      (3.5.5)

or, alternatively, as

r² = 1 − RSS/TSS      (3.5.5a)
The quantity r² thus defined is known as the (sample) coefficient of determination and is the most commonly used measure of the goodness of fit of a regression line. Verbally, r² measures the proportion or percentage of the total variation in Y explained by the regression model.
Although r² can be computed directly from its definition given in (3.5.5), it can be obtained more quickly from the following formula:

r² = ESS/TSS = Σŷi² / Σyi² = β̂2² (Σxi² / Σyi²)      (3.5.6)
If we divide the numerator and the denominator of (3.5.6) by the sample size n (or n − 1 if the sample size is small), we obtain

r² = β̂2² (Sx² / Sy²)      (3.5.7)

where Sx² and Sy² are the sample variances of X and Y, respectively.
Therefore, we can write

r = ±√r² = Σxiyi / √(Σxi² Σyi²)

which is known as the sample correlation coefficient.
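Continuing the illustrative sketch, r² and r can be computed from the quantities already in hand:

```python
# Coefficient of determination and sample correlation coefficient.
tss = (y ** 2).sum()                     # total sum of squares
ess = beta2_hat ** 2 * (x ** 2).sum()    # explained sum of squares
r2 = ess / tss                           # Eq. (3.5.5)
r2_alt = 1.0 - rss / tss                 # Eq. (3.5.5a), the same value

r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
print(r2, r2_alt, r ** 2)  # r**2 reproduces r2 in the two-variable model
```

The sign of r is that of β̂2, since the slope and the correlation coefficient share the numerator Σxiyi.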