Addis Ababa University
College of Business and Economics
Department of Economics
Econ 607: Econometrics I
6. Discrete Choice Models
Fantu Guta Chemrie (PhD)
F. Guta (CoBE) CoBE
Econ 607 February, 2019 1 / 139
6. Discrete Choice Models (3 weeks)
6.1 Binary Choice Models: Estimation and Inference
6.1.1 Linear Probability Model
6.1.2 Probit and Logit Models
6.1.3 Reporting the Results for Probit and Logit Models
6.2 Multi-Response Models
6.2.1 Multinomial Logit Model
6.2.2 Conditional Logit Model
6.2.3 Multinomial Probit Model
6.2.4 Nested Logit Model
6.2.5 Ordered Response Model
6.3 Censored and Truncated Regression Models
6.3.1 Tobit (Censored Regression) Model
6.3.2 Truncated Regression Model
6.3.3 Heckman’s Two Steps Sample Selection
6. Discrete Choice Models
By discrete regression models we mean those
models in which the dependent variable assumes
discrete values.
The simplest of these models is that in which the
dependent variable y is binary (it can take on only
two values, say, 0 and 1).
Things are more complicated when the dependent
variable y can assume more than two values. Then
we have to classify the cases into categorical and
noncategorical variables.
An example of a noncategorical variable is
encountered when y denotes the number of patents
issued to a company during a year. Here y assumes
values of 0, 1, 2, . . . , but is not a categorical
variable. However, it is a discrete variable.
Categorical variables can further be classified as
unordered, sequential and ordered variables.
An example of an unordered categorical variable is:
y = 1 if the mode of transport is car
y = 2 if the mode of transport is bus
y = 3 if the mode of transport is train
An example of a sequential categorical variable is:
y = 1 if the individual has not completed high school
y = 2 if the individual has completed high school but not college
y = 3 if the individual has completed college but not a higher degree
y = 4 if the individual has a professional degree
An example of an ordered categorical variable is:
y = 1 if the individual spends less than $1,000
y = 2 if the individual spends more than $1,000 but less than $3,000
y = 3 if the individual spends more than $3,000 but less than $5,000
y = 4 if the individual spends more than $5,000
6.1 Binary Choice Models: Estimation and Inference
We model a response of the form:
\[
y_i = \begin{cases} 1 & \text{if the response of the $i$th individual is Yes} \\ 0 & \text{if the response of the $i$th individual is No} \end{cases}
\]
Review of Bernoulli Trials: On each trial we have
Yes with probability θ and No with probability 1 - θ,
and the trials are independent.
Generalization of interest: $\theta_i = \theta(x_i'\beta)$, where $x_i'\beta$ is the
index for the ith individual.
\[
E(y_i) = 1 \cdot \theta + 0 \cdot (1 - \theta) = \theta
\]
\[
\begin{aligned}
\operatorname{var}(y_i) &= E(y_i^2) - [E(y_i)]^2 \\
&= 1^2 \cdot \theta + 0^2 \cdot (1 - \theta) - \theta^2 \\
&= \theta(1 - \theta)
\end{aligned}
\]
Inference: The proportion of successes in n independent
Bernoulli trials, $\frac{1}{n}\sum_{i=1}^{n} y_i$, is the minimum variance
unbiased estimator of θ, and also the MLE of θ.
The likelihood function:
\[
\begin{aligned}
L(\theta; y_1, y_2, \ldots, y_n) &= \prod_{i=1}^{n} \theta^{y_i}(1 - \theta)^{1 - y_i} \\
&= \theta^{\sum_{i=1}^{n} y_i}(1 - \theta)^{n - \sum_{i=1}^{n} y_i} \\
&= \theta^{x}(1 - \theta)^{n - x} \quad \text{if } x = \sum_{i=1}^{n} y_i
\end{aligned}
\]
\[
\ln L = \Big(\sum_{i=1}^{n} y_i\Big) \ln \theta + \Big(n - \sum_{i=1}^{n} y_i\Big) \ln(1 - \theta)
\]
Now, it can easily be shown that the MLE of θ is
\[
\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i .
\]
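As a quick numerical check of this result, the following minimal Python sketch evaluates the Bernoulli log-likelihood on a grid and compares the grid maximiser with the sample proportion. The data vector `y` and the grid resolution are illustrative assumptions, not part of the lecture.

```python
import math

def bernoulli_loglik(theta, y):
    """ln L = (sum y_i) ln(theta) + (n - sum y_i) ln(1 - theta)."""
    s = sum(y)
    n = len(y)
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

# Hypothetical sample of n = 10 Yes/No responses (illustrative data only).
y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# Closed-form MLE: the sample proportion of successes.
theta_hat = sum(y) / len(y)

# Grid search over the open interval (0, 1) to locate the maximiser numerically.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_loglik(t, y))

print(theta_hat, theta_grid)
```

Because the log-likelihood is concave in θ, the grid point closest to the sample proportion attains the largest value.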
Allow θ to be non-constant, i.e., $\theta_i = \theta(x_i'\beta)$.
The three popular specifications for this relationship:
1). $\theta_i = x_i'\beta$ : Linear Probability Model, $0 \le \theta_i \le 1$
2). $\theta_i = \Phi(x_i'\beta)$ : a probit model if Φ is the
distribution function of N(0, 1).
3). $\theta_i = F(x_i'\beta)$ : a logit model if F is the logistic
distribution function.
6.1.1 Linear Probability Model
The term linear probability model is used to denote
a regression model in which the dependent variable
y is a binary variable taking the values of 1 if the
event occurs and 0 otherwise.
Examples include participation in the labour force,
the decision to marry, etc. We write the model in
the usual regression framework as:
\[
y_i = x_i'\beta + u_i \tag{6.1.1}
\]
The calculated value of y from the regression
equation, $\hat{y}_i = x_i'\hat{\beta}$, will then give the estimated
probability that the event will occur given the
particular value of x.
In practice, these estimated probabilities can be
outside the admissible range (0, 1).
Because $y_i$ takes the value of 1 or 0, the residuals in
equation (6.1.1) can take only two values: $1 - x_i'\beta$
and $-x_i'\beta$.
Thus we have
$u_i$                $f(u_i)$
$1 - x_i'\beta$      $x_i'\beta$
$-x_i'\beta$         $1 - x_i'\beta$
Hence
\[
\begin{aligned}
\operatorname{Var}(u_i) &= (x_i'\beta)(1 - x_i'\beta)^2 + (1 - x_i'\beta)(x_i'\beta)^2 \\
&= (x_i'\beta)(1 - x_i'\beta) \\
&= E(y_i)[1 - E(y_i)]
\end{aligned}
\]
Because of this heteroscedasticity problem, the OLS
estimate of β from (6.1.1) will not be efficient.
Goldberger (1964, p. 250) suggested the following
procedure: First, estimate (6.1.1) by OLS. Next,
compute $\hat{y}_i(1 - \hat{y}_i)$ and use weighted least squares;
i.e., defining
\[
w_i = [\hat{y}_i(1 - \hat{y}_i)]^{1/2}
\]
we regress $y_i/w_i$ on $x_i/w_i$.
The problems with this procedure are the following:
1). In practice, $\hat{y}_i(1 - \hat{y}_i)$ may be negative, although in
large samples there is a very small probability that
this will be so.
2). Because the residuals $u_i$ are not normally
distributed, the least squares method is not, in
general, fully efficient.
3). The most important criticism is that $E(y \mid x)$,
interpreted as the probability that the event will
occur, can lie outside the limits (0, 1).
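The two-step procedure attributed to Goldberger above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical, the model has a constant and one regressor, the helper `fit_two_columns` is a name introduced here, and observations with fitted values outside (0, 1) are simply dropped, which is one ad hoc way of handling problem 1).

```python
import math

def fit_two_columns(a, b, z):
    """Least squares of z on the two columns a and b (no additional intercept)."""
    saa = sum(ai * ai for ai in a)
    sab = sum(ai * bi for ai, bi in zip(a, b))
    sbb = sum(bi * bi for bi in b)
    saz = sum(ai * zi for ai, zi in zip(a, z))
    sbz = sum(bi * zi for bi, zi in zip(b, z))
    det = saa * sbb - sab * sab
    return ((sbb * saz - sab * sbz) / det, (saa * sbz - sab * saz) / det)

# Hypothetical data: one regressor x and a binary outcome y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

# Step 1: OLS of y on (1, x) gives the fitted probabilities.
b_ols = fit_two_columns([1.0] * len(x), x, y)
y_hat = [b_ols[0] + b_ols[1] * xi for xi in x]

# Assumption of this sketch: keep only observations with 0 < y_hat < 1,
# because w_i is not a positive real number otherwise.
keep = [i for i, f in enumerate(y_hat) if 0.0 < f < 1.0]
w = {i: math.sqrt(y_hat[i] * (1.0 - y_hat[i])) for i in keep}

# Step 2: regress y_i / w_i on x_i / w_i, where x_i includes the constant.
b_wls = fit_two_columns([1.0 / w[i] for i in keep],
                        [x[i] / w[i] for i in keep],
                        [y[i] / w[i] for i in keep])

print(b_ols, b_wls)
```

Dividing the constant as well as the regressor by $w_i$ is what makes the second regression a weighted least squares fit of the original equation.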
6.1.2 Probit and Logit Models
An alternative approach, called by Goldberger
(1964) the Probit analysis model, is to assume that
there is an underlying response variable $y_i^*$ defined
by the regression relationship
\[
y_i^* = x_i'\beta + u_i \tag{6.1.2}
\]
In practice, $y_i^*$ is unobservable. What we observe is
a dummy variable $y_i$ defined by
\[
y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6.1.3}
\]
In this formulation, $x_i'\beta$ is not $E(y_i \mid x_i)$ as in the
linear probability model; it is $E(y_i^* \mid x_i)$.
From the relations (6.1.2) and (6.1.3) we get
\[
\begin{aligned}
\operatorname{Prob}(y_i = 1) &= \operatorname{Prob}(u_i > -x_i'\beta) \\
&= 1 - F(-x_i'\beta)
\end{aligned} \tag{6.1.4}
\]
where F is the cumulative distribution function for
u.
Hence the likelihood function is
\[
L = \prod_{y_i = 0} F(-x_i'\beta) \prod_{y_i = 1} [1 - F(-x_i'\beta)] \tag{6.1.5}
\]
The functional form of F in (6.1.5) will depend on
the assumptions made about $u_i$ in (6.1.2).
If the cumulative distribution of $u_i$ is logistic, we
have the logit model. In this case,
\[
F(-x_i'\beta) = \frac{\exp(-x_i'\beta)}{1 + \exp(-x_i'\beta)} = \frac{1}{1 + \exp(x_i'\beta)}
\]
Hence,
\[
1 - F(-x_i'\beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} \tag{6.1.6}
\]
In this case we say that there is a closed-form
expression of F , because it does not involve
integrals explicitly. Not all distributions permit such
a closed-form expression.
For instance, in the probit model (or, more
accurately, the normal model) we assume that the $u_i$ are
IN(0, σ²). In this case,
\[
F(-x_i'\beta) = \int_{-\infty}^{-x_i'\beta/\sigma} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt \tag{6.1.7}
\]
It can easily be seen from (6.1.7) and the likelihood
function (6.1.5) that we can estimate only β/σ,
and not β and σ separately. Hence we might
assume σ = 1 to start with.
Because the cumulative normal distribution and the
logistic distribution are very close to each other,
except at the tails, we are not likely to get very
different results using the logit or the probit method.
However, the estimates of β from the two methods
are not directly comparable.
Because the logistic distribution has a variance of
$\pi^2/3$, the estimates of β from the logit model have
to be multiplied by $\sqrt{3}/\pi$ to be comparable to the
estimates obtained from the probit model
(where we normalize σ = 1).
Amemiya (1981) suggested that the logit estimates
be multiplied by 1/1.6 = 0.625, instead of $\sqrt{3}/\pi$,
saying that this transformation produces a closer
approximation between the logistic distribution and
the distribution function of the standard normal.
He also suggested that the coefficients of the linear
probability model, $\hat{\beta}_{LP}$, and the coefficients of the
logit model, $\hat{\beta}_{L}$, are related by the relationships
\[
\hat{\beta}_{LP} \simeq 0.25\,\hat{\beta}_{L} \quad \text{except for the constant term}
\]
\[
\hat{\beta}_{LP} \simeq 0.25\,\hat{\beta}_{L} + 0.5 \quad \text{for the constant term}
\]
Thus, if we need to make $\hat{\beta}_{LP}$ comparable to the
probit coefficients, we need to multiply them by 2.5
and subtract 1.25 from the constant term.
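The two scaling rules can be compared numerically. The sketch below, which uses only Python's standard library, measures the largest gap between the standard normal distribution function Φ(z) and the logistic distribution function evaluated at a rescaled argument, once with the factor 1.6 and once with $\pi/\sqrt{3}$. The grid on [-4, 4] and the function names are assumptions made for illustration.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluation grid on [-4, 4] in steps of 0.01 (an arbitrary illustrative choice).
grid = [k / 100 for k in range(-400, 401)]

def max_gap(scale):
    """Largest absolute difference between logistic_cdf(scale * z) and norm_cdf(z)."""
    return max(abs(logistic_cdf(scale * z) - norm_cdf(z)) for z in grid)

gap_amemiya = max_gap(1.6)                          # probit index scaled by 1.6
gap_variance = max_gap(math.pi / math.sqrt(3.0))    # variance-matching factor

print(gap_amemiya, gap_variance)
```

On this grid the factor 1.6 gives the smaller maximum gap, which is in line with the reason attributed to Amemiya (1981) above.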
An alternative way of comparing the models would
be to
a). Calculate the sum of squared deviations from the
predicted probabilities,
b). Compare the percentages correctly predicted, and
c). Look at the derivatives of the probabilities with
respect to a particular independent variable.
Let $x_{ik}$ be the kth element of the vector of
explanatory variables $x_i$, and let $\beta_k$ be the kth
element of β. Then the derivatives for the
probabilities given by the linear probability, probit
and logit models are, respectively,
\[
\frac{\partial}{\partial x_{ik}} [x_i'\beta] = \beta_k
\]
\[
\frac{\partial}{\partial x_{ik}} \Phi(x_i'\beta) = \phi(x_i'\beta)\,\beta_k
\]
\[
\frac{\partial}{\partial x_{ik}} L(x_i'\beta) = \frac{\exp(x_i'\beta)}{[1 + \exp(x_i'\beta)]^2}\,\beta_k
\]
These derivatives will be needed for predicting the
effects of changes in one of the independent
variables on the probability of belonging to a group.
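A minimal Python sketch of these three derivatives is given below. The function names and the convention of passing the index $x_i'\beta$ as a single number are assumptions made for illustration.

```python
import math

def lpm_effect(index, beta_k):
    """Linear probability model: the derivative is beta_k at every index value."""
    return beta_k

def probit_effect(index, beta_k):
    """Probit: standard normal density at the index, times beta_k."""
    phi = math.exp(-0.5 * index ** 2) / math.sqrt(2.0 * math.pi)
    return phi * beta_k

def logit_effect(index, beta_k):
    """Logit: exp(index) / [1 + exp(index)]^2, times beta_k."""
    e = math.exp(index)
    return e / (1.0 + e) ** 2 * beta_k

# Effects of a unit coefficient evaluated at an index of zero.
print(lpm_effect(0.0, 1.0), probit_effect(0.0, 1.0), logit_effect(0.0, 1.0))
```

At an index of zero the logit derivative equals 0.25 times the coefficient, which is consistent with the 0.25 factor relating the linear probability and logit coefficients.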
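The likelihood function (6.1.5) can be written as
\[
L = \prod_{i=1}^{n} \left( \frac{1}{1 + \exp(x_i'\beta)} \right)^{1 - y_i} \left( \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} \right)^{y_i} \tag{6.1.8}
\]
\[
= \frac{\exp\left(\beta' \sum_{i=1}^{n} x_i y_i\right)}{\prod_{i=1}^{n} [1 + \exp(\beta' x_i)]} \tag{6.1.9}
\]
Define $t = \sum_{i=1}^{n} x_i y_i$. To find the
maximum-likelihood (ML) estimate of β, we have
\[
\ln L = \beta' t - \sum_{i=1}^{n} \ln[1 + \exp(\beta' x_i)]
\]
Hence, $\partial \ln L / \partial \beta = 0$ gives
\[
S(\beta) = -\sum_{i=1}^{n} \frac{\exp(\beta' x_i)}{1 + \exp(\beta' x_i)}\, x_i + t = 0 \tag{6.1.10}
\]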
These equations are nonlinear in β. Hence, we have
to use the Newton-Raphson method or the scoring
method to solve the equations.
The information matrix is
\[
I(\beta) = E\left[ -\frac{\partial^2 \ln L}{\partial \beta\, \partial \beta'} \right]
= \sum_{i=1}^{n} \frac{\exp(\beta' x_i)}{[1 + \exp(\beta' x_i)]^2}\, x_i x_i' \tag{6.1.11}
\]
Starting with some initial value of β, say $\beta_0$,
compute the values of $S(\beta_0)$ and $I(\beta_0)$. Then the
new estimate of β is, by the method of scoring,
\[
\beta_1 = \beta_0 + [I(\beta_0)]^{-1} S(\beta_0)
\]
In practice, we divide both S ( β0 ) and I ( β0 ) by n,
the sample size. This iterative procedure is repeated
until convergence.
I(β) is positive definite at each stage of iteration.
Hence, the iterative procedure will converge to a
maximum of the likelihood function, no matter what the
starting value is.
If the final converged estimates are denoted by $\hat{\beta}$,
then the asymptotic covariance matrix is estimated
by $[I(\hat{\beta})]^{-1}$.
These estimated variances and covariances will
enable us to test hypotheses about the different
elements of β.
After estimating $\hat{\beta}$, we can get estimated values of
the probability that the ith observation is equal to 1.
Denoting these estimated values by $\hat{p}_i$, we have
\[
\hat{p}_i = \frac{\exp(\hat{\beta}' x_i)}{1 + \exp(\hat{\beta}' x_i)}
\]
Equation (6.1.10) shows that
\[
\sum_{i=1}^{n} \hat{p}_i x_i = \sum_{i=1}^{n} y_i x_i \tag{6.1.12}
\]
Thus, if $x_i$ includes a constant term, then the sum
of the estimated probabilities is equal to the number
of observations in the sample for which $y_i = 1$, i.e.,
\[
\sum_{i=1}^{n} \hat{p}_i = \sum_{i=1}^{n} y_i .
\]
In other words, the predicted frequency is equal to
the actual frequency.
Similarly, if xi includes a dummy variable, say 1 for
male, 0 for female, then the predicted frequency will
be equal to the actual frequency for each sex group.
Similar conclusions follow for the linear probability
model by virtue of the fact that (6.1.12) are the
least squares normal equations in that case.
After estimating $\hat{\beta}$ and then $\hat{p}_i$ by the logit model, it
is always good practice to check whether or not
equations (6.1.12) are satisfied.
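The method of scoring for the logit model, together with the check of equations (6.1.12), can be sketched as follows. This is an illustration under stated assumptions rather than a general routine: the model has a constant and a single regressor so that the 2 x 2 information matrix can be inverted explicitly, the data are hypothetical, and the names `score_and_info` and `logit_scoring` are introduced here.

```python
import math

def score_and_info(b0, b1, x, y):
    """Score S(beta), information I(beta) and fitted probabilities for x_i = (1, x_i)'."""
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    s0 = sum(yi - pi for yi, pi in zip(y, p))
    s1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
    wgt = [pi * (1.0 - pi) for pi in p]          # exp(.) / [1 + exp(.)]^2
    i00 = sum(wgt)
    i01 = sum(wi * xi for wi, xi in zip(wgt, x))
    i11 = sum(wi * xi * xi for wi, xi in zip(wgt, x))
    return (s0, s1), (i00, i01, i11), p

def logit_scoring(x, y, tol=1e-10, max_iter=50):
    """Iterate beta_1 = beta_0 + [I(beta_0)]^{-1} S(beta_0) until the step is tiny."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        (s0, s1), (i00, i01, i11), _ = score_and_info(b0, b1, x, y)
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det         # explicit 2x2 inverse times score
        d1 = (i00 * s1 - i01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Estimated asymptotic covariance matrix: inverse information at the final estimate.
    _, (i00, i01, i11), p_hat = score_and_info(b0, b1, x, y)
    det = i00 * i11 - i01 * i01
    cov = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
    return (b0, b1), cov, p_hat

# Hypothetical data (not perfectly separated, so the MLE exists).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
beta_hat, cov, p_hat = logit_scoring(x, y)

# Check (6.1.12): sum of p_hat * x_i equals sum of y_i * x_i, for the constant and for x.
gap_const = sum(p_hat) - sum(y)
gap_x = sum(pi * xi for pi, xi in zip(p_hat, x)) - sum(yi * xi for yi, xi in zip(y, x))
print(beta_hat, gap_const, gap_x)
```

Both gaps should be zero up to rounding error once the iteration has converged. Standard errors are the square roots of the diagonal elements of `cov`.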
Let us denote by φ(·) and Φ(·) the density
function and the distribution function, respectively,
of the standard normal. Then for the probit model
the likelihood function corresponding to (6.1.8) is
\[
L = \prod_{i=1}^{n} [\Phi(\beta' x_i)]^{y_i} [1 - \Phi(\beta' x_i)]^{1 - y_i}
\]
and the log-likelihood is
\[
\log L = \sum_{i=1}^{n} y_i \log \Phi(\beta' x_i) + \sum_{i=1}^{n} [1 - y_i] \log[1 - \Phi(\beta' x_i)]
\]
Differentiating log L with respect to β yields
\[
S(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{n} \frac{y_i - \Phi(\beta' x_i)}{\Phi(\beta' x_i)[1 - \Phi(\beta' x_i)]}\, \phi(\beta' x_i)\, x_i
\]
The ML estimator $\hat{\beta}_{ML}$ can be obtained as a
solution of the equations $S(\hat{\beta}_{ML}) = 0$.
These equations are nonlinear in β, and thus we
have to solve them by an iterative procedure.
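The information matrix is
\[
I(\beta) = E\left[ -\frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} \right]
= \sum_{i=1}^{n} \frac{[\phi(\beta' x_i)]^2}{\Phi(\beta' x_i)[1 - \Phi(\beta' x_i)]}\, x_i x_i'
\]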
As with the logit model, we start with an initial
value of β, say β0 , and compute the values S ( β0 )
and I ( β0 ).
Then the new estimate of β is, by the method of
scoring,
\[
\beta_1 = \beta_0 + [I(\beta_0)]^{-1} S(\beta_0)
\]
Note that I(β) is positive definite at each stage of
iteration.
Hence, the iterative procedure will converge to a
maximum of the likelihood function, no matter what the
starting value is.
If the final converged estimates are denoted by $\hat{\beta}$,
then the asymptotic covariance matrix is estimated
by $[I(\hat{\beta})]^{-1}$.
These can be used to conduct any tests of
significance.
6.1.3 Reporting the Results for Probit and Logit
Several statistics should be reported routinely in any
probit or logit (or other binary choice) analysis.
The $\hat{\beta}_j$, their standard errors, and the value of the
log-likelihood function are reported by software
packages that do binary response analysis.
The $\hat{\beta}_j$ give the signs of the partial effects of the $x_j$
on the response probability, and the statistical
significance of $x_j$ is determined by whether we can
reject $H_0: \beta_j = 0$.
One measure of goodness of fit that is sometimes
reported is the percent correctly predicted.
The easiest way to describe this statistic is to define
a binary predictor of $y_i$ to be one if the predicted
probability is at least 0.5 and zero otherwise.
More precisely, define the binary variable $\tilde{y}_i = 1$ if
$F(x_i\hat{\beta}) \ge 0.5$ and $\tilde{y}_i = 0$ if $F(x_i\hat{\beta}) < 0.5$. Given
$\{\tilde{y}_i : i = 1, 2, \ldots, n\}$, we can see how well $\tilde{y}_i$
predicts $y_i$ across all observations.
There are four possible outcomes on each pair,
(yi , yei ); when both are zero or both are one, we
make the correct prediction.
In the two cases where one of the pair is zero and
the other is one, we make an incorrect prediction. The
percent correctly predicted is the percent of times
that $\tilde{y}_i = y_i$. (This goodness-of-fit measure can be
computed for the linear probability model, too.)
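The percent correctly predicted is easy to compute once fitted probabilities are available. The sketch below uses hypothetical outcomes and fitted probabilities, and a function name introduced here; the `threshold` argument makes it possible to try a cutoff other than 0.5, such as the sample fraction of successes.

```python
def percent_correct(y, p_hat, threshold=0.5):
    """Percent of observations for which the binary predictor equals y_i."""
    y_tilde = [1 if p >= threshold else 0 for p in p_hat]
    hits = sum(1 for yi, yt in zip(y, y_tilde) if yi == yt)
    return 100.0 * hits / len(y)

# Hypothetical outcomes and fitted probabilities (illustrative only).
y = [0, 0, 0, 0, 1, 0, 1, 1]
p_hat = [0.10, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.90]

pcp_half = percent_correct(y, p_hat)                     # threshold 0.5
pcp_mean = percent_correct(y, p_hat, sum(y) / len(y))    # threshold 3/8 = 0.375

print(pcp_half, pcp_mean)
```

With these made-up numbers the two thresholds classify the fourth observation differently, so the two percentages differ.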
Some have criticized the prediction rule described
above for always using a threshold value of 0.5. One
alternative is to use the fraction of success in the
sample as the threshold.
Another possibility is to choose the threshold such
that the fraction of $\tilde{y}_i = 1$ in the sample is the same as
(or very close to) $\bar{y}$. In other words, search over
threshold values τ, 0 < τ < 1, such that if we
define $\tilde{y}_i = 1$ when $F(x_i\hat{\beta}) \ge \tau$, then
\[
\sum_{i=1}^{n} \tilde{y}_i \approx \sum_{i=1}^{n} y_i .
\]
McFadden (1974) suggests a pseudo-R-squared
measure for binary response given by $1 - \ell_{ur}/\ell_0$,
where $\ell_{ur}$ is the log-likelihood function for the
estimated model and $\ell_0$ is the log-likelihood
function in the model with only an intercept.
Because the log-likelihood for a binary response model
is always negative, $|\ell_{ur}| \le |\ell_0|$, and so the
pseudo-R-squared is always between zero and one.
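A minimal sketch of this pseudo-R-squared is shown below. It relies on the fact that the intercept-only model fits every observation with the sample proportion of ones. The fitted probabilities `p_hat` are hypothetical numbers rather than the output of an estimated model, so the inequality between the two log-likelihoods holds here by construction of the example, not as a general property of arbitrary inputs.

```python
import math

def binary_loglik(y, p):
    """Log-likelihood of binary outcomes y given success probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p_hat):
    """1 - l_ur / l_0, with l_0 from the intercept-only model (p = sample mean of y)."""
    ybar = sum(y) / len(y)
    ll_ur = binary_loglik(y, p_hat)
    ll_0 = binary_loglik(y, [ybar] * len(y))
    return 1.0 - ll_ur / ll_0

# Hypothetical outcomes and fitted probabilities (illustrative only).
y = [0, 0, 0, 0, 1, 0, 1, 1]
p_hat = [0.10, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.90]

r2 = mcfadden_r2(y, p_hat)
r2_null = mcfadden_r2(y, [sum(y) / len(y)] * len(y))   # equals zero by construction

print(r2, r2_null)
```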
Alternatively, we can use a sum of squared residuals
measure: $1 - SSR_{ur}/SSR_0$, where $SSR_{ur}$ is the sum
of squared residuals $\hat{u}_i = y_i - G(x_i\hat{\beta})$ and $SSR_0$ is
the total sum of squares of $y_i$.
Several other measures have been suggested (see,
for example, Maddala, 1983, Chap. 2), but goodness
of fit is not as important as statistical and economic
significance of the explanatory variables.
Usually we want to estimate effects of the variable
$x_j$ on the response probability $P(y = 1 \mid x)$.
If $x_j$ is (roughly) continuous, then
\[
\Delta \hat{P}(y = 1 \mid x) \approx \left[ f(x\hat{\beta})\, \hat{\beta}_j \right] \Delta x_j \tag{6.1.13}
\]
for small changes in $x_j$.
Therefore, the estimated partial effect of a
continuous variable on the response probability,
evaluated at x, is given by $f(x\hat{\beta})\hat{\beta}_j$.
Because $f(x\hat{\beta})$ depends on x, we need to decide
which partial effects to report.
Often the sample averages of the $x_j$ are plugged in
to get $f(\bar{x}\hat{\beta})$, with $\bar{x}_1 = 1$ because we include a
constant. We call the resulting partial effect the
partial effect at the average (PEA).
The PEA has drawbacks. First, it need not represent
the partial effect for any particular unit in the
population.
Another issue is that, if x contains nonlinear
functions of underlying variables, such as
logarithms, we must decide whether to use the
average of the nonlinear function or the nonlinear
function of the average.
The latter has some appeal, but software packages
(such as Stata, with its mfx, for "marginal effects,"
command) use the average of the nonlinear
functions (because one must create the nonlinear
functions before including them in logit or probit).
If two or more elements of x are functionally related,
such as quadratics or interactions, it is not even
clear what the PEAs of individual coefficients mean.
For example, suppose $x_{K-1} = age$ and $x_K = age^2$.
Then the reported PEAs for age and $age^2$ are
$f(\bar{x}\hat{\beta})\hat{\beta}_{K-1}$ and $f(\bar{x}\hat{\beta})\hat{\beta}_K$, respectively, where
\[
\bar{x} = \left(1, \bar{x}_2, \ldots, \bar{x}_{K-2}, \overline{age}, \overline{age^2}\right).
\]
These PEAs do not tell us what we want to know
about the partial effect of age on $P(y = 1 \mid x)$. For
any x, the estimated partial effect is
$f(x\hat{\beta})\left(\hat{\beta}_{K-1} + 2\hat{\beta}_K\, age\right)$. Now, we might be
interested in evaluating this partial effect at the
mean values, but that would entail using $(\overline{age})^2$,
rather than $\overline{age^2}$, inside f(·).
If we are really interested in the effect of age on the
response probability, we might want to evaluate the
partial effect at several different values of age,
perhaps evaluating the other explanatory variables
at their means.
For discrete variables, it is known that the average
need not even be a possible outcome of the
variable. For example, if $x_2$ = female is a gender
dummy, then the PEA is the partial effect when
female is replaced with the fraction of women in the
sample.
One way to overcome this conceptual problem is to
compute the partial effects separately for $x_2 = 1$
and $x_2 = 0$.
Standard errors of the partial effects in equation
(6.1.13) can be obtained using the delta method.
Consider the case j = K, and for a given x, define
$\delta_K = \beta_K f(x\beta) = \partial P(y = 1 \mid x)/\partial x_K$.
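Write this relation as $\delta_K = h(\beta)$ to denote that
this is a (nonlinear) function of the vector β. We
assume $x_1 = 1$. The gradient of h(β) is
\[
\nabla_\beta h(\beta) = \left[ \beta_K \frac{df}{dz}(x\beta),\; \beta_K x_2 \frac{df}{dz}(x\beta),\; \ldots,\; \beta_K x_{K-1} \frac{df}{dz}(x\beta),\; \beta_K x_K \frac{df}{dz}(x\beta) + f(x\beta) \right]
\]
where df/dz is the derivative of f with respect to its scalar argument, evaluated at xβ.
The delta method implies the asymptotic variance
of $\hat{\delta}_K$ is estimated as
\[
\left[ \nabla_\beta h(\hat{\beta}) \right] \hat{V} \left[ \nabla_\beta h(\hat{\beta}) \right]' \tag{6.1.14}
\]
where $\hat{V}$ is the asymptotic variance estimate of $\hat{\beta}$.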
The asymptotic standard error of $\hat{\delta}_K$ is the square
root of the expression (6.1.14). Stata does this
calculation for logit and probit using the mfx
command.
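If $x_K$ is a discrete variable, then we can estimate the
change in the predicted probabilities in going from
$c_K$ to $c_K + 1$ as
\[
\begin{aligned}
\hat{\delta}_K &= F\left[ \hat{\beta}_1 + \hat{\beta}_2 \bar{x}_2 + \cdots + \hat{\beta}_{K-1} \bar{x}_{K-1} + \hat{\beta}_K (c_K + 1) \right] \\
&\quad - F\left[ \hat{\beta}_1 + \hat{\beta}_2 \bar{x}_2 + \cdots + \hat{\beta}_{K-1} \bar{x}_{K-1} + \hat{\beta}_K c_K \right]
\end{aligned} \tag{6.1.15}
\]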
An alternative way to summarize the estimated
marginal effects is to estimate the average value of
$\beta_K f(x\beta)$ across the population, or $\beta_K E[f(x\beta)]$.
This quantity is the average partial effect (APE).
A consistent estimator of the APE is
\[
\hat{\beta}_K \left[ n^{-1} \sum_{i=1}^{n} f(x_i\hat{\beta}) \right] \tag{6.1.16}
\]
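when $x_K$ is continuous or
\[
n^{-1} \sum_{i=1}^{n} \left[ F\left( \hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_{K-1} x_{i,K-1} + \hat{\beta}_K \right) - F\left( \hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_{K-1} x_{i,K-1} \right) \right] \tag{6.1.17}
\]
when $x_K$ is binary.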
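The difference between the partial effect at the average and the average partial effect for a continuous regressor can be illustrated with a short Python sketch for the logit case. The coefficient vector `beta` and the data matrix `X` are hypothetical values chosen for illustration, and the function names are introduced here.

```python
import math

def logit_pdf(z):
    """Logistic density f(z) = F(z) [1 - F(z)]."""
    p = 1.0 / (1.0 + math.exp(-z))
    return p * (1.0 - p)

def index(row, beta):
    """Linear index x * beta for one observation."""
    return sum(b * v for b, v in zip(beta, row))

def pea(X, beta, k):
    """Partial effect at the average: f(xbar * beta) * beta_k."""
    n = len(X)
    xbar = [sum(row[j] for row in X) / n for j in range(len(beta))]
    return logit_pdf(index(xbar, beta)) * beta[k]

def ape(X, beta, k):
    """Average partial effect: beta_k times the sample mean of f(x_i * beta)."""
    return beta[k] * sum(logit_pdf(index(row, beta)) for row in X) / len(X)

# Hypothetical estimates and data: a constant and one continuous regressor.
beta = [-2.0, 1.0]
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]

pea_k = pea(X, beta, 1)
ape_k = ape(X, beta, 1)
print(pea_k, ape_k)
```

In this example the average index is zero, where the logistic density peaks, so the PEA exceeds the APE. In other samples the ordering can go either way.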
If some elements of x are functions of each other,
obtaining APEs of the form in equation (6.1.17) is
not useful. If, say, $x_{K-1} = age$ and $x_K = age^2$, we
can estimate the APE of age by averaging the
individual partial effects
$\left(\hat{\beta}_{K-1} + 2\hat{\beta}_K\, age_i\right) f(x_i\hat{\beta})$ across i.
Again, it probably makes more sense to evaluate the
partial effect at different values of age and then to
average these across the other variables, say
\[
n^{-1} \sum_{i=1}^{n} \left( \hat{\beta}_{K-1} + 2\hat{\beta}_K\, age^0 \right) f\left( \hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_{K-2} x_{i,K-2} + \hat{\beta}_{K-1}\, age^0 + \hat{\beta}_K (age^0)^2 \right)
\]
for a given value of $age^0$.
6.2 Multi-response Models
Multinomial response models are unordered discrete
response models with more than two outcomes.
Unordered choice models can be motivated by a
random utility model. For the i th consumer faced
with J + 1 choices, suppose that the utility of
choice j is
\[
U_{ij} = z_{ij}'\theta + e_{ij} \tag{6.2.1}
\]
If the consumer makes choice j in particular, then
we assume that $U_{ij}$ is the maximum among the
J + 1 utilities. Hence, the statistical model is driven
by the probability that choice j is made, which is
$\operatorname{Prob}(U_{ij} > U_{ik})$ for all other $k \ne j$.
The model is made operational by a particular
choice of distribution for the disturbances.
As in the binary choice case, two models are usually
considered, logit and probit.
Because of the need to evaluate multiple integrals
of the normal distribution, the probit model has
found rather limited use in this setting.
The logit model, in contrast, has been widely used
in many fields, including economics.
6.2.1 Multinomial Logit Model
This model applies when a unit’s response or choice
depends on individual characteristics of the unit
but not on attributes of the choices.
Let y denote a random variable taking on the values
{0, 1, ..., J} for J a positive integer, and let x
denote a set of conditioning variables.
For example, if y denotes occupational choice, x
can contain things like education, age, gender, race,
and marital status. As usual, (xi , yi ) is a random
draw from the population.
As in the binary response case, we are interested in
how changes in the elements of x affect the
probabilities of response, $P(y = j \mid x)$, j = 0, 1, ..., J.
Let x be a 1 × K vector with first element equal to
unity. The multinomial logit (MNL) model has
response probabilities
\[
P(y = j \mid x) = \frac{\exp(x\beta_j)}{1 + \sum_{h=1}^{J} \exp(x\beta_h)}, \quad j = 1, \ldots, J \tag{6.2.2}
\]
where $\beta_j$ is K × 1, j = 1, ..., J.
Because the response probabilities must sum to 1,
\[
P(y = 0 \mid x) = \frac{1}{1 + \sum_{h=1}^{J} \exp(x\beta_h)}
\]
The partial effects for this model are complicated.
For continuous $x_k$, we can write
\[
\frac{\partial P(y = j \mid x)}{\partial x_k} = P(y = j \mid x) \left\{ \beta_{jk} - \frac{\sum_{h=1}^{J} \beta_{hk} \exp(x\beta_h)}{g(x, \beta)} \right\} \tag{6.2.3}
\]
where $\beta_{hk}$ is the kth element of $\beta_h$ and
$g(x, \beta) = 1 + \sum_{h=1}^{J} \exp(x\beta_h)$.
Equation (6.2.3) shows that even the direction of
the effect is not determined entirely by $\beta_{jk}$. A
simpler interpretation of $\beta_j$ is given by
\[
p_j(x, \beta)/p_0(x, \beta) = \exp(x\beta_j), \quad j = 1, \ldots, J \tag{6.2.4}
\]
where $p_j(x, \beta)$ denotes the response probability in
(6.2.2).
Thus the change in $p_j(x, \beta)/p_0(x, \beta)$ is approximately
$\beta_{jk} \exp(x\beta_j)\, \Delta x_k$ for roughly continuous $x_k$.
Equivalently, the log-odds ratio is linear in x:
$\log[p_j(x, \beta)/p_0(x, \beta)] = x\beta_j$. This extends to general j and h:
$\log[p_j(x, \beta)/p_h(x, \beta)] = x(\beta_j - \beta_h)$.
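The response probabilities in (6.2.2) and the log-odds relations can be checked numerically with a short Python sketch. The covariate vector `x` and the coefficient vectors in `betas` are hypothetical values, and `mnl_probs` is a name introduced here; outcome 0 is the base category whose coefficients are normalized to zero.

```python
import math

def mnl_probs(x, betas):
    """Return [P(y=0|x), P(y=1|x), ..., P(y=J|x)] for the multinomial logit.

    x     : list of K covariate values, the first equal to 1
    betas : list of J coefficient vectors, one for each outcome 1..J
    """
    expo = [math.exp(sum(b * v for b, v in zip(bj, x))) for bj in betas]
    g = 1.0 + sum(expo)                      # g(x, beta) in the text
    return [1.0 / g] + [e / g for e in expo]

# Hypothetical covariates and coefficients with J = 2 (three outcomes).
x = [1.0, 2.0]
betas = [[0.5, -0.2], [-1.0, 0.3]]

p = mnl_probs(x, betas)
log_odds_10 = math.log(p[1] / p[0])   # should equal x * beta_1 = 0.1
log_odds_21 = math.log(p[2] / p[1])   # should equal x * (beta_2 - beta_1) = -0.5

print(p, log_odds_10, log_odds_21)
```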
Here is another useful fact about the multinomial
logit model. Since
$P(y = j \text{ or } y = h \mid x) = p_j(x, \beta) + p_h(x, \beta)$,
\[
P(y = j \mid y = j \text{ or } y = h, x) = \frac{p_j(x, \beta)}{p_j(x, \beta) + p_h(x, \beta)} = \Lambda\left[ x(\beta_j - \beta_h) \right]
\]
where Λ(·) is the logistic function.
In other words, conditional on the choice being
either j or h, the probability that the outcome is j
follows a standard logit model with parameter
vector $\beta_j - \beta_h$.
Since we have fully specified the density of y given
x, estimation of the MNL model is best carried out
by maximum likelihood. For each i the conditional
log likelihood can be written as
\[
\ell_i(\beta) = \sum_{j=0}^{J} 1[y_i = j] \log[p_j(x_i, \beta)]
\]
where the indicator function selects out the
appropriate response probability for each
observation i.
As usual, we must estimate β by maximizing
$\sum_{i=1}^{n} \ell_i(\beta)$. McFadden (1974) has shown that the
log-likelihood function is globally concave, and this
fact makes the maximization problem
straightforward.
6.2.2 Conditional Logit Model (Due to McFadden)
Suppose the J + 1 disturbances are independent
and identically distributed with Gumbel (type 1
extreme value) distributions,
\[
F(e_{ij}) = \exp\{-\exp(-e_{ij})\}
\]
Suppose also that an individual faces J + 1 choices. Then,
when the data consist of choice-specific attributes instead of
individual-specific characteristics, the natural model
formulation would be
\[
P(y_i = j \mid x_i) = \frac{\exp(x_{ij}\beta)}{\sum_{h=0}^{J} \exp(x_{ih}\beta)}, \quad j = 0, \ldots, J \tag{6.2.5}
\]
The response probabilities in equation (6.2.5)
constitute what is usually called the conditional
logit (CL) model.
Dropping the subscript i and differentiating shows
that the marginal effects are given by
\[
\frac{\partial p_j(x)}{\partial x_{jk}} = p_j(x)\left[ 1 - p_j(x) \right] \beta_k, \quad j = 0, \ldots, J, \; k = 1, \ldots, K \tag{6.2.6}
\]
and
\[
\frac{\partial p_j(x)}{\partial x_{hk}} = -p_j(x)\, p_h(x)\, \beta_k, \quad j \ne h, \; k = 1, \ldots, K \tag{6.2.7}
\]
where $p_j(x)$ is the response probability in (6.2.5)
and $\beta_k$ is the kth element of β.
The CL and MNL models have similar response
probabilities, but they differ in some important
respects.
In the MNL model, the conditioning variables do
not change across alternatives: for each i, $x_i$
contains variables specific to the individual but not
to alternatives.
This model is appropriate for problems where
characteristics of the alternatives are unimportant or
are not of interest, or where the data are not
available.
The CL model is intended specifically for problems
where consumer or firm choices are at least partly
based on observable attributes of each alternative.
The CL model is very important for modelling
probabilistic choice, but has some limitations. An
important restriction is
\[
\frac{p_j(x)}{p_h(x)} = \frac{\exp(x_j\beta)}{\exp(x_h\beta)} = \exp[(x_j - x_h)\beta] \tag{6.2.8}
\]
so relative probabilities for any two alternatives
depend only on the attributes of those two
alternatives.
This is called the independence from irrelevant
alternatives (IIA) assumption because it implies
that adding another alternative or changing the
characteristics of a third alternative does not affect
the relative odds between alternatives j and h.
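The IIA property can be seen directly by computing conditional logit probabilities before and after a third alternative is added. In the sketch below the attribute vectors, the coefficient values and the labels car, bus and train are all hypothetical, and `cl_probs` is a name introduced here.

```python
import math

def cl_probs(X, beta):
    """Conditional logit probabilities; X holds one attribute vector per alternative."""
    expo = [math.exp(sum(b * v for b, v in zip(beta, xj))) for xj in X]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical coefficients on two attributes (say, cost and comfort).
beta = [-0.5, 1.0]
car, bus = [2.0, 1.5], [1.0, 0.5]
train = [1.5, 1.0]            # third alternative, added to the choice set later

p_two = cl_probs([car, bus], beta)
p_three = cl_probs([car, bus, train], beta)

# Relative odds of car versus bus, with and without the train alternative.
odds_two = p_two[0] / p_two[1]
odds_three = p_three[0] / p_three[1]

print(odds_two, odds_three)
```

Both odds equal exp[(x_car - x_bus)β], even though each individual probability falls when the third alternative enters the choice set.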
This implication is often implausible, especially for
applications with similar alternatives.
Several models that relax the IIA assumption have
been suggested.
In the context of the random utility model, the IIA
assumption comes about because the
$\{e_{ij} : j = 0, 1, \ldots, J\}$ are assumed to be
independent type I extreme value (Gumbel) random variables
(called Weibull in some of the older literature).
6.2.3 Multinomial Probit Model
A more flexible assumption is that $e_i$ has a
multivariate normal distribution with arbitrary
correlations between $e_{ij}$ and $e_{ih}$, all $j \ne h$.
The resulting model is called the multinomial
probit model. (Conditional probit model is a
better name)
Theoretically, the multinomial probit is attractive,
but it is computationally difficult.
The response probabilities are very complicated,
involving a (J + 1)-dimensional integral.
This complexity not only makes it difficult to obtain
the partial effects on the response probabilities, it
also makes MLE infeasible for more than about five
alternatives.
However, recent advances in estimation through
simulation make multinomial probit estimation
feasible for many alternatives.
6.2.4 Nested Logit Model (NLM)
NLM is the most popular hierarchical modeling
approach to relaxing IIA.
We illustrate the basic approach where there are
only two hierarchies.
Suppose that the total number of alternatives can
be put into S groups (community choice) of similar
alternatives, and let Gs denote the alternatives (type
of dwelling within communities) within groups.
The …rst hierarchy corresponds to which of the S
groups y falls into, and the second corresponds to
the actual alternative within each group.
McFadden (1981) studied the model
\[
P(y \in G_s \mid x) = \frac{\alpha_s \left[ \sum_{j \in G_s} \exp\left( \rho_s^{-1} x_j'\beta \right) \right]^{\rho_s}}{\sum_{r=1}^{S} \alpha_r \left[ \sum_{j \in G_r} \exp\left( \rho_r^{-1} x_j'\beta \right) \right]^{\rho_r}} \tag{6.2.9}
\]
and
\[
P(y = j \mid y \in G_s, x) = \frac{\exp\left( \rho_s^{-1} x_j'\beta \right)}{\sum_{h \in G_s} \exp\left( \rho_s^{-1} x_h'\beta \right)} \tag{6.2.10}
\]
where equation (6.2.9) is defined for s = 1, 2, ..., S
while equation (6.2.10) is defined for $j \in G_s$ and
s = 1, 2, ..., S.
Of course, if $j \notin G_s$, $P(y = j \mid y \in G_s, x) = 0$.
This model requires a normalization restriction,
usually $\alpha_1 = 1$.
Equation (6.2.9) gives the probability that the
outcome is in group s (conditional on x); then,
conditional on $y \in G_s$, equation (6.2.10) gives the
probability of choosing alternative j within $G_s$.
The response probability $P(y = j \mid x)$, which is
ultimately of interest, is obtained by multiplying
equations (6.2.9) and (6.2.10).
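The multiplication of (6.2.9) and (6.2.10) can be sketched in Python as follows. The grouping of alternatives, the attribute vectors and the values of β, α and ρ are hypothetical, and the function names are introduced here for illustration.

```python
import math

def dot(x, beta):
    """Inner product x'beta."""
    return sum(b * v for b, v in zip(beta, x))

def nested_logit_probs(groups, beta, alpha, rho):
    """P(y = j | x) for every alternative, returned with the same nesting as `groups`."""
    terms = [[math.exp(dot(xj, beta) / rho[s]) for xj in G]
             for s, G in enumerate(groups)]
    weights = [alpha[s] * sum(t) ** rho[s] for s, t in enumerate(terms)]
    denom = sum(weights)
    probs = []
    for s, t in enumerate(terms):
        q_s = weights[s] / denom                        # equation (6.2.9)
        probs.append([q_s * tj / sum(t) for tj in t])   # times equation (6.2.10)
    return probs

# Hypothetical setup: S = 2 groups, with two alternatives and one alternative.
groups = [[[1.0, 0.5], [0.8, 0.2]], [[0.3, 1.0]]]
beta = [1.0, -0.5]

p_nested = nested_logit_probs(groups, beta, alpha=[1.0, 1.2], rho=[0.6, 1.0])
total = sum(sum(g) for g in p_nested)

# With alpha_s = rho_s = 1 for all s, the probabilities take the conditional logit form.
p_flat = nested_logit_probs(groups, beta, alpha=[1.0, 1.0], rho=[1.0, 1.0])
expo = [math.exp(dot(xj, beta)) for G in groups for xj in G]
p_cl = [e / sum(expo) for e in expo]

print(p_nested, total, p_flat, p_cl)
```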
This model can be derived by specifying the joint
distribution for the disturbances in equation (6.2.1).
Equation (6.2.10) implies that, conditional on
choosing group s, the response probabilities take a
CL form with parameter vector $\rho_s^{-1}\beta$.
This suggests a natural two-step procedure. First
estimate $\lambda_s = \rho_s^{-1}\beta$, s = 1, 2, ..., S, by applying
CL analysis separately to each of the groups.
Then, plug the $\hat{\lambda}_s$ into equation (6.2.9) and
estimate $\alpha_s$, s = 1, 2, ..., S, by maximizing the
log-likelihood function
\[
\sum_{i=1}^{n} \sum_{s=1}^{S} 1[y_i \in G_s] \log[q_s(x_i; \hat{\lambda}, \alpha, \rho)]
\]
where $q_s(x; \lambda, \alpha, \rho)$ is the probability in equation
(6.2.9) with $\lambda_s = \rho_s^{-1}\beta$.
This two-step conditional MLE is consistent and
$\sqrt{n}$-asymptotically normal under general regularity
conditions.
Of course, we can also use full MLE. The
log-likelihood for observation i can be written as:
\[
\ell_i(\beta, \alpha, \rho) = \sum_{s=1}^{S} \left( 1[y_i \in G_s] \left\{ \log[q_s(x_i; \beta, \alpha, \rho)] + \sum_{j \in G_s} 1[y_i = j] \log[p_{sj}(x_i; \beta, \rho_s)] \right\} \right)
\]
where $q_s(x_i; \beta, \alpha, \rho)$ is the probability in equation
(6.2.9) and $p_{sj}(x_i; \beta, \rho_s)$ is the probability in
equation (6.2.10).
The regularity conditions for MLE are satisfied under
weak assumptions.
When $\alpha_s = 1$ and $\rho_s = 1$ for all s, the nested logit
model reduces to the CL model.
Thus, a test of IIA (as well as the other
assumptions underlying the CL model) is a test of
$H_0: \alpha_2 = \cdots = \alpha_S = \rho_1 = \cdots = \rho_S = 1$.
McFadden (1987) suggests a score (LM) test,
which only requires estimation of the CL model.
6.2.5 Ordered Logit and Ordered Probit Models
Let $y$ be an ordered response taking on the values
$\{0, 1, \ldots, J\}$ for some known integer $J$. The
ordered probit model for $y$ (conditional on
explanatory variables $x$) can be derived from a
latent variable model.
Assume that a latent variable $y^*$ is determined by
$$y^* = x\beta + e, \qquad e \mid x \sim N(0, 1) \tag{6.2.11}$$
where $\beta$ is $K \times 1$ and, for reasons to be seen, $x$
does not contain a constant.
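Let $\alpha_1 < \alpha_2 < \cdots < \alpha_J$ be unknown cut points
(or threshold parameters), and define
$$\begin{aligned}
y &= 0 && \text{if } y^* \le \alpha_1 \\
y &= 1 && \text{if } \alpha_1 < y^* \le \alpha_2 \\
&\;\;\vdots \\
y &= J && \text{if } y^* > \alpha_J
\end{aligned} \tag{6.2.12}$$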
For example, if y takes on values 0,1, and 2, then
there are two cut points, α1 and α2 .
Given the standard normal assumption for e, it is
easy to derive the conditional distribution of y given
x; we compute each response probability:
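$$\begin{aligned}
P(y = 0 \mid x) &= P(y^* \le \alpha_1 \mid x) = P(x\beta + e \le \alpha_1 \mid x) = \Phi(\alpha_1 - x\beta) \\
P(y = 1 \mid x) &= P(\alpha_1 < y^* \le \alpha_2 \mid x) = \Phi(\alpha_2 - x\beta) - \Phi(\alpha_1 - x\beta) \\
&\;\;\vdots \\
P(y = J - 1 \mid x) &= P(\alpha_{J-1} < y^* \le \alpha_J \mid x) = \Phi(\alpha_J - x\beta) - \Phi(\alpha_{J-1} - x\beta) \\
P(y = J \mid x) &= P(y^* > \alpha_J \mid x) = 1 - \Phi(\alpha_J - x\beta)
\end{aligned} \tag{6.2.13}$$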
One can easily verify that these sum to unity. When
$J = 1$ we get the binary response model:
$$P(y = 1 \mid x) = 1 - P(y = 0 \mid x) = 1 - \Phi(\alpha_1 - x\beta) = \Phi(x\beta - \alpha_1)$$
and so $-\alpha_1$ is the intercept inside $\Phi$.
It is for this reason that x does not contain an
intercept in the ordered probit model.
When there are only two outcomes, we set the
single cut point to zero and estimate the intercept;
this approach leads to the standard probit model.
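The response probabilities in (6.2.13) can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the function name `ordered_probit_probs` and the numerical values of the index and cut points are assumptions chosen for the example, and only the standard library is used.

```python
from statistics import NormalDist


def ordered_probit_probs(xb, cuts):
    """Return [P(y=0|x), ..., P(y=J|x)] for index xb = x*beta.

    cuts holds the J cut points alpha_1 < ... < alpha_J (hypothetical values).
    """
    if any(a >= b for a, b in zip(cuts, cuts[1:])):
        raise ValueError("cut points must be strictly increasing")
    Phi = NormalDist().cdf
    # Phi(alpha_j - xb) at each cut point, padded with 0 and 1 at the ends
    cdf_at_cuts = [0.0] + [Phi(a - xb) for a in cuts] + [1.0]
    # Successive differences give the J + 1 response probabilities
    return [hi - lo for lo, hi in zip(cdf_at_cuts, cdf_at_cuts[1:])]


# Three outcomes (J = 2) with illustrative cut points
probs = ordered_probit_probs(xb=0.3, cuts=[-0.5, 1.0])
print(probs, sum(probs))
```

With a single cut point the second probability equals $\Phi(x\beta - \alpha_1)$, which is the binary probit case discussed above.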
The parameters $\alpha$ and $\beta$ can be estimated by MLE.
For each $i$, the log-likelihood function is
$$\ell_i(\alpha, \beta) = 1[y_i = 0] \log[\Phi(\alpha_1 - x_i\beta)] + 1[y_i = 1] \log[\Phi(\alpha_2 - x_i\beta) - \Phi(\alpha_1 - x_i\beta)] + \cdots + 1[y_i = J] \log[1 - \Phi(\alpha_J - x_i\beta)] \tag{6.2.14}$$
This log likelihood function is well behaved, and
many statistical packages estimate ordered probit
models.
Other distribution functions can be used in place of
Φ. Replacing Φ with the logit function, Λ, gives
the ordered logit model.
In either case we must remember that $\beta$, by itself, is
of limited interest. In most cases we are not
interested in $E(y^* \mid x)$, as $y^*$ is an abstract
construct.
Instead, we are interested in the response
probabilities $P(y = j \mid x)$, just as in the unordered
response case. For the ordered probit model, writing
$p_j(x) = P(y = j \mid x)$,
$$\frac{\partial p_0(x)}{\partial x_k} = -\beta_k \phi(\alpha_1 - x\beta), \qquad \frac{\partial p_J(x)}{\partial x_k} = \beta_k \phi(\alpha_J - x\beta)$$
$$\frac{\partial p_j(x)}{\partial x_k} = \beta_k \left[\phi(\alpha_j - x\beta) - \phi(\alpha_{j+1} - x\beta)\right], \qquad 0 < j < J$$
and the formulas for the ordered logit model are
similar.
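The partial effects for the ordered probit model can be illustrated with a short sketch. The function name `ordered_probit_partial_effects` and the numbers used are hypothetical; the code only evaluates the derivative formulas for the ordered probit case and shows that the effects sum to zero across outcomes, since the probabilities sum to one.

```python
from statistics import NormalDist


def ordered_probit_partial_effects(xb, cuts, beta_k):
    """Partial effects d p_j / d x_k for j = 0..J at index xb = x*beta."""
    phi = NormalDist().pdf
    # Density at each cut point; the outer zeros stand for cut points at
    # minus and plus infinity, where the normal density vanishes.
    dens = [0.0] + [phi(a - xb) for a in cuts] + [0.0]
    # d p_j / d x_k = beta_k * [phi(lower cut - xb) - phi(upper cut - xb)]
    return [beta_k * (lo - hi) for lo, hi in zip(dens, dens[1:])]


effects = ordered_probit_partial_effects(xb=0.3, cuts=[-0.5, 1.0], beta_k=0.8)
print(effects)
```

With a positive `beta_k` the first effect is negative and the last is positive, while the sign of the middle effect depends on where the index sits relative to the cut points.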
Estimated response probabilities at various values of
$x$, such as $\bar{x}$, can be compared for logit and probit
models.
The $\hat{\beta}$ are not directly comparable across models. In
particular, the $\hat{\alpha}_j$ are important determinants of the
magnitude of the estimated probabilities and partial
effects.
While the direction of the effect of $x_k$ on the
probabilities $P(y = 0 \mid x)$ and $P(y = J \mid x)$ is
unambiguously determined by the sign of $\beta_k$, the
sign of $\beta_k$ does not always determine the direction
of the effect for the intermediate outcomes,
$1, 2, \ldots, J - 1$.
6.3 Censored and Truncated Regression Models
Suppose $y^*$ has a normal distribution, with mean $\mu$
and variance $\sigma^2$. Suppose we consider a sample of
size $n$, $\{y_1^*, y_2^*, \ldots, y_n^*\}$, and record only those values
of $y^*$ greater than a constant $c$.
For those values of $y^* \le c$, we record the value $c$.
The observations are
$$y_i = y_i^* \ \text{ if } y_i^* > c, \qquad y_i = c \ \text{ otherwise}$$
The resulting sample $y_1, y_2, \ldots, y_n$ is said to be a
censored sample. For the observations $y_i = c$ all we
know is that $y_i^* \le c$, that is,
$$P(y_i = c) = P(y_i^* \le c)$$
Hence, the likelihood function for estimation of the
parameters $\mu$ and $\sigma^2$ is the hybrid of the normal
regression and the probit model given by
$$L(\mu, \sigma^2 \mid y_1, \ldots, y_n) = \prod_{y_i^* > c} \frac{1}{\sigma} \phi\left(\frac{y_i - \mu}{\sigma}\right) \prod_{y_i^* \le c} \Phi\left(\frac{c - \mu}{\sigma}\right)$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ are, respectively, the density
and the distribution function of the standard normal.
Theorem
6.1: Moments of the Censored Normal Variable:
If $y^* \sim N(\mu, \sigma^2)$ and $y = c$ if $y^* \le c$ or else
$y = y^*$, then
$$E(y) = c\Phi + (\mu + \sigma\lambda)(1 - \Phi)$$
$$\operatorname{var}(y) = \sigma^2 (1 - \Phi)\left[(1 - \delta) + (\alpha - \lambda)^2 \Phi\right]$$
where
$$\Phi[(c - \mu)/\sigma] = \Phi(\alpha) = P(y^* \le c) = \Phi, \qquad \lambda = \phi/(1 - \Phi), \qquad \delta = \lambda^2 - \lambda\alpha$$
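The moments in Theorem 6.1 are easy to evaluate numerically. The sketch below is an illustration only: the function name `censored_normal_moments` is hypothetical, and the example values (a standard normal censored from below at zero) are chosen because the answer is known in closed form.

```python
from statistics import NormalDist


def censored_normal_moments(mu, sigma, c):
    """Mean and variance of y = max(y*, c) with y* ~ N(mu, sigma^2)."""
    std = NormalDist()
    alpha = (c - mu) / sigma
    Phi = std.cdf(alpha)                  # P(y* <= c)
    lam = std.pdf(alpha) / (1.0 - Phi)    # lambda = phi / (1 - Phi)
    delta = lam * lam - lam * alpha       # delta = lambda^2 - lambda*alpha
    mean = c * Phi + (mu + sigma * lam) * (1.0 - Phi)
    var = sigma**2 * (1.0 - Phi) * ((1.0 - delta) + (alpha - lam) ** 2 * Phi)
    return mean, var


# Standard normal censored at zero: mean is 1/sqrt(2*pi), variance 1/2 - 1/(2*pi)
mean, var = censored_normal_moments(mu=0.0, sigma=1.0, c=0.0)
print(mean, var)
```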
Now suppose that before the sample is drawn we
truncate the distribution of y at the point y = c,
so that no observations are drawn for yi > c.
All observations come from the shaded area in Figure
6.1 below.
Figure 6.1 Truncated normal distribution
The density function of the truncated normal
distribution from which the sample is drawn is
$$f(y \mid y < c) = \frac{(1/\sigma)\, \phi[(y - \mu)/\sigma]}{\Phi[(c - \mu)/\sigma]}, \qquad -\infty < y \le c$$
A sample from this truncated normal distribution is
called a truncated sample.
In practice we can have samples that are doubly
truncated, doubly censored, truncated and censored,
and so forth.
Example
Consider a truncation at the level $c_1$ and censoring
at the level $c_2$ ($c_2 < c_1$); that is, only values of $y^*$
with $y^* \le c_1$ are drawn, and among those only
values of $y^* > c_2$ are recorded.
For those observations with $y^* \le c_2$, we record $c_2$; that is,
$$y_i = y_i^* \ \text{ if } y_i^* > c_2, \qquad y_i = c_2 \ \text{ otherwise}$$
The likelihood function for this model is
$$L(\mu, \sigma^2 \mid y_1, \ldots, y_n) = \left[\Phi\left(\frac{c_1 - \mu}{\sigma}\right)\right]^{-n} \prod_{y_i^* > c_2} \frac{1}{\sigma} \phi\left(\frac{y_i - \mu}{\sigma}\right) \prod_{y_i^* \le c_2} \Phi\left(\frac{c_2 - \mu}{\sigma}\right)$$
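Theorem
6.2: Moments of the Truncated Normal Distribution:
If $x \sim N(\mu, \sigma^2)$ and $c$ is a constant, then
$$E(x \mid \text{truncation}) = \mu + \sigma\lambda(\alpha) \tag{6.3.1}$$
$$\operatorname{Var}(x \mid \text{truncation}) = \sigma^2 [1 - \delta(\alpha)] \tag{6.3.2}$$
where $\alpha = (c - \mu)/\sigma$, $\phi(\alpha)$ is the standard normal density and
$$\lambda(\alpha) = \phi(\alpha) / [1 - \Phi(\alpha)] \quad \text{if truncation is } x > c \tag{6.3.3a}$$
$$\lambda(\alpha) = -\phi(\alpha) / \Phi(\alpha) \quad \text{if truncation is } x < c \tag{6.3.3b}$$
$$\text{and} \quad \delta(\alpha) = \lambda(\alpha)[\lambda(\alpha) - \alpha] \tag{6.3.3c}$$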
The function λ(α) is called the inverse Mills ratio.
The function in (6.3.3a) is also called the hazard
function for the standard normal distribution.
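A small sketch of Theorem 6.2 follows. The function name `truncated_normal_moments` and the `from_below` flag are hypothetical conveniences for the example; the two branches correspond to (6.3.3a) and (6.3.3b).

```python
from statistics import NormalDist


def truncated_normal_moments(mu, sigma, c, from_below=True):
    """Mean and variance of x ~ N(mu, sigma^2) given x > c (or x < c)."""
    std = NormalDist()
    alpha = (c - mu) / sigma
    if from_below:
        # Truncation x > c: inverse Mills ratio (hazard function), (6.3.3a)
        lam = std.pdf(alpha) / (1.0 - std.cdf(alpha))
    else:
        # Truncation x < c, (6.3.3b)
        lam = -std.pdf(alpha) / std.cdf(alpha)
    delta = lam * (lam - alpha)           # (6.3.3c)
    return mu + sigma * lam, sigma**2 * (1.0 - delta)


# Standard normal truncated at zero from below and from above
print(truncated_normal_moments(0.0, 1.0, 0.0, from_below=True))
print(truncated_normal_moments(0.0, 1.0, 0.0, from_below=False))
```

In both cases the variance is below $\sigma^2$, and the mean moves away from the truncation point.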
6.3.1 Tobit (Censored Regression) Model
The regression model based on the moments of
censored normal variable is referred to as the
censored regression model or the tobit model
[in reference to Tobin (1958), where the model was
first proposed].
The regression is obtained by making the mean in
the censored normal variable correspond to a
classical regression model.
The general formulation is usually given in terms of
an index function,
$$y_i^* = x_i'\beta + e_i$$
$$y_i = \begin{cases} 0 & \text{if } y_i^* \le 0 \\ y_i^* & \text{if } y_i^* > 0 \end{cases}$$
There are potentially three conditional mean
functions to consider, depending on the purpose of
the study.
For the index variable, sometimes called the latent
variable, $E[y_i^* \mid x_i]$ is $x_i'\beta$.
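Consistent with Theorem 6.1, for an observation
randomly drawn from the population, which may or
may not be censored,
$$E[y_i \mid x_i] = \Phi\left(\frac{x_i'\beta}{\sigma}\right) (x_i'\beta + \sigma\lambda_i) \tag{6.3.4}$$
where
$$\lambda_i = \frac{\phi[(0 - x_i'\beta)/\sigma]}{1 - \Phi[(0 - x_i'\beta)/\sigma]} = \frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)}$$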
Finally, if we intend to confine our attention to
uncensored observations, then the results for the
truncated regression model apply.
The limit observations should not be discarded,
however, because the truncated regression model is
no more amenable to least squares than the
censored data model.
There are differences in the partial effects. For the
index variable,
$$\frac{\partial E[y_i^* \mid x_i]}{\partial x_i} = \beta.$$
But this result is not what will usually be of
interest, because $y_i^*$ is unobserved. For the observed
data, $y_i$, the following general result will be useful:
Theorem
6.3: Partial Effects in the Censored Regression Model
In the censored regression model with latent
regression $y^* = x\beta + e$ and observed dependent
variable, $y = c_1$ if $y^* \le c_1$, $y = c_2$ if $y^* \ge c_2$, and
$y = y^*$ otherwise, where $c_1$ and $c_2$ are constants, let
$f(e)$ and $F(e)$ denote the density and cdf of $e$.
Assume that $e$ is a continuous random variable with
mean 0 and variance $\sigma^2$, and $f(e \mid x) = f(e)$. Then
$$\frac{\partial E[y \mid x]}{\partial x} = \beta \times \operatorname{Prob}[c_1 < y^* < c_2].$$
Note that this general result includes censoring in
either or both tails of the distribution, and it does
not assume that e is normally distributed.
For the standard case with censoring at zero and
normally distributed disturbances, the result
specializes to
$$\frac{\partial E[y_i \mid x_i]}{\partial x_i} = \beta\, \Phi\left(\frac{x_i'\beta}{\sigma}\right). \tag{6.3.5}$$
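The scale factor in (6.3.5) can be checked numerically against the conditional mean in (6.3.4). The sketch below is illustrative only: the function names `tobit_mean` and `tobit_scale_factor` and the values of the index, $\sigma$, and the step size are assumptions. It differentiates $E[y_i \mid x_i]$ with respect to the index $x_i'\beta$ by a central finite difference and compares the result with $\Phi(x_i'\beta/\sigma)$.

```python
from statistics import NormalDist


def tobit_mean(xb, sigma):
    """E[y | x] from (6.3.4) for a tobit model censored at zero."""
    std = NormalDist()
    a = xb / sigma
    lam = std.pdf(a) / std.cdf(a)
    return std.cdf(a) * (xb + sigma * lam)


def tobit_scale_factor(xb, sigma):
    """Factor multiplying beta in (6.3.5): Phi(xb / sigma)."""
    return NormalDist().cdf(xb / sigma)


# Central finite difference of E[y | x] with respect to the index xb
xb, sigma, h = 0.4, 1.5, 1e-5
numeric = (tobit_mean(xb + h, sigma) - tobit_mean(xb - h, sigma)) / (2 * h)
print(numeric, tobit_scale_factor(xb, sigma))
```

Multiplying this factor by $\beta$ gives the partial effect on the observed $y_i$, which is always smaller in magnitude than the effect on the latent variable.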
Although not a formal result, this does suggest a
reason why, in general, least squares estimates of
the coefficients in a tobit model usually resemble
the MLEs times the proportion of nonlimit
observations in the sample.
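McDonald and Moffitt (1980) suggested a useful
decomposition of $\partial E[y_i \mid x_i]/\partial x_i$,
$$\frac{\partial E[y_i \mid x_i]}{\partial x_i} = \beta \times \left\{ \Phi_i \left[1 - \lambda_i(\alpha_i + \lambda_i)\right] + \phi_i(\alpha_i + \lambda_i) \right\}, \tag{6.3.6}$$
where $\alpha_i = x_i'\beta/\sigma$, $\Phi_i = \Phi(\alpha_i)$ and $\lambda_i = \phi_i/\Phi_i$.
Taking the two parts separately, this result
decomposes the slope vector into
$$\frac{\partial E[y_i \mid x_i]}{\partial x_i} = \operatorname{Prob}[y_i > 0]\, \frac{\partial E[y_i \mid x_i, y_i > 0]}{\partial x_i} + E[y_i \mid x_i, y_i > 0]\, \frac{\partial \operatorname{Prob}[y_i > 0]}{\partial x_i} \tag{6.3.7}$$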
Thus, a change in $x_i$ has two effects: it affects the
conditional mean of $y_i^*$ in the positive part of the
distribution, and it affects the probability that the
observation will fall in that part of the distribution.
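As a worked check, offered here as an illustrative step using only the definition $\lambda_i = \phi_i/\Phi_i$ given with (6.3.6), the two terms inside the braces of (6.3.6) combine as follows:

```latex
\begin{aligned}
\Phi_i\left[1 - \lambda_i(\alpha_i + \lambda_i)\right] + \phi_i(\alpha_i + \lambda_i)
  &= \Phi_i + (\alpha_i + \lambda_i)\left(\phi_i - \Phi_i \lambda_i\right) \\
  &= \Phi_i + (\alpha_i + \lambda_i)\left(\phi_i - \phi_i\right) \\
  &= \Phi_i .
\end{aligned}
```

So the decomposition adds up to $\beta\,\Phi(x_i'\beta/\sigma)$, in agreement with (6.3.5); its value lies in separating the two channels, not in changing the total effect.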
Estimation of The Tobit Model
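The log-likelihood for the censored regression model
is
$$\ln L = \sum_{y_i > 0} -\frac{1}{2} \left[ \ln(2\pi) + \ln\sigma^2 + \frac{(y_i - x_i'\beta)^2}{\sigma^2} \right] + \sum_{y_i = 0} \ln\left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]. \tag{6.3.8}$$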
This likelihood is a nonstandard type, because it is a
mixture of discrete and continuous distributions.
In a seminal paper, Amemiya (1973) showed that
proceeding in the usual fashion to maximize ln L
would produce an estimator with all the familiar
desirable properties attained by MLEs.
Olsen's (1978) reparameterization simplifies things
considerably. With $\gamma = \beta/\sigma$ and $\theta = 1/\sigma$, the
log-likelihood is
$$\ln L = \sum_{y_i > 0} -\frac{1}{2} \left[ \ln(2\pi) - \ln\theta^2 + (\theta y_i - x_i'\gamma)^2 \right] + \sum_{y_i = 0} \ln\left[1 - \Phi(x_i'\gamma)\right]. \tag{6.3.9}$$
The Hessian is always negative definite, so Newton's
method is simple to use and converges quickly.
The original parameters can be recovered using
σ = 1/θ and β = γ/θ.
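The equivalence of the two parameterizations can be seen by evaluating both log-likelihoods at matching parameter values. The sketch below assumes small made-up data; the function names `tobit_loglik` and `tobit_loglik_olsen` are hypothetical, and the code evaluates (6.3.8) and (6.3.9) only, without maximizing them.

```python
import math
from statistics import NormalDist


def tobit_loglik(beta, sigma, X, y):
    """Log-likelihood (6.3.8); X is a list of rows, y is censored at zero."""
    Phi = NormalDist().cdf
    total = 0.0
    for xi, yi in zip(X, y):
        xb = sum(b * v for b, v in zip(beta, xi))
        if yi > 0:   # nonlimit observation: normal density part
            total += -0.5 * (math.log(2 * math.pi) + math.log(sigma**2)
                             + (yi - xb) ** 2 / sigma**2)
        else:        # limit observation: probability of censoring
            total += math.log(1.0 - Phi(xb / sigma))
    return total


def tobit_loglik_olsen(gamma, theta, X, y):
    """Log-likelihood (6.3.9) with gamma = beta/sigma and theta = 1/sigma."""
    Phi = NormalDist().cdf
    total = 0.0
    for xi, yi in zip(X, y):
        xg = sum(g * v for g, v in zip(gamma, xi))
        if yi > 0:
            total += -0.5 * (math.log(2 * math.pi) - math.log(theta**2)
                             + (theta * yi - xg) ** 2)
        else:
            total += math.log(1.0 - Phi(xg))
    return total


# Made-up data: a constant and one regressor, one censored observation
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [0.8, 0.0, 2.5]
beta, sigma = [0.2, 0.9], 1.3
gamma, theta = [b / sigma for b in beta], 1.0 / sigma
print(tobit_loglik(beta, sigma, X, y), tobit_loglik_olsen(gamma, theta, X, y))
```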
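The asymptotic covariance matrix for these
estimates can be obtained from that for the
estimates of $[\gamma, \theta]$ using the delta method:
$$\text{Asy. Var}[\hat{\beta}, \hat{\sigma}] = \hat{J}\, \text{Asy. Var}[\hat{\gamma}, \hat{\theta}]\, \hat{J}',$$
where
$$J = \begin{bmatrix} \partial\beta/\partial\gamma' & \partial\beta/\partial\theta \\ \partial\sigma/\partial\gamma' & \partial\sigma/\partial\theta \end{bmatrix} = \begin{bmatrix} (1/\theta) I & -(1/\theta^2)\gamma \\ 0' & -1/\theta^2 \end{bmatrix}.$$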
Almost without exception, it is found that the OLS
estimates are smaller in absolute value than the
MLEs.
A striking empirical regularity is that the maximum
likelihood estimates can often be approximated by
dividing the OLS estimates by the proportion of
nonlimit observations in the sample.
6.3.2 Truncated Regression Model
We now assume that $\mu_i = x_i'\beta$ is the deterministic
part of the classical regression model. Then
$$y_i = x_i'\beta + e_i, \quad \text{where } e_i \mid x_i \sim N(0, \sigma^2),$$
so that $y_i \mid x_i \sim N(x_i'\beta, \sigma^2)$.
We are interested in the distribution of $y_i$ given that
$y_i$ is greater than the truncation point $c$. This is the
result described in Theorem 6.2. It follows that
$$E[y_i \mid y_i > c] = x_i'\beta + \sigma\, \frac{\phi[(c - x_i'\beta)/\sigma]}{1 - \Phi[(c - x_i'\beta)/\sigma]} \tag{6.3.10}$$
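The conditional mean is therefore a nonlinear
function of $c$, $\sigma$, $x$, and $\beta$.
The partial effects in this model in the
subpopulation can be obtained by writing
$$E[y_i \mid y_i > c] = x_i'\beta + \sigma\lambda(\alpha_i) \tag{6.3.11}$$
where now $\alpha_i = (c - x_i'\beta)/\sigma$.
Let $\lambda_i = \lambda(\alpha_i)$ and $\delta_i = \delta(\alpha_i)$. Then
$$\begin{aligned}
\frac{\partial E[y_i \mid y_i > c]}{\partial x_i} &= \beta + \sigma\, \frac{d\lambda(\alpha_i)}{d\alpha_i}\, \frac{\partial \alpha_i}{\partial x_i} = \beta + \sigma \left(\lambda_i^2 - \alpha_i\lambda_i\right) \left(-\frac{\beta}{\sigma}\right) \\
&= \beta \left(1 - \lambda_i^2 + \alpha_i\lambda_i\right) = \beta (1 - \delta_i)
\end{aligned} \tag{6.3.12}$$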
Note the appearance of the scale factor $1 - \delta_i$ from
the truncated variance.
Because $1 - \delta_i$ is between zero and one, we
conclude that for every element of $x_i$, the marginal
effect is less than the corresponding coefficient.
There is a similar attenuation of the variance.
In the subpopulation $y_i > c$, the regression variance
is not $\sigma^2$ but
$$\operatorname{Var}[y_i \mid y_i > c] = \sigma^2 (1 - \delta_i).$$
The result in (6.3.11) is of interest if the analysis is
to be confined to the subpopulation.
If the study is intended to extend to the entire
population, however, then it is the coefficients $\beta$
that are actually of interest.
One's first inclination might be to use OLS to
estimate the parameters of this regression model.
For the subpopulation from which the data are
drawn, we could write (6.3.10) in the form
$$y_i \mid y_i > c = E[y_i \mid y_i > c] + u_i = x_i'\beta + \sigma\lambda_i + u_i \tag{6.3.13}$$
where $u_i$ is $y_i$ minus its conditional expectation.
By construction, $u_i$ has a zero mean, but it is
heteroscedastic:
$$\operatorname{Var}[u_i] = \sigma^2 \left(1 - \lambda_i^2 + \alpha_i\lambda_i\right) = \sigma^2 (1 - \delta_i),$$
which is a function of $x_i$.
If we estimate (6.3.13) by ordinary least squares
regression of y on X, then we have omitted a
variable, the nonlinear term λi .
Without some knowledge of the distribution of x, it
is not possible to determine how serious the bias is
likely to be.
If $E[x \mid y]$ in the full population is a linear function of
$y$, then $\operatorname{plim} \hat{\beta} = \beta\tau$ for some proportionality
constant $\tau$.
This result is consistent with the widely observed
proportionality relationship between least squares
estimates of this model and maximum likelihood
estimates.
In applications, it is usually found that, compared
with consistent maximum likelihood estimates, the
OLS estimates are biased toward zero.
6.3.3 Heckman’s Two Steps Sample Selection
6.3.3.1 Incidental Truncation in a Bivariate Distribution
Suppose that y and z have a bivariate distribution
with correlation ρ.
We are interested in the distribution of y given that
z exceeds a particular value.
Intuition suggests that if y and z are positively
correlated, then the truncation of z should push the
distribution of y to the right.
We are interested in (1) the form of the incidentally
truncated distribution and (2) the mean and
variance of the incidentally truncated random
variable.
Because it has dominated the empirical literature,
we will focus first on the bivariate normal
distribution.
The truncated joint density of $y$ and $z$ is
$$f(y, z \mid z > c) = \frac{f(y, z)}{\operatorname{Prob}(z > c)}$$
To obtain the incidentally truncated marginal
density for $y$, we would then integrate $z$ out of this
expression.
The moments of the incidentally truncated normal
distribution are given in the following theorem.
Theorem
6.4: Moments of the Incidentally Truncated Bivariate
Normal Distribution
If $y$ and $z$ have a bivariate normal distribution with
means $\mu_y$ and $\mu_z$, standard deviations $\sigma_y$ and $\sigma_z$,
and correlation $\rho$, then
$$E[y \mid z > c] = \mu_y + \rho\sigma_y \lambda(\alpha_z),$$
$$\operatorname{Var}[y \mid z > c] = \sigma_y^2 \left[1 - \rho^2 \delta(\alpha_z)\right],$$
where $\alpha_z = (c - \mu_z)/\sigma_z$, $\lambda(\alpha_z) = \phi(\alpha_z)/[1 - \Phi(\alpha_z)]$,
and $\delta(\alpha_z) = \lambda(\alpha_z)[\lambda(\alpha_z) - \alpha_z]$.
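Theorem 6.4 can be evaluated directly. In the sketch below the function name `incidental_truncation_moments` and all numerical inputs are hypothetical; the code simply plugs values into the two formulas of the theorem.

```python
from statistics import NormalDist


def incidental_truncation_moments(mu_y, sigma_y, mu_z, sigma_z, rho, c):
    """E[y | z > c] and Var[y | z > c] for bivariate normal (y, z)."""
    std = NormalDist()
    alpha_z = (c - mu_z) / sigma_z
    lam = std.pdf(alpha_z) / (1.0 - std.cdf(alpha_z))
    delta = lam * (lam - alpha_z)
    mean = mu_y + rho * sigma_y * lam
    var = sigma_y**2 * (1.0 - rho**2 * delta)
    return mean, var


# Positive correlation pushes the mean of y upward when z is truncated from below
print(incidental_truncation_moments(1.0, 2.0, 0.0, 1.0, rho=0.5, c=0.0))
# With zero correlation the truncation of z leaves the moments of y unchanged
print(incidental_truncation_moments(1.0, 2.0, 0.0, 1.0, rho=0.0, c=0.0))
```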
As expected, the truncated mean is pushed in the
direction of the correlation if the truncation is from
below and in the opposite direction if it is from
above.
6.3.3.2 Regression in a Model of Selection
To motivate a regression model that corresponds to
the results in Theorem 6.4, we consider the
following example.
Example
A Model of Labor Supply: A model of female labor supply that
has been examined in many studies consists of two equations:
1. Wage equation: The difference between a person's market
wage, what she could command in the labor market, and her
reservation wage, the wage rate necessary to make her choose
to participate in the labor market, is a function of
characteristics such as age and education, as well as number
of children and where a person lives.
2. Hours equation: The desired number of labor hours supplied
depends on the wage, home characteristics such as whether
there are small children present, marital status, and so on.
The problem of truncation surfaces when we consider that the
second equation describes desired hours, but an actual figure
is observed only if the individual is working. (In most such
studies, only a participation equation, that is, whether hours
are positive or zero, is observable.)
We infer from this that the market wage exceeds the
reservation wage. Thus, the hours variable in the second
equation is incidentally truncated.
To put the preceding examples in a general
framework, let the equation that determines the
sample selection be
$$z_i^* = w_i'\gamma + u_i,$$
and let the equation of primary interest be
$$y_i = x_i'\beta + e_i.$$
The sampling rule is that $y_i$ is observed only when
$z_i^*$ is greater than zero.
Suppose as well that ei and ui have a bivariate
normal distribution with zero means and correlation
ρ.
Then we may insert these in Theorem 6.4 to obtain
the model that applies to the observations in our
sample:
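$$\begin{aligned}
E[y_i \mid y_i \text{ is observed}] &= E[y_i \mid z_i^* > 0] \\
&= E[y_i \mid u_i > -w_i'\gamma] \\
&= x_i'\beta + E[e_i \mid u_i > -w_i'\gamma] \\
&= x_i'\beta + \rho\sigma_e \lambda_i(\alpha_u) \\
&= x_i'\beta + \beta_\lambda \lambda_i(\alpha_u),
\end{aligned}$$
where $\alpha_u = -w_i'\gamma/\sigma_u$ and $\lambda(\alpha_u) = \phi(w_i'\gamma/\sigma_u)/\Phi(w_i'\gamma/\sigma_u)$. So,
$$\begin{aligned}
y_i \mid z_i^* > 0 &= E[y_i \mid z_i^* > 0] + \upsilon_i \\
&= x_i'\beta + \beta_\lambda \lambda_i(\alpha_u) + \upsilon_i
\end{aligned}$$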
Least squares regression using the observed
data (for instance, OLS regression of hours on its
determinants, using only data for women who are
working) produces inconsistent estimates of $\beta$.
Once again, we can view the problem as an omitted
variable.
Least squares regression of y on x and λ would be a
consistent estimator, but if λ is omitted, then the
speci…cation error of an omitted variable is
committed.
Finally, note that the second part of Theorem 6.4
implies that even if $\lambda_i$ were observed, then least
squares would be inefficient. The disturbance $\upsilon_i$ is
heteroscedastic.
The marginal effect of the regressors on $y_i$ in the
observed sample consists of two components.
There is the direct effect on the mean of $y_i$, which is
$\beta$. In addition, for a particular independent variable,
if it appears in the probability that $z_i^*$ is positive,
then it will influence $y_i$ through its presence in $\lambda_i$.
The full effect of changes in a regressor that
appears in both $x_i$ and $w_i$ on $y$ is
$$\frac{\partial E[y_i \mid z_i^* > 0]}{\partial x_{ik}} = \beta_k - \gamma_k \left(\frac{\rho\sigma_e}{\sigma_u}\right) \delta_i(\alpha_u)$$
where $\delta_i = \lambda_i^2 - \alpha_i\lambda_i$.
Suppose that $\rho$ is positive and $E[y_i]$ is greater when
$z_i^*$ is positive than when it is negative. Because
$0 < \delta_i < 1$, the additional term serves to reduce the
marginal effect.
The change in the probability affects the mean of $y_i$
in that the mean in the group $z_i^* > 0$ is higher.
In most cases, the selection variable $z^*$ is not
observed. Rather, we observe only its sign.
In the labour market participation example, we typically
observe only whether a woman is working or not
working. We can infer the sign of $z^*$, but not its
magnitude, from such information.
Because there is no information on the scale of $z^*$,
the disturbance variance in the selection equation
cannot be estimated.
Thus, we reformulate the model as follows:
selection mechanism: $z_i^* = w_i'\gamma + u_i$, $z_i = 1$ if $z_i^* > 0$ and 0 otherwise;
$\operatorname{Prob}(z_i = 1 \mid w_i) = \Phi(w_i'\gamma)$; and
$\operatorname{Prob}(z_i = 0 \mid w_i) = 1 - \Phi(w_i'\gamma)$ (6.3.14)
regression model: $y_i = x_i'\beta + e_i$, observed only if $z_i = 1$,
$$\begin{pmatrix} u_i \\ e_i \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma_e \\ \rho\sigma_e & \sigma_e^2 \end{pmatrix} \right]$$
Suppose that, as in many of these studies, $z_i$ and $w_i$
are observed for a random sample of individuals but
$y_i$ is observed only when $z_i = 1$.
This model is precisely the one we examined earlier,
with
$$E[y_i \mid z_i = 1, x_i, w_i] = x_i'\beta + \beta_\lambda \lambda(w_i'\gamma).$$
6.3.3.3 Two-Step and ML Estimation
The parameters of the sample selection model can
be estimated by maximum likelihood.
However, Heckman’s (1979) two-step estimation
procedure is usually used instead. Heckman’s
method is as follows:
1). Estimate the probit equation by maximum likelihood
to obtain estimates of $\gamma$. For each observation in
the selected sample, compute
$$\hat{\lambda}_i = \phi(w_i'\hat{\gamma})/\Phi(w_i'\hat{\gamma}) \quad \text{and} \quad \hat{\delta}_i = \hat{\lambda}_i (\hat{\lambda}_i + w_i'\hat{\gamma}).$$
2). Estimate $\beta$ and $\beta_\lambda = \rho\sigma_e$ by least squares
regression of $y$ on $x$ and $\hat{\lambda}$.
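The quantities computed in step 1 can be sketched as follows. The function name `inverse_mills_terms` is hypothetical, and the probit coefficients passed in are placeholders: in practice $\hat{\gamma}$ would come from a probit fit, which this sketch does not perform.

```python
from statistics import NormalDist


def inverse_mills_terms(w, gamma_hat):
    """Return (lambda_hat_i, delta_hat_i) for one selected observation.

    w is the vector of selection covariates and gamma_hat the probit
    estimates (both supplied by the caller; illustrative values below).
    """
    std = NormalDist()
    index = sum(g * v for g, v in zip(gamma_hat, w))   # w_i' gamma_hat
    lam = std.pdf(index) / std.cdf(index)
    delta = lam * (lam + index)
    return lam, delta


# Placeholder probit coefficients and one observation's covariates
lam_i, delta_i = inverse_mills_terms(w=[1.0, 0.5], gamma_hat=[0.3, 0.8])
print(lam_i, delta_i)
```

The resulting $\hat{\lambda}_i$ is then added as an extra regressor in step 2, and $\hat{\delta}_i$ is kept for the variance calculations.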
It is possible also to construct consistent estimators
of the individual parameters $\rho$ and $\sigma_e$.
At each observation, the true conditional variance of
the disturbance would be
$$\sigma_i^2 = \sigma_e^2 \left(1 - \rho^2 \delta_i\right)$$
The average conditional variance for the sample
would converge to
$$\operatorname{plim} \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2 = \sigma_e^2 \left(1 - \rho^2 \bar{\delta}\right)$$
which is what is estimated by the least squares
residual variance $e'e/n$.
For the square of the coefficient on $\lambda$, we have
$$\operatorname{plim} \hat{\beta}_\lambda^2 = \rho^2 \sigma_e^2,$$
whereas based on the probit results we have
$$\operatorname{plim} \frac{1}{n} \sum_{i=1}^{n} \hat{\delta}_i = \bar{\delta}.$$
We can then obtain a consistent estimator of $\sigma_e^2$
using
$$\hat{\sigma}_e^2 = \frac{1}{n} e'e + \hat{\bar{\delta}}\, \hat{\beta}_\lambda^2.$$
Finally, an estimator of $\rho^2$ is
$$\hat{\rho}^2 = \frac{\hat{\beta}_\lambda^2}{\hat{\sigma}_e^2}, \tag{6.3.15}$$
which provides a complete set of estimators of the
model's parameters.
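The two estimators just described amount to simple arithmetic once the second-step regression has been run. The sketch below uses a hypothetical function name, `heckman_sigma_rho`, and made-up inputs constructed so that the true values are $\sigma_e^2 = 4$ and $\rho^2 = 0.25$.

```python
def heckman_sigma_rho(resid_ss, n, beta_lambda_hat, delta_bar_hat):
    """Consistent estimators of sigma_e^2 and rho^2 from two-step output.

    resid_ss        : e'e, residual sum of squares from the second step
    n               : number of selected observations
    beta_lambda_hat : estimated coefficient on lambda_hat
    delta_bar_hat   : sample mean of the delta_hat_i
    """
    sigma2_e = resid_ss / n + delta_bar_hat * beta_lambda_hat**2
    rho2 = beta_lambda_hat**2 / sigma2_e
    return sigma2_e, rho2


# Made-up inputs: e'e/n = 4 * (1 - 0.25 * 0.6) = 3.4 and beta_lambda = 0.5 * 2 = 1
sigma2_e, rho2 = heckman_sigma_rho(resid_ss=340.0, n=100,
                                   beta_lambda_hat=1.0, delta_bar_hat=0.6)
print(sigma2_e, rho2)
```

In finite samples nothing forces the implied $\hat{\rho}^2$ to stay below one, so a value above one is a signal to inspect the specification rather than a coding error.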
To test hypotheses, an estimate of the asymptotic
covariance matrix of $[\hat{\beta}', \hat{\beta}_\lambda]'$ is needed.
We have two problems to contend with.
First, we can see in Theorem 6.4 that the
disturbance term in
$$(y_i \mid z_i = 1, x_i, w_i) = x_i'\beta + \rho\sigma_e \lambda_i + \upsilon_i \tag{6.3.16}$$
is heteroscedastic;
$$\operatorname{Var}[\upsilon_i \mid z_i = 1, x_i, w_i] = \sigma_e^2 \left(1 - \rho^2 \delta_i\right)$$
Second, there are unknown parameters in $\lambda_i$.
Suppose that we assume for the moment that $\lambda_i$ and $\delta_i$ are
known (i.e., we do not have to estimate $\gamma$).
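For convenience, let $x_i^* = [x_i, \lambda_i]$, and let $\hat{\beta}^*$ be the
least squares coefficient vector in the regression of $y$
on $x^*$ in the selected data.
Then, using the appropriate form of the variance of
ordinary least squares in a heteroscedastic model,
we would have to estimate
$$\begin{aligned}
\operatorname{Var}[\hat{\beta}^*] &= \sigma_e^2 \left[X^{*\prime} X^*\right]^{-1} \left[ \sum_{i=1}^{n} \left(1 - \rho^2 \delta_i\right) x_i^* x_i^{*\prime} \right] \left[X^{*\prime} X^*\right]^{-1} \\
&= \sigma_e^2 \left[X^{*\prime} X^*\right]^{-1} \left[ X^{*\prime} \left(I - \rho^2 \Delta\right) X^* \right] \left[X^{*\prime} X^*\right]^{-1}
\end{aligned}$$
where $I - \rho^2 \Delta$ is a diagonal matrix with $(1 - \rho^2 \delta_i)$
on the diagonal and $\hat{\beta}^* = [\hat{\beta}', \hat{\beta}_\lambda]'$.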
Without any other complications, this result could
be computed fairly easily using $X^*$, the sample
estimates of $\sigma_e^2$ and $\rho^2$, and the assumed known
values of $\lambda_i$ and $\delta_i$.
The parameters in $\gamma$ do have to be estimated using
the probit equation. Rewrite (6.3.16) as
$$(y_i \mid z_i = 1, x_i, w_i) = x_i'\beta + \beta_\lambda \hat{\lambda}_i + \upsilon_i - \beta_\lambda (\hat{\lambda}_i - \lambda_i)$$
In this form, we see that in the preceding expression
we have ignored both an additional source of
variation in the compound disturbance and
correlation across observations; the same estimate
of $\gamma$ is used to compute $\hat{\lambda}_i$ for every observation.
Heckman has shown that the earlier covariance
matrix can be appropriately corrected by adding a
term inside the brackets,
$$Q = \hat{\rho}^2 \left(X^{*\prime} \hat{\Delta} W\right) \text{Est. Asy. Var}[\hat{\gamma}] \left(W' \hat{\Delta} X^*\right) = \hat{\rho}^2 \hat{F} \hat{V} \hat{F}'$$
where $\hat{V} = \text{Est. Asy. Var}[\hat{\gamma}]$, the estimator of the
asymptotic covariance of the probit coefficients.
Any of the following three estimators may be used
to compute $\hat{V}$:
$$\left\{ \sum_{i=1}^{N} -H_i(\hat{\gamma}) \right\}^{-1}, \qquad \left\{ \sum_{i=1}^{N} s_i(\hat{\gamma})\, s_i(\hat{\gamma})' \right\}^{-1}, \qquad \left\{ \sum_{i=1}^{N} A(w_i, \hat{\gamma}) \right\}^{-1}$$
where
$$s_i(\gamma) = \frac{\phi(w_i'\gamma)\, w_i \left[z_i - \Phi(w_i'\gamma)\right]}{\Phi(w_i'\gamma) \left[1 - \Phi(w_i'\gamma)\right]}$$
is the score vector of the conditional log likelihood
for observation $i$, and
$$-E[H_i(\gamma) \mid w_i] = A(w_i, \gamma) = \frac{[\phi(w_i'\gamma)]^2\, w_i w_i'}{\Phi(w_i'\gamma) \left[1 - \Phi(w_i'\gamma)\right]}$$
is the negative of the expected value of the Hessian
matrix conditional on $w_i$.
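The complete expression is
$$\text{Est. Asy. Var}[\hat{\beta}^*] = \hat{\sigma}_e^2 \left[X^{*\prime} X^*\right]^{-1} \left[ X^{*\prime} \left(I - \hat{\rho}^2 \hat{\Delta}\right) X^* + Q \right] \left[X^{*\prime} X^*\right]^{-1}$$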
The sample selection model can also be estimated
by maximum likelihood.
The full log-likelihood function for the data is built
up from
Prob(selection) × density | selection for observations with $z_i = 1$,
and Prob(nonselection) for observations with $z_i = 0$.
Combining the parts produces the full log-likelihood
function,
$$\ln L = \sum_{z_i = 1} \ln\left[ \frac{\exp\left(-\tfrac{1}{2} e_i^2/\sigma_e^2\right)}{\sigma_e \sqrt{2\pi}}\, \Phi\left( \frac{\rho e_i/\sigma_e + w_i'\gamma}{\sqrt{1 - \rho^2}} \right) \right] + \sum_{z_i = 0} \ln\left[1 - \Phi(w_i'\gamma)\right],$$
where $e_i = y_i - x_i'\beta$.
Note: The FIML estimator with its assumption of bivariate
normality is not less robust than the two-step
estimator because the latter also requires bivariate
normality to form the conditional mean for the
regression.