6-Discrete Choice Models

The document provides an overview of discrete choice models in econometrics, focusing on binary choice models, multinomial models, and censored/truncated regression models. It discusses estimation and inference techniques, including linear probability, probit, and logit models, along with their applications and implications. Additionally, it covers the classification of dependent variables and the importance of understanding the underlying distributions for accurate modeling.

Addis Ababa University

College of Business and Economics

Department of Economics
Econ 607: Econometrics I
6. Discrete Choice Models

Fantu Guta Chemrie (PhD)

F. Guta (CoBE) CoBE
Econ 607 February, 2019 1 / 139
6. Discrete Choice Models (3 weeks)
6.1 Binary Choice Models: Estimation and Inference
6.1.1 Linear Probability Model
6.1.2 Probit and Logit Models
6.1.3 Reporting the Results for Probit and Logit Models

6.2 Multi-Response Models

6.2.1 Multinomial Logit Model

6.2.2 Conditional Logit Model
6.2.3 Multinomial Probit Model
6.2.4 Nested Logit Model
6.2.5 Ordered Response Model


6.3 Censored and Truncated Regression Models
6.3.1 Tobit (Censored Regression) Model
6.3.2 Truncated Regression Model
6.3.3 Heckman’s Two Steps Sample Selection


6. Discrete Choice Models

By discrete regression models we mean those models in which the dependent variable assumes discrete values.
The simplest of these models is that in which the dependent variable y is binary (it can take on only two values, say, 0 and 1).
Things are more complicated when the dependent variable y can assume more than two values. Then we have to classify the cases into categorical and noncategorical variables.

An example of a noncategorical variable is encountered when y denotes the number of patents issued to a company during a year. Here y assumes values of 0, 1, 2, ..., but is not a categorical variable. However, it is a discrete variable.
Categorical variables can further be classified as unordered, sequential and ordered variables.
An example of an unordered categorical variable is:

y = 1 if the mode of transport is car

y = 2 if the mode of transport is bus

y = 3 if the mode of transport is train

An example of a sequential categorical variable is:

y = 1 if the individual has not completed high school

y = 2 if the individual has completed high school but not college

y = 3 if the individual has completed college but not a higher degree

y = 4 if the individual has a professional degree

An example of an ordered categorical variable is:

y = 1 if the individual spends less than $1,000

y = 2 if the individual spends more than $1,000 but less than $3,000

y = 3 if the individual spends more than $3,000 but less than $5,000

y = 4 if the individual spends more than $5,000

6.1 Binary Choice Models: Estimation and Inference

Modelling responses of the form:

$$y_i = \begin{cases} 1 & \text{if the response of the } i\text{th individual is Yes} \\ 0 & \text{if the response of the } i\text{th individual is No} \end{cases}$$

Review of Bernoulli trials: on each trial we have Yes with probability $\theta$ and No with probability $1-\theta$, and the trials are independent.
Generalization of interest: $\theta_i = \theta(x_i'\beta)$, where $x_i'\beta$ is the index for the $i$th individual.

$$E(y_i) = 1\cdot\theta + 0\cdot(1-\theta) = \theta$$

$$var(y_i) = E(y_i^2) - [E(y_i)]^2 = 1^2\cdot\theta + 0^2\cdot(1-\theta) - \theta^2 = \theta(1-\theta)$$
Inference: the proportion of successes in $n$ independent Bernoulli trials, $\frac{1}{n}\sum_{i=1}^{n} y_i$, is the minimum variance unbiased estimator of $\theta$, and also the MLE of $\theta$.
The likelihood function:

$$L(\theta; y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i} = \theta^{\sum_{i=1}^{n} y_i}(1-\theta)^{n-\sum_{i=1}^{n} y_i} = \theta^{x}(1-\theta)^{n-x} \quad \text{if } x = \sum_{i=1}^{n} y_i$$

$$\ln L = \left(\sum_{i=1}^{n} y_i\right)\ln\theta + \left(n - \sum_{i=1}^{n} y_i\right)\ln(1-\theta)$$

Now, it can easily be shown that the MLE of $\theta$ is $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} y_i$.
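The Bernoulli result above can be checked numerically. The sketch below is only an illustration: the data vector and the function name `bernoulli_loglik` are made up for this example. It evaluates $\ln L$ on a fine grid of $\theta$ values and compares the grid maximizer with the sample proportion.

```python
import math

def bernoulli_loglik(theta, y):
    """ln L = (sum y_i) ln(theta) + (n - sum y_i) ln(1 - theta)."""
    s, n = sum(y), len(y)
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

# Illustrative data (made up): 1 = Yes, 0 = No
y = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Closed-form MLE: the sample proportion of successes
theta_hat = sum(y) / len(y)

# Grid check: the grid point with the highest log-likelihood
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda th: bernoulli_loglik(th, y))
```

Because the log-likelihood is strictly concave in $\theta$, the grid maximizer coincides with the sample proportion whenever that proportion lies on the grid.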

Allow $\theta$ to be non-constant, i.e., $\theta_i = \theta(x_i'\beta)$.

The three popular specifications for this relationship:

1). $\theta_i = x_i'\beta$: linear probability model, with $0 \le \theta_i \le 1$.

2). $\theta_i = \Phi(x_i'\beta)$: a probit model if $\Phi$ is the distribution function of $N(0, 1)$.
3). $\theta_i = F(x_i'\beta)$: a logit model if $F$ is the logistic distribution function.
6.1.1 Linear Probability Model

The term linear probability model is used to denote a regression model in which the dependent variable y is a binary variable taking the value 1 if the event occurs and 0 otherwise.
Examples include participation in the labour force, the decision to marry, etc. We write the model in the usual regression framework as:

$$y_i = x_i'\beta + u_i \tag{6.1.1}$$

The calculated value of y from the regression equation, $\hat y_i = x_i'\hat\beta$, will then give the estimated probability that the event will occur given the particular value of x.
In practice, these estimated probabilities can be outside the admissible range (0, 1).
Because $y_i$ takes the value of 1 or 0, the residuals in equation (6.1.1) can take only two values: $1 - x_i'\beta$ and $-x_i'\beta$.

Thus we have

$$\begin{array}{cc} u_i & f(u_i) \\ \hline 1 - x_i'\beta & x_i'\beta \\ -x_i'\beta & 1 - x_i'\beta \end{array}$$

Hence

$$Var(u_i) = (x_i'\beta)(1 - x_i'\beta)^2 + (1 - x_i'\beta)(x_i'\beta)^2 = (x_i'\beta)(1 - x_i'\beta) = E(y_i)[1 - E(y_i)]$$

Because of this heteroscedasticity problem, the OLS estimate of $\beta$ from (6.1.1) will not be efficient.
Goldberger (1964, p. 250) suggested the following procedure: First, estimate (6.1.1) by OLS. Next, compute $\hat y_i(1 - \hat y_i)$ and use weighted least squares; i.e., defining

$$w_i = [\hat y_i(1 - \hat y_i)]^{1/2}$$

we regress $y_i/w_i$ on $x_i/w_i$.

The problems with this procedure are the following:
1). In practice, $\hat y_i(1 - \hat y_i)$ may be negative, although in large samples there is only a very small probability that this will be so.
2). Because the residuals $u_i$ are not normally distributed, the least squares method is not, in general, fully efficient.
3). The most important criticism is that $E(y \mid x)$, interpreted as the probability that the event will occur, can lie outside the limits (0, 1).
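Goldberger's two-step procedure can be sketched as follows. This is a minimal illustration on simulated data, not a definitive implementation: the function name `lpm_wls`, the simulated design, and in particular the clipping of fitted values (one ad hoc way of handling problem 1) are assumptions of this sketch, not part of the original procedure.

```python
import numpy as np

def lpm_wls(X, y, eps=1e-3):
    """Two-step weighted least squares for the linear probability model."""
    # Step 1: OLS of y on X
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b_ols
    # Fitted values outside (0, 1) would make y_hat * (1 - y_hat) non-positive.
    # Clipping them is an assumption of this sketch (problem 1 in the text).
    y_clip = np.clip(y_hat, eps, 1.0 - eps)
    w = np.sqrt(y_clip * (1.0 - y_clip))
    # Step 2: OLS of y_i / w_i on x_i / w_i
    b_wls, *_ = np.linalg.lstsq(X / w[:, None], y / w, rcond=None)
    return b_ols, b_wls, y_hat

# Simulated data (hypothetical): true probabilities stay inside [0.2, 0.8]
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
p_true = 0.5 + 0.3 * x
y = (rng.uniform(size=n) < p_true).astype(float)

b_ols, b_wls, y_hat = lpm_wls(X, y)
```

In this design the true probabilities are linear in x and bounded away from 0 and 1, so both estimators should be close to (0.5, 0.3); with less well-behaved data the clipping step matters and the criticism in point 3 applies.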


6.1.2 Probit and Logit Models

An alternative approach, called by Goldberger (1964) the probit analysis model, is to assume that there is an underlying response variable $y_i^*$ defined by the regression relationship

$$y_i^* = x_i'\beta + u_i \tag{6.1.2}$$

In practice, $y_i^*$ is unobservable. What we observe is a dummy variable $y_i$ defined by

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6.1.3}$$

In this formulation, $x_i'\beta$ is not $E(y_i \mid x_i)$ as in the linear probability model; it is $E(y_i^* \mid x_i)$.
From the relations (6.1.2) and (6.1.3) we get

$$Prob(y_i = 1) = Prob(u_i > -x_i'\beta) = 1 - F(-x_i'\beta) \tag{6.1.4}$$

where $F$ is the cumulative distribution function for $u$.
Hence the likelihood function is

$$L = \prod_{y_i = 0} F(-x_i'\beta) \prod_{y_i = 1} [1 - F(-x_i'\beta)] \tag{6.1.5}$$

The functional form of $F$ in (6.1.5) will depend on the assumptions made about $u_i$ in (6.1.2).
If the cumulative distribution of $u_i$ is logistic, we have the logit model. In this case,

$$F(-x_i'\beta) = \frac{\exp(-x_i'\beta)}{1 + \exp(-x_i'\beta)} = \frac{1}{1 + \exp(x_i'\beta)}$$

Hence,

$$1 - F(-x_i'\beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} \tag{6.1.6}$$

In this case we say that there is a closed-form expression for $F$, because it does not involve integrals explicitly. Not all distributions permit such a closed-form expression.
For instance, in the probit model (or, more accurately, the normal model) we assume that the $u_i$ are $IN(0, \sigma^2)$. In this case,

$$F(-x_i'\beta) = \int_{-\infty}^{-x_i'\beta/\sigma} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt \tag{6.1.7}$$
It can easily be seen from (6.1.7) and the likelihood function (6.1.5) that we can estimate only $\beta/\sigma$, and not $\beta$ and $\sigma$ separately. Hence we might as well assume $\sigma = 1$ to start with.
Because the cumulative normal distribution and the logistic distribution are very close to each other, except at the tails, we are not likely to get very different results using the logit or the probit method.
However, the estimates of $\beta$ from the two methods are not directly comparable.
Because the logistic distribution has variance $\pi^2/3$, the estimates of $\beta$ from the logit model have to be multiplied by $\sqrt{3}/\pi$ to be comparable to the estimates obtained from the probit model (where we normalize $\sigma = 1$).
Amemiya (1981) suggested that the logit estimates be multiplied by $1/1.6 = 0.625$ instead of $\sqrt{3}/\pi$, saying that this transformation produces a closer approximation between the logistic distribution and the distribution function of the standard normal.
He also suggested that the coefficients of the linear probability model, $\hat\beta_{LP}$, and the coefficients of the logit model, $\hat\beta_{L}$, are related by the relationships

$\hat\beta_{LP} \simeq 0.25\,\hat\beta_{L}$ except for the constant term
$\hat\beta_{LP} \simeq 0.25\,\hat\beta_{L} + 0.5$ for the constant term

Thus, if we need to make $\hat\beta_{LP}$ comparable to the probit coefficients, we need to multiply them by 2.5 and subtract 1.25 from the constant term.
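The two scaling rules can be compared directly. The sketch below is an illustration only (the grid range of the index and the helper names are assumptions of this example): it measures the largest gap between the logistic distribution function $\Lambda(z)$ and $\Phi(c\,z)$ for the variance-matching factor $c = \sqrt{3}/\pi$ and for Amemiya's $c = 0.625$.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def max_gap(scale, lo=-4.0, hi=4.0, steps=801):
    """Largest |Lambda(z) - Phi(scale * z)| over a grid of logit index values z."""
    zs = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(abs(logistic_cdf(z) - norm_cdf(scale * z)) for z in zs)

gap_variance = max_gap(math.sqrt(3.0) / math.pi)  # match the variances
gap_amemiya = max_gap(0.625)                      # Amemiya's 1/1.6
```

On this grid the 0.625 factor gives the smaller maximum gap, which is the sense in which it produces a closer approximation than matching variances.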
An alternative way of comparing the models would be to

a). Calculate the sum of squared deviations from the predicted probabilities,
b). Compare the percentages correctly predicted, and
c). Look at the derivatives of the probabilities with respect to a particular independent variable.

Let $x_{ik}$ be the $k$th element of the vector of explanatory variables $x_i$, and let $\beta_k$ be the $k$th element of $\beta$. Then the derivatives for the probabilities given by the linear probability, probit and logit models are, respectively,

$$\frac{\partial}{\partial x_{ik}}[x_i'\beta] = \beta_k$$

$$\frac{\partial}{\partial x_{ik}}\Phi(x_i'\beta) = \phi(x_i'\beta)\,\beta_k$$

$$\frac{\partial}{\partial x_{ik}}L(x_i'\beta) = \frac{\exp(x_i'\beta)}{[1 + \exp(x_i'\beta)]^2}\,\beta_k$$

where, in the last line, $L(\cdot)$ denotes the logistic distribution function.
These derivatives will be needed for predicting the effects of changes in one of the independent variables on the probability of belonging to a group.

For the logit model, the likelihood function (6.1.5) can be written as

$$L = \prod_{i=1}^{n} \left[\frac{1}{1 + \exp(x_i'\beta)}\right]^{1-y_i} \left[\frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}\right]^{y_i} \tag{6.1.8}$$

$$= \frac{\exp\left(\beta'\sum_{i=1}^{n} x_i y_i\right)}{\prod_{i=1}^{n}\left[1 + \exp(\beta' x_i)\right]} \tag{6.1.9}$$

Define $t = \sum_{i=1}^{n} x_i y_i$. To find the maximum-likelihood (ML) estimate of $\beta$, we have

$$\ln L = \beta' t - \sum_{i=1}^{n} \ln\left[1 + \exp(\beta' x_i)\right]$$

Hence, $\partial \ln L/\partial\beta = 0$ gives

$$S(\beta) = -\sum_{i=1}^{n} \frac{\exp(\beta' x_i)}{1 + \exp(\beta' x_i)}\, x_i + t = 0 \tag{6.1.10}$$

These equations are nonlinear in $\beta$. Hence, we have to use the Newton-Raphson method or the scoring method to solve the equations.

The information matrix is

$$I(\beta) = E\left[-\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'}\right] = \sum_{i=1}^{n} \frac{\exp(\beta' x_i)}{\left[1 + \exp(\beta' x_i)\right]^2}\, x_i x_i' \tag{6.1.11}$$

Starting with some initial value of $\beta$, say $\beta_0$, compute the values of $S(\beta_0)$ and $I(\beta_0)$. Then the new estimate of $\beta$ is, by the method of scoring,

$$\beta_1 = \beta_0 + [I(\beta_0)]^{-1} S(\beta_0)$$

In practice, we divide both $S(\beta_0)$ and $I(\beta_0)$ by $n$, the sample size. This iterative procedure is repeated until convergence.
$I(\beta)$ is positive definite at each stage of iteration.
Hence, the iterative procedure will converge to a maximum of the likelihood function, no matter what the starting value is.
If the final converged estimates are denoted by $\hat\beta$, then the asymptotic covariance matrix is estimated by $[I(\hat\beta)]^{-1}$.
These estimated variances and covariances will enable us to test hypotheses about the different elements of $\beta$.
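The scoring iteration for the logit model can be sketched in a few lines. This is a minimal illustration on simulated data, assuming NumPy is available; the function name `logit_scoring`, the starting value of zero and the simulated design are choices made for this example. It uses the fact that $\exp(\beta' x_i)/[1 + \exp(\beta' x_i)]^2 = p_i(1 - p_i)$.

```python
import numpy as np

def logit_scoring(X, y, tol=1e-10, max_iter=50):
    """Method of scoring for the logit model, following (6.1.10) and (6.1.11)."""
    beta = np.zeros(X.shape[1])                      # starting value beta_0 = 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - p)                        # S(beta) = t - sum_i p_i x_i
        info = (X * (p * (1.0 - p))[:, None]).T @ X  # I(beta) = sum_i p_i (1 - p_i) x_i x_i'
        step = np.linalg.solve(info, score)
        beta = beta + step                           # beta_1 = beta_0 + I^{-1} S
        if np.max(np.abs(step)) < tol:
            break
    # Covariance estimate: inverse information evaluated at the final estimate
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = (X * (p * (1.0 - p))[:, None]).T @ X
    return beta, np.linalg.inv(info)

# Simulated data (hypothetical design with a constant and one regressor)
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat, cov_hat = logit_scoring(X, y)
std_err = np.sqrt(np.diag(cov_hat))                  # asymptotic standard errors
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))          # fitted probabilities
```

At convergence the score is zero, so `X.T @ p_hat` equals `X.T @ y`; with a constant in `X`, the fitted probabilities sum to the number of observations with $y_i = 1$.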

After estimating $\hat\beta$, we can get estimated values of the probability that the $i$th observation is equal to 1.
Denoting these estimated values by $\hat p_i$, we have

$$\hat p_i = \frac{\exp(\hat\beta' x_i)}{1 + \exp(\hat\beta' x_i)}$$

Equation (6.1.10) shows that

$$\sum_{i=1}^{n} \hat p_i x_i = \sum_{i=1}^{n} y_i x_i \tag{6.1.12}$$

Thus, if $x_i$ includes a constant term, then the sum of the estimated probabilities is equal to the number of observations in the sample for which $y_i = 1$, i.e., $\sum_{i=1}^{n} \hat p_i = \sum_{i=1}^{n} y_i$.

In other words, the predicted frequency is equal to the actual frequency.
Similarly, if $x_i$ includes a dummy variable, say 1 for male, 0 for female, then the predicted frequency will be equal to the actual frequency for each sex group.
Similar conclusions follow for the linear probability model by virtue of the fact that (6.1.12) are the least squares normal equations in that case.

After estimating $\hat\beta$ and then $\hat p_i$ by the logit model, it is always good practice to check whether or not equations (6.1.12) are satisfied.
Let us denote by $\phi(\cdot)$ and $\Phi(\cdot)$ the density function and the distribution function, respectively, of the standard normal. Then for the probit model the likelihood function corresponding to (6.1.8) is
$$L = \prod_{i=1}^{n} \left[\Phi(\beta' x_i)\right]^{y_i} \left[1 - \Phi(\beta' x_i)\right]^{1-y_i}$$

and the log-likelihood is

$$\log L = \sum_{i=1}^{n} y_i \log \Phi(\beta' x_i) + \sum_{i=1}^{n} (1 - y_i) \log\left[1 - \Phi(\beta' x_i)\right]$$

Differentiating $\log L$ with respect to $\beta$ yields

$$S(\beta) = \frac{\partial \log L}{\partial\beta} = \sum_{i=1}^{n} \frac{y_i - \Phi(\beta' x_i)}{\Phi(\beta' x_i)\left[1 - \Phi(\beta' x_i)\right]}\, \phi(\beta' x_i)\, x_i$$

The ML estimator $\hat\beta_{ML}$ can be obtained as a solution of the equations $S(\hat\beta_{ML}) = 0$.
These equations are nonlinear in $\beta$, and thus we have to solve them by an iterative procedure.

The information matrix is

$$I(\beta) = E\left[-\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\right] = \sum_{i=1}^{n} \frac{\left[\phi(\beta' x_i)\right]^2}{\Phi(\beta' x_i)\left[1 - \Phi(\beta' x_i)\right]}\, x_i x_i'$$

As with the logit model, we start with an initial value of $\beta$, say $\beta_0$, and compute the values $S(\beta_0)$ and $I(\beta_0)$.
Then the new estimate of $\beta$ is, by the method of scoring,

$$\beta_1 = \beta_0 + [I(\beta_0)]^{-1} S(\beta_0)$$

Note that $I(\beta)$ is positive definite at each stage of iteration.
Hence, the iterative procedure will converge to a maximum of the likelihood function, no matter what the starting value is.

If the final converged estimates are denoted by $\hat\beta$, then the asymptotic covariance matrix is estimated by $[I(\hat\beta)]^{-1}$.
These can be used to conduct any tests of significance.
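The same scoring scheme can be sketched for the probit model using the score and information matrix given above. This is an illustration on simulated data, not a production routine: the helper names, the clipping guard on $\Phi$ and the latent-variable simulation are assumptions of this example. The normal distribution function is built from `math.erf` so that only NumPy and the standard library are needed.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def score_and_info(X, y, beta):
    """Probit score S(beta) and information matrix I(beta)."""
    z = X @ beta
    Phi = np.clip(norm_cdf(z), 1e-12, 1.0 - 1e-12)  # guard against division by zero
    phi = norm_pdf(z)
    w = phi / (Phi * (1.0 - Phi))
    score = X.T @ ((y - Phi) * w)
    info = (X * (phi * w)[:, None]).T @ X
    return score, info

def probit_scoring(X, y, tol=1e-10, max_iter=100):
    beta = np.zeros(X.shape[1])                     # starting value beta_0 = 0
    for _ in range(max_iter):
        score, info = score_and_info(X, y, beta)
        step = np.linalg.solve(info, score)
        beta = beta + step                          # beta_1 = beta_0 + I^{-1} S
        if np.max(np.abs(step)) < tol:
            break
    _, info = score_and_info(X, y, beta)
    return beta, np.linalg.inv(info)

# Simulated latent-variable data: y* = x'beta + u, u ~ N(0, 1), y = 1[y* > 0]
rng = np.random.default_rng(3)
n = 800
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.2, 0.8])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

beta_hat, cov_hat = probit_scoring(X, y)
std_err = np.sqrt(np.diag(cov_hat))
```

The ratio `beta_hat / std_err` gives the usual asymptotic t-statistics for tests of significance.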

6.1.3 Reporting the Results for Probit and Logit

Several statistics should be reported routinely in any probit or logit (or other binary choice) analysis.

The $\hat\beta_j$, their standard errors, and the value of the log-likelihood function are reported by software packages that do binary response analysis.
The $\hat\beta_j$ give the signs of the partial effects of the $x_j$ on the response probability, and the statistical significance of $x_j$ is determined by whether we can reject $H_0: \beta_j = 0$.
One measure of goodness of fit that is sometimes reported is the percent correctly predicted.

The easiest way to describe this statistic is to define a binary predictor of $y_i$ to be one if the predicted probability is at least 0.5 and zero otherwise.
More precisely, define the binary variable $\tilde y_i = 1$ if $F(x_i\hat\beta) \ge 0.5$ and $\tilde y_i = 0$ if $F(x_i\hat\beta) < 0.5$. Given $\{\tilde y_i : i = 1, 2, \ldots, n\}$, we can see how well $\tilde y_i$ predicts $y_i$ across all observations.
There are four possible outcomes on each pair $(y_i, \tilde y_i)$; when both are zero or both are one, we make the correct prediction.
In the two cases where one of the pair is zero and the other is one, we make an incorrect prediction. The percent correctly predicted is the percent of times that $\tilde y_i = y_i$. (This goodness-of-fit measure can be computed for the linear probability model, too.)
Some have criticized the prediction rule described above for always using a threshold value of 0.5. One alternative is to use the fraction of successes in the sample as the threshold.

Another possibility is to choose the threshold such that the fraction of $\tilde y_i = 1$ in the sample is the same as (or very close to) $\bar y$. In other words, search over threshold values $\tau$, $0 < \tau < 1$, such that if we define $\tilde y_i = 1$ when $F(x_i\hat\beta) \ge \tau$, then $\sum_{i=1}^{n} \tilde y_i \approx \sum_{i=1}^{n} y_i$.
McFadden (1974) suggests the pseudo-R-squared measure for binary response given by $1 - \ell_{ur}/\ell_0$, where $\ell_{ur}$ is the log-likelihood function for the estimated model and $\ell_0$ is the log-likelihood function in the model with only an intercept.

Because the log-likelihood for a binary response model is always negative, $|\ell_{ur}| \le |\ell_0|$, and so the pseudo-R-squared is always between zero and one.
Alternatively, we can use a sum of squared residuals measure: $1 - SSR_{ur}/SSR_0$, where $SSR_{ur}$ is the sum of squared residuals $\hat u_i = y_i - F(x_i\hat\beta)$ and $SSR_0$ is the total sum of squares of $y_i$.
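The percent correctly predicted and McFadden's pseudo-R-squared are easy to compute once fitted probabilities are available. The sketch below is illustrative only: the outcome vector and the fitted probabilities are hypothetical numbers, not the output of an estimated model, and the function names are made up for this example.

```python
import math

def percent_correct(y, p_hat, threshold=0.5):
    """Percent of observations with y_tilde == y, where y_tilde = 1[p_hat >= threshold]."""
    hits = sum(int((p >= threshold) == (yi == 1)) for yi, p in zip(y, p_hat))
    return 100.0 * hits / len(y)

def loglik(y, p_hat):
    """Binary response log-likelihood evaluated at the given probabilities."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p)
               for yi, p in zip(y, p_hat))

def mcfadden_r2(y, p_hat):
    """1 - l_ur / l_0; the intercept-only model predicts y_bar for everyone."""
    y_bar = sum(y) / len(y)
    return 1.0 - loglik(y, p_hat) / loglik(y, [y_bar] * len(y))

# Hypothetical outcomes and fitted probabilities
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_hat = [0.8, 0.3, 0.6, 0.4, 0.2, 0.55, 0.9, 0.1]

pcp_half = percent_correct(y, p_hat)                    # threshold 0.5
pcp_ybar = percent_correct(y, p_hat, sum(y) / len(y))   # threshold = fraction of successes
r2 = mcfadden_r2(y, p_hat)
```

With these numbers six of the eight observations are classified correctly at the 0.5 threshold. The bound $0 \le$ pseudo-R-squared $\le 1$ is guaranteed only when the probabilities come from a maximum likelihood fit that includes an intercept.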
Several other measures have been suggested (see, for example, Maddala, 1983, Chap. 2), but goodness of fit is not as important as statistical and economic significance of the explanatory variables.
Usually we want to estimate the effects of the variable $x_j$ on the response probability $P(y = 1 \mid x)$.
If $x_j$ is (roughly) continuous, then

$$\Delta \hat P(y = 1 \mid x) \approx \left[f(x\hat\beta)\,\hat\beta_j\right] \Delta x_j \tag{6.1.13}$$

for small changes in $x_j$, where $f$ is the density associated with $F$.

Therefore, the estimated partial effect of a continuous variable on the response probability, evaluated at $x$, is given by $f(x\hat\beta)\hat\beta_j$.
Because $f(x\hat\beta)$ depends on $x$, we need to decide which partial effects to report.
Often the sample averages of the $x_j$ are plugged in to get $f(\bar x\hat\beta)$, with $\bar x_1 = 1$ because we include a constant. We call the resulting partial effect the partial effect at the average (PEA).
The PEA has drawbacks. First, it need not represent the partial effect for any particular unit in the population.
Another issue is that, if $x$ contains nonlinear functions of underlying variables, such as logarithms, we must decide whether to use the average of the nonlinear function or the nonlinear function of the average.
The latter has some appeal, but software packages (such as Stata, with its mfx command, for "marginal effects") use the average of the nonlinear functions (because one must create the nonlinear functions before including them in logit or probit).

If two or more elements of $x$ are functionally related, such as quadratics or interactions, it is not even clear what the PEAs of individual coefficients mean.
For example, suppose $x_{K-1} = age$ and $x_K = age^2$. Then the reported PEAs for $age$ and $age^2$ are $f(\bar x\hat\beta)\hat\beta_{K-1}$ and $f(\bar x\hat\beta)\hat\beta_K$, respectively, where

$$\bar x = \left(1, \bar x_2, \ldots, \bar x_{K-2}, \overline{age}, \overline{age^2}\right)$$

These PEAs do not tell us what we want to know about the partial effect of age on $P(y = 1 \mid x)$. For any $x$, the estimated partial effect is $f(x\hat\beta)\left(\hat\beta_{K-1} + 2\hat\beta_K\, age\right)$. Now, we might be interested in evaluating this partial effect at the mean values, but that would entail using $(\overline{age})^2$, rather than $\overline{age^2}$, inside $f(\cdot)$.
If we are really interested in the effect of age on the response probability, we might want to evaluate the partial effect at several different values of age, perhaps evaluating the other explanatory variables at their means.
For discrete variables, it is known that the average need not even be a possible outcome of the variable. For example, if $x_2 = female$ is a gender dummy, then the PEA is the partial effect when female is replaced with the fraction of women in the sample.

One way to overcome this conceptual problem is to compute the partial effects separately for $x_2 = 1$ and $x_2 = 0$.
Standard errors of the partial effects in equation (6.1.13) can be obtained using the delta method.
Consider the case $j = K$, and for a given $x$, define $\delta_K = \beta_K f(x\beta) = \partial P(y = 1 \mid x)/\partial x_K$.
Write this relation as $\delta_K = h(\beta)$ to denote that this is a (nonlinear) function of the vector $\beta$. We assume $x_1 = 1$. The gradient of $h(\beta)$ is

$$\nabla_\beta h(\beta) = \left[\beta_K \frac{df}{dz}(x\beta),\; \beta_K x_2 \frac{df}{dz}(x\beta),\; \ldots,\; \beta_K x_{K-1} \frac{df}{dz}(x\beta),\; \beta_K x_K \frac{df}{dz}(x\beta) + f(x\beta)\right]$$

where $df/dz$ denotes the derivative of $f$ with respect to its scalar argument $z = x\beta$.

The delta method implies that the asymptotic variance of $\hat\delta_K$ is estimated as

$$\left[\nabla_\beta h(\hat\beta)\right] \hat V \left[\nabla_\beta h(\hat\beta)\right]' \tag{6.1.14}$$

where $\hat V$ is the asymptotic variance estimate of $\hat\beta$.
The asymptotic standard error of $\hat\delta_K$ is the square root of the expression (6.1.14). Stata does this calculation for logit and probit using the mfx command.
If $x_K$ is a discrete variable, then we can estimate the change in the predicted probabilities in going from $c_K$ to $c_K + 1$ as

$$\hat\delta_K = F\left[\hat\beta_1 + \hat\beta_2 \bar x_2 + \cdots + \hat\beta_{K-1} \bar x_{K-1} + \hat\beta_K (c_K + 1)\right] - F\left[\hat\beta_1 + \hat\beta_2 \bar x_2 + \cdots + \hat\beta_{K-1} \bar x_{K-1} + \hat\beta_K c_K\right] \tag{6.1.15}$$

An alternative way to summarize the estimated marginal effects is to estimate the average value of $\beta_K f(x\beta)$ across the population, or $\beta_K E[f(x\beta)]$. This quantity is the average partial effect (APE).
A consistent estimator of the APE is

$$\hat\beta_K \left[n^{-1} \sum_{i=1}^{n} f(x_i\hat\beta)\right] \tag{6.1.16}$$

when $x_K$ is continuous, or

$$n^{-1} \sum_{i=1}^{n} \left[F\left(\hat\beta_1 + \hat\beta_2 x_{i2} + \cdots + \hat\beta_{K-1} x_{i,K-1} + \hat\beta_K\right) - F\left(\hat\beta_1 + \hat\beta_2 x_{i2} + \cdots + \hat\beta_{K-1} x_{i,K-1}\right)\right] \tag{6.1.17}$$

when $x_K$ is binary.
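The difference between the PEA and the APE can be illustrated for a logit model. The sketch below is a hedged example: the coefficient vector `beta_hat` is a hypothetical set of estimates, the regressors are simulated, and the function names are invented for this illustration. For the logistic distribution, $f(z) = F(z)[1 - F(z)]$.

```python
import numpy as np

def logistic_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_pdf(z):
    F = logistic_cdf(z)
    return F * (1.0 - F)

def pea_continuous(X, beta, k):
    """Partial effect at the average: f(x_bar beta) * beta_k."""
    return logistic_pdf(X.mean(axis=0) @ beta) * beta[k]

def ape_continuous(X, beta, k):
    """Average partial effect as in (6.1.16): beta_k * mean_i f(x_i beta)."""
    return beta[k] * logistic_pdf(X @ beta).mean()

def ape_binary(X, beta, k):
    """Average partial effect as in (6.1.17): mean_i [F(.. x_k = 1) - F(.. x_k = 0)]."""
    X1, X0 = X.copy(), X.copy()
    X1[:, k], X0[:, k] = 1.0, 0.0
    return (logistic_cdf(X1 @ beta) - logistic_cdf(X0 @ beta)).mean()

# Simulated regressors: constant, one continuous variable, one dummy
rng = np.random.default_rng(2)
n = 400
X = np.column_stack([np.ones(n),
                     rng.normal(size=n),
                     rng.integers(0, 2, size=n).astype(float)])
beta_hat = np.array([-0.3, 0.9, 0.6])   # hypothetical estimates

pea = pea_continuous(X, beta_hat, 1)
ape = ape_continuous(X, beta_hat, 1)
ape_dummy = ape_binary(X, beta_hat, 2)
```

The PEA evaluates the density once at the average index, while the APE averages the density over all observations, so the two generally differ. Standard errors would still require the delta method, which this sketch does not cover.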
If some elements of $x$ are functions of each other, obtaining APEs of the form in equation (6.1.16) is not useful. If, say, $x_{K-1} = age$ and $x_K = age^2$, we can estimate the APE of age by averaging the individual partial effects $\left(\hat\beta_{K-1} + 2\hat\beta_K\, age_i\right) f(x_i\hat\beta)$ across $i$.

Again, it probably makes more sense to evaluate the partial effect at different values of age and then to average these across the other variables, say

$$n^{-1} \sum_{i=1}^{n} \left(\hat\beta_{K-1} + 2\hat\beta_K\, age^0\right) f\left(\hat\beta_1 + \hat\beta_2 x_{i2} + \cdots + \hat\beta_{K-2} x_{i,K-2} + \hat\beta_{K-1}\, age^0 + \hat\beta_K (age^0)^2\right)$$

for a given value of $age^0$.

6.2 Multi-response Models

Multinomial response models are unordered discrete response models with more than two outcomes.
Unordered choice models can be motivated by a random utility model. For the $i$th consumer faced with $J + 1$ choices, suppose that the utility of choice $j$ is

$$U_{ij} = z_{ij}'\theta + e_{ij} \tag{6.2.1}$$

If the consumer makes choice $j$ in particular, then we assume that $U_{ij}$ is the maximum among the $J + 1$ utilities. Hence, the statistical model is driven by the probability that choice $j$ is made, which is

$$Prob(U_{ij} > U_{ik}) \quad \text{for all other } k \ne j.$$

The model is made operational by a particular choice of distribution for the disturbances.
As in the binary choice case, two models are usually considered, logit and probit.
Because of the need to evaluate multiple integrals of the normal distribution, the probit model has found rather limited use in this setting.
The logit model, in contrast, has been widely used in many fields, including economics.
6.2.1 Multinomial Logit Model

This model applies when a unit's response or choice depends on individual characteristics of the unit but not on attributes of the choices.

Let $y$ denote a random variable taking on the values $\{0, 1, \ldots, J\}$ for $J$ a positive integer, and let $x$ denote a set of conditioning variables.
For example, if $y$ denotes occupational choice, $x$ can contain things like education, age, gender, race, and marital status. As usual, $(x_i, y_i)$ is a random draw from the population.
As in the binary response case, we are interested in how changes in the elements of $x$ affect the probabilities of response, $P(y = j \mid x)$, $j = 0, 1, \ldots, J$.

Let $x$ be a $1 \times K$ vector with first element equal to unity. The multinomial logit (MNL) model has response probabilities

$$P(y = j \mid x) = \frac{\exp(x\beta_j)}{1 + \sum_{h=1}^{J} \exp(x\beta_h)}, \quad j = 1, \ldots, J \tag{6.2.2}$$

where $\beta_j$ is $K \times 1$, $j = 1, \ldots, J$.
Because the response probabilities must sum to 1,

$$P(y = 0 \mid x) = 1 \Big/ \left[1 + \sum_{h=1}^{J} \exp(x\beta_h)\right]$$

The partial effects for this model are complicated. For continuous $x_k$, we can write

$$\frac{\partial P(y = j \mid x)}{\partial x_k} = P(y = j \mid x) \left\{\beta_{jk} - \frac{\sum_{h=1}^{J} \beta_{hk} \exp(x\beta_h)}{g(x, \beta)}\right\} \tag{6.2.3}$$

where $\beta_{hk}$ is the $k$th element of $\beta_h$ and $g(x, \beta) = 1 + \sum_{h=1}^{J} \exp(x\beta_h)$.

Equation (6.2.3) shows that even the direction of the effect is not determined entirely by $\beta_{jk}$. A simpler interpretation of $\beta_j$ is given by

$$p_j(x, \beta)/p_0(x, \beta) = \exp(x\beta_j), \quad j = 1, \ldots, J \tag{6.2.4}$$

where $p_j(x, \beta)$ denotes the response probability in (6.2.2).
Thus the change in $p_j(x, \beta)/p_0(x, \beta)$ is approximately $\beta_{jk} \exp(x\beta_j)\Delta x_k$ for roughly continuous $x_k$.
Equivalently, the log-odds ratio is linear in $x$: $\log\left[p_j(x, \beta)/p_0(x, \beta)\right] = x\beta_j$. This result extends to general $j$ and $h$:

$$\log\left[p_j(x, \beta)/p_h(x, \beta)\right] = x(\beta_j - \beta_h).$$

Here is another useful fact about the multinomial logit model. Since

$$P(y = j \text{ or } y = h \mid x) = p_j(x, \beta) + p_h(x, \beta),$$

$$P(y = j \mid y = j \text{ or } y = h, x) = \frac{p_j(x, \beta)}{p_j(x, \beta) + p_h(x, \beta)} = \Lambda\left[x(\beta_j - \beta_h)\right]$$

where $\Lambda(\cdot)$ is the logistic function.

In other words, conditional on the choice being either $j$ or $h$, the probability that the outcome is $j$ follows a standard logit model with parameter vector $\beta_j - \beta_h$.
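These properties of the MNL response probabilities can be verified numerically. In the sketch below the regressor vector `x` and the coefficient matrix `B` (whose columns play the role of $\beta_1, \ldots, \beta_J$ with $J = 2$) are hypothetical values chosen for illustration, and `mnl_probs` is a name invented here.

```python
import numpy as np

def mnl_probs(x, B):
    """MNL response probabilities for outcomes 0, 1, ..., J as in (6.2.2).

    x : (K,) regressor vector with first element 1
    B : (K, J) matrix whose columns are beta_1, ..., beta_J
    """
    expo = np.exp(x @ B)
    g = 1.0 + expo.sum()                 # g(x, beta) = 1 + sum_h exp(x beta_h)
    return np.concatenate(([1.0 / g], expo / g))

# Hypothetical values with K = 3 and J = 2
x = np.array([1.0, 0.5, -1.2])
B = np.array([[0.2, -0.4],
              [1.0, 0.3],
              [-0.5, 0.8]])

p = mnl_probs(x, B)

# Log-odds relative to the base outcome and between outcomes 1 and 2
log_odds_1_vs_0 = np.log(p[1] / p[0])    # equals x beta_1
log_odds_1_vs_2 = np.log(p[1] / p[2])    # equals x (beta_1 - beta_2)

# Conditional on y in {1, 2}, outcome 1 follows a binary logit
cond_p1 = p[1] / (p[1] + p[2])
binary_logit = 1.0 / (1.0 + np.exp(-(x @ (B[:, 0] - B[:, 1]))))
```

Comparing `cond_p1` with `binary_logit` reproduces the last displayed equality above.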
Since we have fully specified the density of $y$ given $x$, estimation of the MNL model is best carried out by maximum likelihood. For each $i$ the conditional log-likelihood can be written as

$$\ell_i(\beta) = \sum_{j=0}^{J} 1[y_i = j] \log\left[p_j(x_i, \beta)\right]$$

where the indicator function selects out the appropriate response probability for each observation $i$.

As usual, we estimate $\beta$ by maximizing $\sum_{i=1}^{n} \ell_i(\beta)$. McFadden (1974) has shown that the log-likelihood function is globally concave, and this fact makes the maximization problem straightforward.
6.2.2 Conditional Logit Model (Due to McFadden)

Suppose the $J + 1$ disturbances are independent and identically distributed with Gumbel (type 1 extreme value) distributions,

$$F(e_{ij}) = \exp\{-\exp(-e_{ij})\}$$

Suppose also that an individual faces $J + 1$ choices. Then, when the data consist of choice-specific attributes instead of individual-specific characteristics, the natural model formulation would be

$$P(y_i = j \mid x_i) = \frac{\exp(x_{ij}\beta)}{\sum_{h=0}^{J} \exp(x_{ih}\beta)}, \quad j = 0, \ldots, J \tag{6.2.5}$$

The response probabilities in equation (6.2.5) constitute what is usually called the conditional logit (CL) model.
Dropping the subscript $i$ and differentiating shows that the marginal effects are given by

$$\frac{\partial p_j(x)}{\partial x_{jk}} = p_j(x)\left[1 - p_j(x)\right]\beta_k, \quad j = 0, \ldots, J, \; k = 1, \ldots, K \tag{6.2.6}$$

and

$$\frac{\partial p_j(x)}{\partial x_{hk}} = -p_j(x)\, p_h(x)\, \beta_k, \quad j \ne h, \; k = 1, \ldots, K \tag{6.2.7}$$

where $p_j(x)$ is the response probability in (6.2.5) and $\beta_k$ is the $k$th element of $\beta$.

The CL and MNL models have similar response probabilities, but they differ in some important respects.
In the MNL model, the conditioning variables do not change across alternatives: for each $i$, $x_i$ contains variables specific to the individual but not to alternatives.

This model is appropriate for problems where characteristics of the alternatives are unimportant or are not of interest, or where the data are not available.
The CL model is intended specifically for problems where consumer or firm choices are at least partly based on observable attributes of each alternative.

The CL model is very important for modelling probabilistic choice, but has some limitations. An important restriction is

$$\frac{p_j(x)}{p_h(x)} = \frac{\exp(x_j\beta)}{\exp(x_h\beta)} = \exp\left[(x_j - x_h)\beta\right] \tag{6.2.8}$$

so relative probabilities for any two alternatives depend only on the attributes of those two alternatives.
This is called the independence from irrelevant alternatives (IIA) assumption because it implies that adding another alternative or changing the characteristics of a third alternative does not affect the relative odds between alternatives $j$ and $h$.
This implication is often implausible, especially for applications with similar alternatives.
Several models that relax the IIA assumption have been suggested.
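The IIA property in (6.2.8) can be seen in a small numerical experiment. The sketch below uses hypothetical attribute values (think of cost and comfort for three transport modes) and a hypothetical coefficient vector; `cl_probs` is a name invented for this illustration.

```python
import numpy as np

def cl_probs(X_alt, beta):
    """Conditional logit probabilities as in (6.2.5).

    X_alt : (J + 1, K) matrix, one row of attributes per alternative
    beta  : (K,) common coefficient vector
    """
    v = X_alt @ beta
    e = np.exp(v - v.max())      # subtracting the max leaves the ratios unchanged
    return e / e.sum()

beta = np.array([-0.8, 0.5])     # hypothetical coefficients (cost, comfort)
X_alt = np.array([[1.0, 2.0],    # alternative 0
                  [0.5, 1.0],    # alternative 1
                  [1.5, 3.0]])   # alternative 2

p = cl_probs(X_alt, beta)
ratio_before = p[0] / p[1]

# Change the attributes of the third alternative only
X_changed = X_alt.copy()
X_changed[2] = [0.2, 4.0]
p_changed = cl_probs(X_changed, beta)
ratio_after = p_changed[0] / p_changed[1]

# Add a fourth alternative
X_added = np.vstack([X_alt, [0.9, 1.5]])
p_added = cl_probs(X_added, beta)
ratio_added = p_added[0] / p_added[1]
```

All three ratios equal $\exp[(x_0 - x_1)\beta]$, even though the individual probabilities change. This is exactly the restriction that becomes implausible when the new or modified alternative is a close substitute for one of the two being compared.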

In the context of the random utility model, the IIA assumption comes about because the $\{e_{ij} : j = 0, 1, \ldots, J\}$ are assumed to be independent type 1 extreme value (Gumbel) random variables, a distribution that some older texts call Weibull.

6.2.3 Multinomial Probit Model

A more flexible assumption is that $e_i$ has a multivariate normal distribution with arbitrary correlations between $e_{ij}$ and $e_{ih}$, all $j \ne h$.

The resulting model is called the multinomial probit model. (Conditional probit model is a better name.)
Theoretically, the multinomial probit is attractive, but it is computationally difficult.
The response probabilities are very complicated, involving $(J + 1)$-dimensional integrals.
This complexity not only makes it difficult to obtain the partial effects on the response probabilities, it also makes MLE infeasible for more than about five alternatives.

However, recent advances in estimation through simulation make multinomial probit estimation feasible for many alternatives.

6.2.4 Nested Logit Model (NLM)

NLM is the most popular hierarchical modelling approach to relaxing IIA.
We illustrate the basic approach where there are only two hierarchies.
Suppose that the total number of alternatives can be put into $S$ groups (community choice) of similar alternatives, and let $G_s$ denote the alternatives (type of dwelling within communities) within group $s$.
The first hierarchy corresponds to which of the $S$ groups $y$ falls into, and the second corresponds to the actual alternative within each group.

McFadden (1981) studied the model

$$P(y \in G_s \mid x) = \frac{\alpha_s \left[\sum_{j \in G_s} \exp\left(\rho_s^{-1} x_j'\beta\right)\right]^{\rho_s}}{\sum_{r=1}^{S} \alpha_r \left[\sum_{j \in G_r} \exp\left(\rho_r^{-1} x_j'\beta\right)\right]^{\rho_r}} \tag{6.2.9}$$

and

$$P(y = j \mid y \in G_s, x) = \frac{\exp\left(\rho_s^{-1} x_j'\beta\right)}{\sum_{h \in G_s} \exp\left(\rho_s^{-1} x_h'\beta\right)} \tag{6.2.10}$$

where equation (6.2.9) is defined for $s = 1, 2, \ldots, S$ while equation (6.2.10) is defined for $j \in G_s$ and $s = 1, 2, \ldots, S$.

Of course, if $j \notin G_s$, $P(y = j \mid y \in G_s, x) = 0$.
This model requires a normalization restriction, usually $\alpha_1 = 1$.
Equation (6.2.9) gives the probability that the outcome is in group $s$ (conditional on $x$); then, conditional on $y \in G_s$, equation (6.2.10) gives the probability of choosing alternative $j$ within $G_s$.
The response probability $P(y = j \mid x)$, which is ultimately of interest, is obtained by multiplying equations (6.2.9) and (6.2.10).
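The product of (6.2.9) and (6.2.10) can be computed directly. The sketch below is a hedged illustration: the attribute matrix, the grouping of the four alternatives into two nests, and all parameter values are hypothetical, and `nested_logit_probs` is a name invented for this example.

```python
import numpy as np

def nested_logit_probs(X_alt, beta, groups, alpha, rho):
    """P(y = j | x) for every alternative j, as the product of (6.2.9) and (6.2.10).

    X_alt  : (number of alternatives, K) attribute matrix, rows x_j'
    groups : list of index lists, groups[s] holds the alternatives in G_s
    alpha, rho : one value per group
    """
    v = X_alt @ beta
    # Within-group sums: sum_{j in G_s} exp(x_j' beta / rho_s)
    sums = [np.exp(v[g] / rho[s]).sum() for s, g in enumerate(groups)]
    top = np.array([alpha[s] * sums[s] ** rho[s] for s in range(len(groups))])
    q = top / top.sum()                                 # (6.2.9)
    p = np.zeros(len(v))
    for s, g in enumerate(groups):
        p[g] = q[s] * np.exp(v[g] / rho[s]) / sums[s]   # (6.2.9) times (6.2.10)
    return p

# Hypothetical example: four alternatives in two groups
beta = np.array([-0.8, 0.5])
X_alt = np.array([[1.0, 2.0],
                  [0.5, 1.0],
                  [1.5, 3.0],
                  [0.9, 1.5]])
groups = [[0, 1], [2, 3]]

p_nested = nested_logit_probs(X_alt, beta, groups, alpha=[1.0, 1.3], rho=[0.6, 0.8])

# With alpha_s = 1 and rho_s = 1 the model collapses to the conditional logit
p_cl_case = nested_logit_probs(X_alt, beta, groups, alpha=[1.0, 1.0], rho=[1.0, 1.0])
v = X_alt @ beta
p_cl = np.exp(v) / np.exp(v).sum()
```

Setting every $\alpha_s$ and $\rho_s$ to one reproduces the CL probabilities, which is the restriction tested when IIA is examined within the nested logit framework.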

F. Guta (CoBE) Econ 607 February, 2019 73 / 139

This model can be derived by specifying the joint
distribution for the disturbances in equation (6.2.1).
Equation (6.2.10) implies that, conditional on
choosing group s, the response probabilities take a
CL form with parameter vector $\rho_s^{-1}\beta$.
This suggests a natural two-step procedure. First
estimate $\lambda_s = \rho_s^{-1}\beta$, $s = 1, 2, \ldots, S$, by applying
CL analysis separately to each of the groups.
Then, plug the $\hat{\lambda}_s$ into equation (6.2.9) and
estimate $\alpha_s$, $s = 2, \ldots, S$ (recall the normalization $\alpha_1 = 1$),
and $\rho_s$, $s = 1, 2, \ldots, S$, by maximizing the log-likelihood function

$$\sum_{i=1}^{n} \sum_{s=1}^{S} 1[y_i \in G_s] \log\left[q_s(x_i; \hat{\lambda}, \alpha, \rho)\right]$$

where $q_s(x; \lambda, \alpha, \rho)$ is the probability in equation
(6.2.9) with $\lambda_s = \rho_s^{-1}\beta$.

This two-step conditional MLE is consistent and
$\sqrt{n}$-asymptotically normal under general regularity
conditions.
Of course, we can also use full MLE. The log-likelihood
for observation i can be written as:

$$\ell_i(\beta, \alpha, \rho) = \sum_{s=1}^{S} 1[y_i \in G_s] \left\{ \log\left[q_s(x_i; \beta, \alpha, \rho)\right] + \sum_{j \in G_s} 1[y_i = j] \log\left[p_{sj}(x_i; \beta, \rho_s)\right] \right\}$$

where $q_s(x_i; \beta, \alpha, \rho)$ is the probability in equation
(6.2.9) and $p_{sj}(x_i; \beta, \rho_s)$ is the probability in
equation (6.2.10).
The regularity conditions for MLE are satisfied under
weak assumptions.
When $\alpha_s = 1$ and $\rho_s = 1$ for all s, the nested logit
model reduces to the CL model.
Thus, a test of IIA (as well as the other
assumptions underlying the CL model) is a test of
$H_0: \alpha_2 = \cdots = \alpha_S = \rho_1 = \cdots = \rho_S = 1$.
McFadden (1987) suggests a score (LM) test,
which only requires estimation of the CL model.

6.2.5 Ordered Logit and Ordered Probit Models

Let y be an ordered response taking on the values
$\{0, 1, \ldots, J\}$ for some known integer J. The
ordered probit model for y (conditional on
explanatory variables x) can be derived from a
latent variable model.

Assume that a latent variable $y^*$ is determined by

$$y^* = x\beta + e, \qquad e \mid x \sim N(0, 1) \tag{6.2.11}$$

where $\beta$ is $K \times 1$ and, for reasons to be seen, x
does not contain a constant.

Let $\alpha_1 < \alpha_2 < \cdots < \alpha_J$ be unknown cut points
(or threshold parameters), and define

$$\begin{aligned}
y &= 0 \quad \text{if } y^* \le \alpha_1 \\
y &= 1 \quad \text{if } \alpha_1 < y^* \le \alpha_2 \\
&\;\;\vdots \\
y &= J \quad \text{if } y^* > \alpha_J
\end{aligned} \tag{6.2.12}$$

For example, if y takes on values 0, 1, and 2, then
there are two cut points, $\alpha_1$ and $\alpha_2$.
Given the standard normal assumption for e, it is
easy to derive the conditional distribution of y given
x; we compute each response probability:

$$\begin{aligned}
P(y = 0 \mid x) &= P(y^* \le \alpha_1 \mid x) = P(x\beta + e \le \alpha_1 \mid x) = \Phi(\alpha_1 - x\beta) \\
P(y = 1 \mid x) &= P(\alpha_1 < y^* \le \alpha_2 \mid x) = \Phi(\alpha_2 - x\beta) - \Phi(\alpha_1 - x\beta) \\
&\;\;\vdots \\
P(y = J - 1 \mid x) &= P(\alpha_{J-1} < y^* \le \alpha_J \mid x) = \Phi(\alpha_J - x\beta) - \Phi(\alpha_{J-1} - x\beta) \\
P(y = J \mid x) &= P(y^* > \alpha_J \mid x) = 1 - \Phi(\alpha_J - x\beta)
\end{aligned} \tag{6.2.13}$$

One can easily verify that these sum to unity. When
J = 1 we get the binary response model:

$$P(y = 1 \mid x) = 1 - P(y = 0 \mid x) = 1 - \Phi(\alpha_1 - x\beta) = \Phi(x\beta - \alpha_1)$$

and so $-\alpha_1$ is the intercept inside $\Phi$.
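The response probabilities in (6.2.13) are simple to compute once the cut points and the index $x\beta$ are given. A minimal sketch follows; the function name `ordered_probit_probs` and the numerical values are hypothetical.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(xb, cuts):
    """Return [P(y=0|x), ..., P(y=J|x)] for cut points alpha_1 < ... < alpha_J."""
    # Phi(alpha_j - x*beta) at each cut point, padded with 0 and 1 at the ends
    bounds = [0.0] + [Phi(a - xb) for a in cuts] + [1.0]
    # Successive differences give the J + 1 response probabilities of (6.2.13)
    return [bounds[j + 1] - bounds[j] for j in range(len(cuts) + 1)]

probs = ordered_probit_probs(xb=0.4, cuts=[-0.5, 0.3, 1.2])   # assumed values
binary = ordered_probit_probs(xb=0.4, cuts=[0.1])             # J = 1 case
```

With a single cut point the second entry of `binary` equals $\Phi(x\beta - \alpha_1)$, the binary probit probability.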
It is for this reason that x does not contain an
intercept in the ordered probit model.
When there are only two outcomes, we set the
single cut point to zero and estimate the intercept;
this approach leads to the standard probit model.
The parameters α and β can be estimated by MLE.
For each i, the log likelihood function is

$$\begin{aligned}
\ell_i(\alpha, \beta) ={}& 1[y_i = 0] \log\left[\Phi(\alpha_1 - x_i\beta)\right] + 1[y_i = 1] \log\left[\Phi(\alpha_2 - x_i\beta) - \Phi(\alpha_1 - x_i\beta)\right] \\
&+ \cdots + 1[y_i = J] \log\left[1 - \Phi(\alpha_J - x_i\beta)\right]
\end{aligned} \tag{6.2.14}$$

This log likelihood function is well behaved, and
many statistical packages estimate ordered probit
models.
Other distribution functions can be used in place of
Φ. Replacing Φ with the logit function, Λ, gives
the ordered logit model.
In either case we must remember that $\beta$, by itself, is
of limited interest. In most cases we are not
interested in $E(y^* \mid x) = x\beta$, as $y^*$ is an abstract
construct.
Instead, we are interested in the response
probabilities $P(y = j \mid x)$, just as in the unordered
response case. For the ordered probit model

$$\frac{\partial p_0(x)}{\partial x_k} = -\beta_k \phi(\alpha_1 - x\beta), \qquad \frac{\partial p_J(x)}{\partial x_k} = \beta_k \phi(\alpha_J - x\beta)$$

$$\frac{\partial p_j(x)}{\partial x_k} = \beta_k \left[\phi(\alpha_j - x\beta) - \phi(\alpha_{j+1} - x\beta)\right], \qquad 0 < j < J$$

and the formulas for the ordered logit model are
similar.
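The partial-effect formulas for the ordered probit model can be checked against numerical derivatives of the response probabilities. The sketch below assumes a single regressor $x_k$ with coefficient `beta_k`; all names and numbers are hypothetical.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def response_probs(xb, cuts):
    """Ordered probit probabilities P(y=j|x), j = 0..J."""
    bounds = [0.0] + [Phi(a - xb) for a in cuts] + [1.0]
    return [bounds[j + 1] - bounds[j] for j in range(len(cuts) + 1)]

def partial_effects(xb, cuts, beta_k):
    """Analytic d P(y=j|x) / d x_k for j = 0..J."""
    # The density vanishes at -infinity and +infinity, hence the zero padding
    dens = [0.0] + [phi(a - xb) for a in cuts] + [0.0]
    return [beta_k * (dens[j] - dens[j + 1]) for j in range(len(cuts) + 1)]

beta_k, x_k, cuts = 0.8, 0.5, [-0.5, 0.3, 1.2]   # assumed values
analytic = partial_effects(beta_k * x_k, cuts, beta_k)

# Central finite difference in x_k as an independent check
h = 1e-6
up = response_probs(beta_k * (x_k + h), cuts)
down = response_probs(beta_k * (x_k - h), cuts)
numeric = [(u - d) / (2 * h) for u, d in zip(up, down)]
```

Because the probabilities sum to one, the partial effects sum to zero; with a positive `beta_k` the effect on the lowest category is negative and the effect on the highest is positive.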

Estimated response probabilities at various values of
x, such as $\bar{x}$, can be compared for logit and probit
models.
The $\hat{\beta}$ are not directly comparable across models. In
particular, the $\hat{\alpha}_j$ are important determinants of the
magnitude of the estimated probabilities and partial
effects.
While the direction of the effect of $x_k$ on the
probabilities $P(y = 0 \mid x)$ and $P(y = J \mid x)$ is
unambiguously determined by the sign of $\beta_k$, the
sign of $\beta_k$ does not always determine the direction
of the effect for the intermediate outcomes,
$1, 2, \ldots, J - 1$.

6.3 Censored and Truncated Regression Models

Suppose $y^*$ has a normal distribution, with mean $\mu$
and variance $\sigma^2$. Suppose we consider a sample of
size n, $\{y_1^*, y_2^*, \ldots, y_n^*\}$, and record only those values
of $y^*$ greater than a constant c.
For those values of $y^* \le c$, we record the value c.
The observations are

$$y_i = \begin{cases} y_i^* & \text{if } y_i^* > c \\ c & \text{otherwise} \end{cases}$$

The resulting sample $y_1, y_2, \ldots, y_n$ is said to be a
censored sample. For the observations $y_i = c$ all we
know is that $y_i^* \le c$, that is,

$$P(y_i = c) = P(y_i^* \le c)$$
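Hence, the likelihood function for estimation of the
parameters $\mu$ and $\sigma^2$ is the hybrid of the normal
regression and the probit model given by

$$L(\mu, \sigma^2 \mid y_1, \ldots, y_n) = \prod_{y_i > c} \frac{1}{\sigma} \phi\left(\frac{y_i - \mu}{\sigma}\right) \prod_{y_i^* \le c} \Phi\left(\frac{c - \mu}{\sigma}\right)$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are, respectively, the density
and the distribution function of the standard normal.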
Theorem 6.1: Moments of the Censored Normal Variable

If $y^* \sim N(\mu, \sigma^2)$ and $y = c$ if $y^* \le c$ or else
$y = y^*$, then

$$E(y) = c\,\Phi + (\mu + \sigma\lambda)(1 - \Phi)$$

$$\mathrm{var}(y) = \sigma^2 (1 - \Phi)\left[(1 - \delta) + (\alpha - \lambda)^2 \Phi\right]$$

where

$$\Phi\left[(c - \mu)/\sigma\right] = \Phi(\alpha) = P(y^* \le c) = \Phi, \qquad \lambda = \phi/(1 - \Phi), \qquad \text{and} \qquad \delta = \lambda^2 - \lambda\alpha.$$
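As a quick sanity check on Theorem 6.1, the formulas can be evaluated for a standard normal censored at zero, where the answers are known in closed form: $E[\max(0, Z)] = \phi(0) = 1/\sqrt{2\pi}$ and $\mathrm{var}[\max(0, Z)] = 1/2 - 1/(2\pi)$. The function name below is hypothetical.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_normal_moments(mu, sigma, c):
    """Mean and variance of y = max(y*, c) with y* ~ N(mu, sigma^2)."""
    alpha = (c - mu) / sigma
    big_phi = Phi(alpha)                    # P(y* <= c)
    lam = phi(alpha) / (1.0 - big_phi)      # lambda = phi / (1 - Phi)
    delta = lam * lam - lam * alpha         # delta = lambda^2 - lambda * alpha
    mean = c * big_phi + (mu + sigma * lam) * (1.0 - big_phi)
    var = sigma ** 2 * (1.0 - big_phi) * (
        (1.0 - delta) + (alpha - lam) ** 2 * big_phi)
    return mean, var

# Standard normal censored from below at zero
mean, var = censored_normal_moments(mu=0.0, sigma=1.0, c=0.0)
```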
Now suppose that before the sample is drawn we
truncate the distribution of $y^*$ at the point $y^* = c$,
so that no observations are drawn for $y_i^* > c$.
All observations come from the shaded area in Figure
6.1 below.

Figure 6.1 Truncated normal distribution

The density function of the truncated normal
distribution from which the sample is drawn is

$$f(y^* \mid y^* < c) = \frac{(1/\sigma)\,\phi\left[(y^* - \mu)/\sigma\right]}{\Phi\left[(c - \mu)/\sigma\right]}, \qquad (-\infty < y^* \le c)$$

A sample from this truncated normal distribution is
called a truncated sample.
In practice we can have samples that are doubly
truncated, doubly censored, truncated and censored,
and so forth.
Example
Consider a truncation at the level $c_1$ and censoring
at the level $c_2$ ($c_2 < c_1$); that is, only samples of $y^*$
with $y^* \le c_1$ are drawn, and among those samples
only values of $y^* > c_2$ are recorded.
For those observations with $y^* \le c_2$, we record $c_2$; that is,

$$y_i = \begin{cases} y_i^* & \text{if } y_i^* > c_2 \\ c_2 & \text{otherwise} \end{cases}$$

The likelihood function for this model is

$$L(\mu, \sigma^2 \mid y_1, \ldots, y_n) = \left[\Phi\left(\frac{c_1 - \mu}{\sigma}\right)\right]^{-n} \prod_{y_i > c_2} \frac{1}{\sigma} \phi\left(\frac{y_i - \mu}{\sigma}\right) \prod_{y_i^* \le c_2} \Phi\left(\frac{c_2 - \mu}{\sigma}\right)$$
Theorem 6.2: Moments of the Truncated Normal Distribution

If $x \sim N(\mu, \sigma^2)$ and c is a constant, then

$$E(x \mid \text{truncation}) = \mu + \sigma\lambda(\alpha) \tag{6.3.1}$$

$$\mathrm{Var}(x \mid \text{truncation}) = \sigma^2\left[1 - \delta(\alpha)\right] \tag{6.3.2}$$

where $\alpha = (c - \mu)/\sigma$, $\phi(\alpha)$ is the standard normal density and

$$\lambda(\alpha) = \phi(\alpha)/\left[1 - \Phi(\alpha)\right] \quad \text{if truncation is } x > c \tag{6.3.3a}$$

$$\lambda(\alpha) = -\phi(\alpha)/\Phi(\alpha) \quad \text{if truncation is } x < c \tag{6.3.3b}$$

$$\text{and} \quad \delta(\alpha) = \lambda(\alpha)\left[\lambda(\alpha) - \alpha\right] \tag{6.3.3c}$$

The function $\lambda(\alpha)$ is called the inverse Mills ratio.
The function in (6.3.3a) is also called the hazard
function for the standard normal distribution.
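Theorem 6.2 can be illustrated with the half-normal case, where a standard normal is truncated at zero: the mean is $\pm\sqrt{2/\pi}$ depending on the side retained, and the variance is $1 - 2/\pi$ in both cases. The following is a sketch only; the function name and the `side` labels are hypothetical.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_moments(mu, sigma, c, side):
    """Mean and variance of x ~ N(mu, sigma^2) given x > c or x < c."""
    alpha = (c - mu) / sigma
    if side == "x>c":
        lam = phi(alpha) / (1.0 - Phi(alpha))   # (6.3.3a)
    elif side == "x<c":
        lam = -phi(alpha) / Phi(alpha)          # (6.3.3b)
    else:
        raise ValueError("side must be 'x>c' or 'x<c'")
    delta = lam * (lam - alpha)                 # (6.3.3c)
    return mu + sigma * lam, sigma ** 2 * (1.0 - delta)

upper_mean, upper_var = truncated_normal_moments(0.0, 1.0, 0.0, "x>c")
lower_mean, lower_var = truncated_normal_moments(0.0, 1.0, 0.0, "x<c")
```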
6.3.1 Tobit (Censored Regression) Model

The regression model based on the moments of the
censored normal variable is referred to as the
censored regression model or the tobit model
[in reference to Tobin (1958), where the model was
first proposed].
The regression is obtained by making the mean in
the censored normal variable correspond to a
classical regression model.
The general formulation is usually given in terms of
an index function,

$$y_i^* = x_i'\beta + e_i$$

$$y_i = \begin{cases} 0 & \text{if } y_i^* \le 0 \\ y_i^* & \text{if } y_i^* > 0 \end{cases}$$

There are potentially three conditional mean
functions to consider, depending on the purpose of
the study.
For the index variable, sometimes called the latent
variable, $E[y_i^* \mid x_i]$ is $x_i'\beta$.

Consistent with Theorem 6.1, for an observation
randomly drawn from the population, which may or
may not be censored,

$$E[y_i \mid x_i] = \Phi\left(\frac{x_i'\beta}{\sigma}\right)\left(x_i'\beta + \sigma\lambda_i\right) \tag{6.3.4}$$

where

$$\lambda_i = \frac{\phi\left[(0 - x_i'\beta)/\sigma\right]}{1 - \Phi\left[(0 - x_i'\beta)/\sigma\right]} = \frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)}$$

Finally, if we intend to confine our attention to
uncensored observations, then the results for the
truncated regression model apply.
The limit observations should not be discarded,
however, because the truncated regression model is
no more amenable to least squares than the
censored data model.
There are differences in the partial effects. For the
index variable,

$$\frac{\partial E[y_i^* \mid x_i]}{\partial x_i} = \beta.$$
But this result is not what will usually be of
interest, because $y_i^*$ is unobserved. For the observed
data, $y_i$, the following general result will be useful:

Theorem 6.3: Partial Effects in the Censored Regression Model

In the censored regression model with latent
regression $y^* = x'\beta + e$ and observed dependent
variable, $y = c_1$ if $y^* \le c_1$, $y = c_2$ if $y^* \ge c_2$, and
$y = y^*$ otherwise, where $c_1$ and $c_2$ are constants, let
$f(e)$ and $F(e)$ denote the density and cdf of e.
Assume that e is a continuous random variable with
mean 0 and variance $\sigma^2$, and $f(e \mid x) = f(e)$. Then

$$\frac{\partial E[y \mid x]}{\partial x} = \beta \times \mathrm{Prob}[c_1 < y^* < c_2].$$

Note that this general result includes censoring in
either or both tails of the distribution, and it does
not assume that e is normally distributed.
For the standard case with censoring at zero and
normally distributed disturbances, the result
specializes to

$$\frac{\partial E[y_i \mid x_i]}{\partial x_i} = \beta\,\Phi\left(\frac{x_i'\beta}{\sigma}\right). \tag{6.3.5}$$
Although not a formal result, this does suggest a
reason why, in general, least squares estimates of
the coefficients in a tobit model usually resemble
the MLEs times the proportion of nonlimit
observations in the sample.
McDonald and Moffitt (1980) suggested a useful
decomposition of $\partial E[y_i \mid x_i]/\partial x_i$,

$$\frac{\partial E[y_i \mid x_i]}{\partial x_i} = \beta \times \left\{\Phi_i\left[1 - \lambda_i(\alpha_i + \lambda_i)\right] + \phi_i(\alpha_i + \lambda_i)\right\}, \tag{6.3.6}$$

where $\alpha_i = x_i'\beta/\sigma$, $\Phi_i = \Phi(\alpha_i)$ and $\lambda_i = \phi_i/\Phi_i$.
Taking the two parts separately, this result
decomposes the slope vector into

$$\frac{\partial E[y_i \mid x_i]}{\partial x_i} = \mathrm{Prob}[y_i > 0]\,\frac{\partial E[y_i \mid x_i, y_i > 0]}{\partial x_i} + E[y_i \mid x_i, y_i > 0]\,\frac{\partial \mathrm{Prob}[y_i > 0]}{\partial x_i} \tag{6.3.7}$$

Thus, a change in $x_i$ has two effects: It affects the
conditional mean of $y_i^*$ in the positive part of the
distribution, and it affects the probability that the
observation will fall in that part of the distribution.
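The tobit partial effect with censoring at zero, and the fact that the McDonald and Moffitt bracket collapses to $\Phi(x_i'\beta/\sigma)$, can both be verified numerically. The sketch below treats the index $x_i'\beta$ as a scalar `xb`; the function names and numbers are hypothetical.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_mean(xb, sigma):
    """E[y | x] = Phi(a) * (xb + sigma * lambda), with a = xb / sigma."""
    a = xb / sigma
    lam = phi(a) / Phi(a)
    return Phi(a) * (xb + sigma * lam)

def scale_factor(xb, sigma):
    """Multiplier of beta in the partial effect: Phi(xb / sigma)."""
    return Phi(xb / sigma)

def mcdonald_moffitt_factor(xb, sigma):
    """Bracketed term of the decomposition, summing its two parts."""
    a = xb / sigma
    lam = phi(a) / Phi(a)
    return Phi(a) * (1.0 - lam * (a + lam)) + phi(a) * (a + lam)

xb, sigma, h = 0.7, 1.5, 1e-6     # assumed index value and scale
# Derivative of the conditional mean with respect to the index, numerically
numeric = (tobit_mean(xb + h, sigma) - tobit_mean(xb - h, sigma)) / (2 * h)
```

Since the derivative with respect to the index equals $\Phi(x_i'\beta/\sigma)$, the derivative with respect to $x_i$ is that factor times $\beta$.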
Estimation of The Tobit Model

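The log-likelihood for the censored regression model
is

$$\ln L = \sum_{y_i > 0} -\frac{1}{2}\left[\ln(2\pi) + \ln\sigma^2 + \frac{(y_i - x_i'\beta)^2}{\sigma^2}\right] + \sum_{y_i = 0} \ln\left[1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right]. \tag{6.3.8}$$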

This likelihood is a nonstandard type, because it is a
mixture of discrete and continuous distributions.
In a seminal paper, Amemiya (1973) showed that
proceeding in the usual fashion to maximize ln L
would produce an estimator with all the familiar
desirable properties attained by MLEs.
Olsen’s (1978) reparameterization simplifies things
considerably. With $\gamma = \beta/\sigma$ and $\theta = 1/\sigma$, the
log-likelihood is

$$\ln L = \sum_{y_i > 0} -\frac{1}{2}\left[\ln(2\pi) - \ln\theta^2 + (\theta y_i - x_i'\gamma)^2\right] + \sum_{y_i = 0} \ln\left[1 - \Phi(x_i'\gamma)\right]. \tag{6.3.9}$$

The Hessian is always negative definite, so Newton’s
method is simple to use and converges quickly.
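To see that Olsen's reparameterization leaves the likelihood unchanged, both forms of the tobit log-likelihood can be evaluated at corresponding parameter values. The sketch below works with the scalar index values directly; the data and function names are hypothetical.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(y, xb, sigma):
    """Log-likelihood in (beta, sigma); xb[i] holds the index x_i'beta."""
    ll = 0.0
    for yi, xbi in zip(y, xb):
        if yi > 0:   # nonlimit observation: normal density contribution
            ll += -0.5 * (math.log(2.0 * math.pi) + math.log(sigma ** 2)
                          + (yi - xbi) ** 2 / sigma ** 2)
        else:        # limit observation: probability of censoring
            ll += math.log(1.0 - Phi(xbi / sigma))
    return ll

def tobit_loglik_olsen(y, xg, theta):
    """Log-likelihood in (gamma, theta); xg[i] holds the index x_i'gamma."""
    ll = 0.0
    for yi, xgi in zip(y, xg):
        if yi > 0:
            ll += -0.5 * (math.log(2.0 * math.pi) - math.log(theta ** 2)
                          + (theta * yi - xgi) ** 2)
        else:
            ll += math.log(1.0 - Phi(xgi))
    return ll

y = [0.0, 1.2, 0.0, 2.5]          # assumed observations, censored at zero
xb = [-0.3, 0.8, 0.1, 1.9]        # assumed index values x_i'beta
sigma = 1.4
original = tobit_loglik(y, xb, sigma)
olsen = tobit_loglik_olsen(y, [v / sigma for v in xb], 1.0 / sigma)
```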
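The original parameters can be recovered using
$\sigma = 1/\theta$ and $\beta = \gamma/\theta$.
The asymptotic covariance matrix for these
estimates can be obtained from that for the
estimates of $[\gamma, \theta]$ using the delta method:

$$\text{Est. Asy. Var}[\hat{\beta}, \hat{\sigma}] = \hat{J}\,\text{Asy. Var}[\hat{\gamma}, \hat{\theta}]\,\hat{J}',$$

$$\text{where} \quad J = \begin{bmatrix} \partial\beta/\partial\gamma' & \partial\beta/\partial\theta \\ \partial\sigma/\partial\gamma' & \partial\sigma/\partial\theta \end{bmatrix} = \begin{bmatrix} (1/\theta)\,I & -(1/\theta^2)\,\gamma \\ 0' & -1/\theta^2 \end{bmatrix}.$$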
Almost without exception, it is found that the OLS
estimates are smaller in absolute value than the
MLEs.

A striking empirical regularity is that the maximum

likelihood estimates can often be approximated by
dividing the OLS estimates by the proportion of
nonlimit observations in the sample.

6.3.2 Truncated Regression Model

We now assume that $\mu_i = x_i'\beta$ is the deterministic
part of the classical regression model. Then

$$y_i = x_i'\beta + e_i,$$

where $e_i \mid x_i \sim N(0, \sigma^2)$,
so that $y_i \mid x_i \sim N(x_i'\beta, \sigma^2)$.

We are interested in the distribution of $y_i$ given that
$y_i$ is greater than the truncation point c. This is the
result described in Theorem 6.2. It follows that

$$E[y_i \mid y_i > c] = x_i'\beta + \sigma\,\frac{\phi\left[(c - x_i'\beta)/\sigma\right]}{1 - \Phi\left[(c - x_i'\beta)/\sigma\right]} \tag{6.3.10}$$

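The conditional mean is therefore a nonlinear
function of c, $\sigma$, x, and $\beta$.
The partial effects in this model in the
subpopulation can be obtained by writing

$$E[y_i \mid y_i > c] = x_i'\beta + \sigma\lambda(\alpha_i) \tag{6.3.11}$$

where now $\alpha_i = (c - x_i'\beta)/\sigma$.
Let $\lambda_i = \lambda(\alpha_i)$ and $\delta_i = \delta(\alpha_i)$. Then

$$\begin{aligned}
\frac{\partial E[y_i \mid y_i > c]}{\partial x_i} &= \beta + \sigma\,\frac{d\lambda(\alpha_i)}{d\alpha_i}\,\frac{\partial\alpha_i}{\partial x_i} = \beta + \sigma\left(\lambda_i^2 - \alpha_i\lambda_i\right)\left(-\frac{\beta}{\sigma}\right) \\
&= \beta\left(1 - \lambda_i^2 + \alpha_i\lambda_i\right) = \beta(1 - \delta_i)
\end{aligned} \tag{6.3.12}$$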

Note the appearance of the scale factor $1 - \delta_i$ from
the truncated variance.
Because $1 - \delta_i$ is between zero and one, we
conclude that for every element of $x_i$, the marginal
effect is less than the corresponding coefficient.
There is a similar attenuation of the variance.
In the subpopulation $y_i > c$, the regression variance
is not $\sigma^2$ but

$$\mathrm{Var}[y_i \mid y_i > c] = \sigma^2(1 - \delta_i).$$
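The attenuation factor $1 - \delta_i$ in the truncated regression can be checked by differentiating the conditional mean numerically with respect to the index. This is only a sketch with a scalar index `xb`; the function names and parameter values are hypothetical.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_mean(xb, sigma, c):
    """E[y | y > c] = xb + sigma * lambda(alpha), alpha = (c - xb) / sigma."""
    alpha = (c - xb) / sigma
    lam = phi(alpha) / (1.0 - Phi(alpha))
    return xb + sigma * lam

def attenuation(xb, sigma, c):
    """Scale factor 1 - delta multiplying beta in the marginal effect."""
    alpha = (c - xb) / sigma
    lam = phi(alpha) / (1.0 - Phi(alpha))
    delta = lam * (lam - alpha)
    return 1.0 - delta

xb, sigma, c, h = 1.0, 2.0, 0.5, 1e-6    # assumed values
numeric = (truncated_mean(xb + h, sigma, c)
           - truncated_mean(xb - h, sigma, c)) / (2 * h)
factor = attenuation(xb, sigma, c)
```

The same factor multiplies $\sigma^2$ in the conditional variance, so the slope and the variance are attenuated together.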
The result in (6.3.11) is of interest if the analysis is
to be confined to the subpopulation.
If the study is intended to extend to the entire
population, however, then it is the coefficients $\beta$
that are actually of interest.
One’s first inclination might be to use OLS to
estimate the parameters of this regression model.
For the subpopulation from which the data are
drawn, we could write (6.3.10) in the form

$$y_i \mid y_i > c = E[y_i \mid y_i > c] + u_i = x_i'\beta + \sigma\lambda_i + u_i \tag{6.3.13}$$

where $u_i$ is $y_i$ minus its conditional expectation.
By construction, $u_i$ has a zero mean, but it is
heteroscedastic:

$$\mathrm{Var}[u_i] = \sigma^2\left(1 - \lambda_i^2 + \alpha_i\lambda_i\right) = \sigma^2(1 - \delta_i),$$

which is a function of $x_i$.
If we estimate (6.3.13) by ordinary least squares
regression of y on X, then we have omitted a
variable, the nonlinear term $\lambda_i$.
Without some knowledge of the distribution of x, it
is not possible to determine how serious the bias is
likely to be.
If $E[x \mid y]$ in the full population is a linear function of
y, then plim $\hat{\beta} = \beta\tau$ for some proportionality
constant $\tau$.
This result is consistent with the widely observed
proportionality relationship between least squares
estimates of this model and maximum likelihood
estimates.
In applications, it is usually found that, compared
with consistent maximum likelihood estimates, the
OLS estimates are biased toward zero.

6.3.3 Heckman’s Two Steps Sample Selection

6.3.3.1 Incidental Truncation in a Bivariate Distribution

Suppose that y and z have a bivariate distribution

with correlation ρ.
We are interested in the distribution of y given that
z exceeds a particular value.
Intuition suggests that if y and z are positively
correlated, then the truncation of z should push the
distribution of y to the right.
We are interested in (1) the form of the incidentally
truncated distribution and (2) the mean and
variance of the incidentally truncated random
variable.
Because it has dominated the empirical literature,
we will focus first on the bivariate normal
distribution.
The truncated joint density of y and z is

$$f(y, z \mid z > c) = \frac{f(y, z)}{\mathrm{Prob}(z > c)}$$

To obtain the incidentally truncated marginal
density for y, we would then integrate z out of this
expression.
The moments of the incidentally truncated normal
distribution are given in the following theorem.

Theorem 6.4: Moments of the Incidentally Truncated Bivariate
Normal Distribution

If y and z have a bivariate normal distribution with
means $\mu_y$ and $\mu_z$, standard deviations $\sigma_y$ and $\sigma_z$,
and correlation $\rho$, then

$$E[y \mid z > c] = \mu_y + \rho\sigma_y\lambda(\alpha_z),$$

$$\mathrm{Var}[y \mid z > c] = \sigma_y^2\left[1 - \rho^2\delta(\alpha_z)\right],$$

where $\alpha_z = (c - \mu_z)/\sigma_z$, $\lambda(\alpha_z) = \phi(\alpha_z)/\left[1 - \Phi(\alpha_z)\right]$,
and $\delta(\alpha_z) = \lambda(\alpha_z)\left[\lambda(\alpha_z) - \alpha_z\right]$.
As expected, the truncated mean is pushed in the
direction of the correlation if the truncation is from
below and in the opposite direction if it is from
above.
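The moments in Theorem 6.4 can be compared with a small simulation in which bivariate normal pairs are drawn and only those with $z > c$ are retained. The sketch below is illustrative: the parameter values, sample size, and seed are arbitrary assumptions, and the agreement is only up to simulation error.

```python
import math
import random

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def incidental_moments(mu_y, sigma_y, mu_z, sigma_z, rho, c):
    """Theoretical mean and variance of y given z > c."""
    alpha_z = (c - mu_z) / sigma_z
    lam = phi(alpha_z) / (1.0 - Phi(alpha_z))
    delta = lam * (lam - alpha_z)
    return mu_y + rho * sigma_y * lam, sigma_y ** 2 * (1.0 - rho ** 2 * delta)

def simulated_moments(mu_y, sigma_y, mu_z, sigma_z, rho, c, n, seed):
    """Sample mean and variance of y among simulated draws with z > c."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z = mu_z + sigma_z * u
        # y is built so that corr(y, z) = rho
        y = mu_y + sigma_y * (rho * u + math.sqrt(1.0 - rho ** 2) * v)
        if z > c:
            kept.append(y)
    m = sum(kept) / len(kept)
    var = sum((t - m) ** 2 for t in kept) / len(kept)
    return m, var

args = dict(mu_y=1.0, sigma_y=2.0, mu_z=0.0, sigma_z=1.0, rho=0.6, c=0.5)
theory_mean, theory_var = incidental_moments(**args)
sim_mean, sim_var = simulated_moments(n=200_000, seed=12345, **args)
```

With a positive correlation and truncation from below, the theoretical mean lies above $\mu_y$ and the variance lies below $\sigma_y^2$.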

6.3.3.2 Regression in a Model of Selection

To motivate a regression model that corresponds to

the results in Theorem 6.4, we consider the
following example.

Example
A Model of Labor Supply: A model of female labor supply that
has been examined in many studies consists of two equations:

1. Wage equation: The difference between a person’s market
wage, what she could command in the labor market, and her
reservation wage, the wage rate necessary to make her choose
to participate in the labor market, is a function of
characteristics such as age and education, as well as number
of children and where a person lives.

2. Hours equation: The desired number of labor hours supplied
depends on the wage, home characteristics such as whether
there are small children present, marital status, and so on.

The problem of truncation surfaces when we consider that the
second equation describes desired hours, but an actual figure
is observed only if the individual is working. (In most such
studies, only a participation equation, that is, whether hours
are positive or zero, is observable.)
We infer from this that the market wage exceeds the
reservation wage. Thus, the hours variable in the second
equation is incidentally truncated.

To put the preceding examples in a general
framework, let the equation that determines the
sample selection be

$$z_i^* = w_i'\gamma + u_i,$$

and let the equation of primary interest be

$$y_i = x_i'\beta + e_i.$$

The sampling rule is that $y_i$ is observed only when
$z_i^*$ is greater than zero.
Suppose as well that $e_i$ and $u_i$ have a bivariate
normal distribution with zero means and correlation
$\rho$.
Then we may insert these in Theorem 6.4 to obtain
the model that applies to the observations in our
sample:

$$\begin{aligned}
E[y_i \mid y_i \text{ is observed}] &= E[y_i \mid z_i^* > 0] \\
&= E[y_i \mid u_i > -w_i'\gamma] \\
&= x_i'\beta + E[e_i \mid u_i > -w_i'\gamma] \\
&= x_i'\beta + \rho\sigma_e\lambda_i(\alpha_u) \\
&= x_i'\beta + \beta_\lambda\lambda_i(\alpha_u),
\end{aligned}$$

where $\alpha_u = -w_i'\gamma/\sigma_u$ and $\lambda(\alpha_u) = \phi(w_i'\gamma/\sigma_u)/\Phi(w_i'\gamma/\sigma_u)$. So,

$$y_i \mid z_i^* > 0 = E[y_i \mid z_i^* > 0] + \upsilon_i = x_i'\beta + \beta_\lambda\lambda_i(\alpha_u) + \upsilon_i$$
Least squares regression using the observed
data (for instance, OLS regression of hours on its
determinants, using only data for women who are
working) produces inconsistent estimates of $\beta$.
Once again, we can view the problem as an omitted
variable.
Least squares regression of y on x and $\lambda$ would be a
consistent estimator, but if $\lambda$ is omitted, then the
specification error of an omitted variable is
committed.
Finally, note that the second part of Theorem 6.4
implies that even if $\lambda_i$ were observed, then least
squares would be inefficient. The disturbance $\upsilon_i$ is
heteroscedastic.
The marginal effect of the regressors on $y_i$ in the
observed sample consists of two components.
There is the direct effect on the mean of $y_i$, which is
$\beta$. In addition, for a particular independent variable,
if it appears in the probability that $z_i^*$ is positive,
then it will influence $y_i$ through its presence in $\lambda_i$.
The full effect of changes in a regressor that
appears in both $x_i$ and $w_i$ on y is

$$\frac{\partial E[y_i \mid z_i^* > 0]}{\partial x_{ik}} = \beta_k - \gamma_k\left(\frac{\rho\sigma_e}{\sigma_u}\right)\delta_i(\alpha_u)$$

where $\delta_i = \lambda_i^2 - \alpha_i\lambda_i$.
Suppose that $\rho$ is positive and $E[y_i]$ is greater when
$z_i^*$ is positive than when it is negative. Because
$0 < \delta_i < 1$, the additional term serves to reduce the
marginal effect.
The change in the probability affects the mean of $y_i$
in that the mean in the group $z_i^* > 0$ is higher.
In most cases, the selection variable $z^*$ is not
observed. Rather, we observe only its sign.
In the labor market participation example, we typically
observe only whether a woman is working or not
working. We can infer the sign of $z^*$, but not its
magnitude, from such information.
Because there is no information on the scale of $z^*$,
the disturbance variance in the selection equation
cannot be estimated.
Thus, we reformulate the model as follows:

selection mechanism: $z_i^* = w_i'\gamma + u_i$, $z_i = 1$ if $z_i^* > 0$ and 0 otherwise;

$$\mathrm{Prob}(z_i = 1 \mid w_i) = \Phi(w_i'\gamma); \quad \text{and} \quad \mathrm{Prob}(z_i = 0 \mid w_i) = 1 - \Phi(w_i'\gamma) \tag{6.3.14}$$

regression model: $y_i = x_i'\beta + e_i$, observed only if $z_i = 1$,

$$\begin{pmatrix} u_i \\ e_i \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma_e \\ \rho\sigma_e & \sigma_e^2 \end{pmatrix}\right]$$

Suppose that, as in many of these studies, $z_i$ and $w_i$
are observed for a random sample of individuals but
$y_i$ is observed only when $z_i = 1$.
This model is precisely the one we examined earlier,
with

$$E[y_i \mid z_i = 1, x_i, w_i] = x_i'\beta + \beta_\lambda\lambda(w_i'\gamma).$$

6.3.3.3 Two-Step and ML Estimation

The parameters of the sample selection model can
be estimated by maximum likelihood.
However, Heckman’s (1979) two-step estimation
procedure is usually used instead. Heckman’s
method is as follows:

1). Estimate the probit equation by maximum likelihood
to obtain estimates of $\gamma$. For each observation in
the selected sample, compute
$\hat{\lambda}_i = \phi(w_i'\hat{\gamma})/\Phi(w_i'\hat{\gamma})$ and $\hat{\delta}_i = \hat{\lambda}_i(\hat{\lambda}_i + w_i'\hat{\gamma})$.

2). Estimate $\beta$ and $\beta_\lambda = \rho\sigma_e$ by least squares
regression of y on x and $\hat{\lambda}$.
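The quantities computed at the end of the first step can be sketched as follows. The code assumes the probit indices $w_i'\hat{\gamma}$ have already been obtained from some probit routine; the function name and the index values are hypothetical, and the probit and least squares fits themselves are not shown.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def first_step_terms(index_values):
    """Return (lambda_hat_i, delta_hat_i) for each fitted index w_i'gamma_hat."""
    terms = []
    for wg in index_values:
        lam = phi(wg) / Phi(wg)        # inverse Mills ratio
        delta = lam * (lam + wg)       # delta_hat_i = lambda_hat_i (lambda_hat_i + w_i'gamma_hat)
        terms.append((lam, delta))
    return terms

# Assumed fitted indices for four selected observations
terms = first_step_terms([-1.0, 0.0, 0.5, 2.0])
lambda_hat = [t[0] for t in terms]     # regressor added in step 2
delta_hat = [t[1] for t in terms]
```

The list `lambda_hat` is the extra regressor used in the second step; `delta_hat` is kept for the variance calculations.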
It is possible also to construct consistent estimators
of the individual parameters $\rho$ and $\sigma_e$.
At each observation, the true conditional variance of
the disturbance would be

$$\sigma_i^2 = \sigma_e^2\left(1 - \rho^2\delta_i\right)$$

The average conditional variance for the sample
would converge to

$$\mathrm{plim}\,\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2 = \sigma_e^2\left(1 - \rho^2\bar{\delta}\right)$$

which is what is estimated by the least squares
residual variance $e'e/n$.

For the square of the coefficient on $\lambda$, we have

$$\mathrm{plim}\,\hat{\beta}_\lambda^2 = \rho^2\sigma_e^2,$$

whereas based on the probit results we have

$$\mathrm{plim}\,\frac{1}{n}\sum_{i=1}^{n}\hat{\delta}_i = \bar{\delta}.$$

We can then obtain a consistent estimator of $\sigma_e^2$
using

$$\hat{\sigma}_e^2 = \frac{1}{n}e'e + \hat{\bar{\delta}}\,\hat{\beta}_\lambda^2.$$

Finally, an estimator of $\rho^2$ is

$$\hat{\rho}^2 = \frac{\hat{\beta}_\lambda^2}{\hat{\sigma}_e^2}, \tag{6.3.15}$$

which provides a complete set of estimators of the
model’s parameters.
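The recovery of $\sigma_e^2$ and $\rho^2$ from the second-step output is simple arithmetic. The sketch below uses made-up inputs constructed so that the true values are $\sigma_e = 2$ and $\rho = 0.5$; the function name and the inputs are hypothetical.

```python
def recover_sigma2_rho2(resid_var, delta_bar, beta_lambda):
    """Return (sigma_e^2 estimate, rho^2 estimate).

    resid_var   : least squares residual variance e'e / n from step 2
    delta_bar   : sample average of the delta_hat_i from the probit step
    beta_lambda : estimated coefficient on lambda_hat in step 2
    """
    sigma2_hat = resid_var + delta_bar * beta_lambda ** 2
    rho2_hat = beta_lambda ** 2 / sigma2_hat
    return sigma2_hat, rho2_hat

# Inputs consistent with sigma_e = 2, rho = 0.5 and an average delta of 0.6:
# residual variance sigma_e^2 (1 - rho^2 * delta_bar) and beta_lambda = rho * sigma_e
true_sigma, true_rho, delta_bar = 2.0, 0.5, 0.6
resid_var = true_sigma ** 2 * (1.0 - true_rho ** 2 * delta_bar)
beta_lambda = true_rho * true_sigma
sigma2_hat, rho2_hat = recover_sigma2_rho2(resid_var, delta_bar, beta_lambda)
```

In finite samples nothing in this calculation forces the estimate of $\rho^2$ to be below one, so an out-of-range value is possible with real data.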

To test hypotheses, an estimate of the asymptotic
covariance matrix of $[\hat{\beta}', \hat{\beta}_\lambda]'$ is needed.
We have two problems to contend with.
First, we can see in Theorem 6.4 that the
disturbance term in

$$(y_i \mid z_i = 1, x_i, w_i) = x_i'\beta + \rho\sigma_e\lambda_i + \upsilon_i \tag{6.3.16}$$

is heteroscedastic;

$$\mathrm{Var}[\upsilon_i \mid z_i = 1, x_i, w_i] = \sigma_e^2\left(1 - \rho^2\delta_i\right)$$

Second, there are unknown parameters in $\lambda_i$.
Suppose that we assume for the moment that $\lambda_i$ and $\delta_i$ are
known (i.e., we do not have to estimate $\gamma$).
For convenience, let $x_i^* = [x_i, \lambda_i]$, and let $\hat{\beta}^*$ be the
least squares coefficient vector in the regression of y
on $x^*$ in the selected data.

Then, using the appropriate form of the variance of
ordinary least squares in a heteroscedastic model,
we would have to estimate

$$\begin{aligned}
\mathrm{Var}[\hat{\beta}^*] &= \sigma_e^2\left[X^{*\prime}X^*\right]^{-1}\left[\sum_{i=1}^{n}\left(1 - \rho^2\delta_i\right)x_i^* x_i^{*\prime}\right]\left[X^{*\prime}X^*\right]^{-1} \\
&= \sigma_e^2\left[X^{*\prime}X^*\right]^{-1}\left[X^{*\prime}\left(I - \rho^2\Delta\right)X^*\right]\left[X^{*\prime}X^*\right]^{-1}
\end{aligned}$$

where $I - \rho^2\Delta$ is a diagonal matrix with $(1 - \rho^2\delta_i)$
on the diagonal and $\hat{\beta}^* = [\hat{\beta}', \hat{\beta}_\lambda]'$.
Without any other complications, this result could
be computed fairly easily using $X^*$, the sample
estimates of $\sigma_e^2$ and $\rho^2$, and the assumed known
values of $\lambda_i$ and $\delta_i$.
The parameters in $\gamma$ do have to be estimated using
the probit equation. Rewrite (6.3.16) as

$$(y_i \mid z_i = 1, x_i, w_i) = x_i'\beta + \beta_\lambda\hat{\lambda}_i + \upsilon_i - \beta_\lambda\left(\hat{\lambda}_i - \lambda_i\right)$$

In this form, we see that in the preceding expression
we have ignored both an additional source of
variation in the compound disturbance and
correlation across observations; the same estimate
of $\gamma$ is used to compute $\hat{\lambda}_i$ for every observation.
Heckman has shown that the earlier covariance
matrix can be appropriately corrected by adding a
term inside the brackets,

$$Q = \hat{\rho}^2\left(X^{*\prime}\hat{\Delta}W\right)\text{Est. Asy. Var}[\hat{\gamma}]\left(W'\hat{\Delta}X^*\right) = \hat{\rho}^2\hat{F}\hat{V}\hat{F}'$$

where $\hat{V}$ = Est. Asy. Var$[\hat{\gamma}]$, the estimator of the
asymptotic covariance of the probit coefficients.

Any of the following three estimators may be used
to compute $\hat{V}$:

$$\left\{-\sum_{i=1}^{N} H_i(\hat{\gamma})\right\}^{-1}, \qquad \left\{\sum_{i=1}^{N} s_i(\hat{\gamma})\,s_i(\hat{\gamma})'\right\}^{-1}, \qquad \left\{\sum_{i=1}^{N} A(w_i, \hat{\gamma})\right\}^{-1}$$

where

$$s_i(\gamma) = \frac{\phi(w_i'\gamma)\,w_i\left[z_i - \Phi(w_i'\gamma)\right]}{\Phi(w_i'\gamma)\left[1 - \Phi(w_i'\gamma)\right]}$$

is the score vector of the conditional log likelihood
for observation i, and

$$-E[H_i(\gamma) \mid w_i] = \frac{\left[\phi(w_i'\gamma)\right]^2 w_i w_i'}{\Phi(w_i'\gamma)\left[1 - \Phi(w_i'\gamma)\right]} \equiv A(w_i, \gamma)$$

is the negative of the expected value of the Hessian matrix
conditional on $w_i$.

The complete expression is

$$\text{Est. Asy. Var}[\hat{\beta}^*] = \hat{\sigma}_e^2\left[X^{*\prime}X^*\right]^{-1}\left[X^{*\prime}\left(I - \hat{\rho}^2\hat{\Delta}\right)X^* + \hat{Q}\right]\left[X^{*\prime}X^*\right]^{-1}$$

The sample selection model can also be estimated
by maximum likelihood.

The full log-likelihood function for the data is built
up from

Prob(selection) × density | selection for observations with $z_i = 1$,

and Prob(nonselection) for observations with $z_i = 0$.

Combining the parts produces the full log-likelihood
function,

$$\ln L = \sum_{z_i = 1} \ln\left[\frac{\exp\left(-\tfrac{1}{2}e_i^2/\sigma_e^2\right)}{\sigma_e\sqrt{2\pi}}\,\Phi\left(\frac{\rho e_i/\sigma_e + w_i'\gamma}{\sqrt{1 - \rho^2}}\right)\right] + \sum_{z_i = 0} \ln\left[1 - \Phi(w_i'\gamma)\right],$$

where $e_i = y_i - x_i'\beta$.

Note: The FIML estimator with its assumption of bivariate

normality is not less robust than the two-step
estimator because the latter also requires bivariate
normality to form the conditional mean for the
regression.

2 pages
AWWA C207 Flange Specifications
100% (1)
AWWA C207 Flange Specifications
14 pages
Unfinished Actions Finished Actions: (For Something That Started in
No ratings yet
Unfinished Actions Finished Actions: (For Something That Started in
3 pages
CBD Potency Analysis Report - Carolinas
No ratings yet
CBD Potency Analysis Report - Carolinas
1 page
SAP SuccessFactors Salary Planning Guide
No ratings yet
SAP SuccessFactors Salary Planning Guide
56 pages
Uber's Marketing and Business Strategies
No ratings yet
Uber's Marketing and Business Strategies
4 pages
Hal Leonard Music Arrangements Collection
100% (8)
Hal Leonard Music Arrangements Collection
47 pages
Buglo Playground Specifications and Safety
No ratings yet
Buglo Playground Specifications and Safety
2 pages
CFA Program
No ratings yet
CFA Program
95 pages
Voynich Manuscript: An Elegant Enigma
No ratings yet
Voynich Manuscript: An Elegant Enigma
163 pages
Linear Regression Indicator in Pine Script
No ratings yet
Linear Regression Indicator in Pine Script
2 pages
Past Tense Exercises for Art & Music
100% (1)
Past Tense Exercises for Art & Music
4 pages
AIN2601 Practical Accounting Overview
No ratings yet
AIN2601 Practical Accounting Overview
134 pages
Green Tea Coconut Raspberry Recipe
No ratings yet
Green Tea Coconut Raspberry Recipe
4 pages
Zigzag Theory for Laminated Plates
No ratings yet
Zigzag Theory for Laminated Plates
63 pages
St. Elizabeth Kamuthi Exam Questions
No ratings yet
St. Elizabeth Kamuthi Exam Questions
10 pages
HUL's Project Shakti: Transformational Change
No ratings yet
HUL's Project Shakti: Transformational Change
24 pages
A5 and Maverick Supplement Overview
No ratings yet
A5 and Maverick Supplement Overview
6 pages
High Power Current Sense Resistor Specs
No ratings yet
High Power Current Sense Resistor Specs
4 pages
My Pet - Unit 4: Material
No ratings yet
My Pet - Unit 4: Material
14 pages
APL Valve Position Monitor Catalogue
No ratings yet
APL Valve Position Monitor Catalogue
6 pages
Essential NBME Exam Study Questions
No ratings yet
Essential NBME Exam Study Questions
3 pages