Chapter 4:
Regression Analysis with Qualitative Data:
Binary (or Dummy) Variables
4.1. Describing Qualitative Information
In Econometrics I, both the dependent and independent variables
in our multiple regression models were quantitative in nature
(e.g., hourly wage rate, years of education, GDP, prices, and
costs).
However, some variables are essentially qualitative, or nominal
scale, in nature, such as sex, race, color, religion, industry of a
firm (manufacturing, retail, etc.), and region in Ethiopia.
For example, holding all other factors constant, female workers
are found to earn less than their male counterparts.
Since such variables usually indicate the presence or absence of a
"quality" or attribute, such as male or female, black or
white, or college graduate or not, they are essentially nominal
scale variables.
Cont’d
One way we could "quantify" such attributes is by constructing
artificial variables that take on values of 1 or 0, with 1 indicating
the presence (or possession) of the attribute and 0 indicating its
absence.
For example 1 may indicate that a person is a female and 0 may designate
a male; or 1 may indicate that a person is a college graduate, and 0 that
the person is not, and so on.
Variables that assume such 0 and 1 values are called dummy variables.
Such variables are thus essentially a device to classify data into mutually
exclusive categories such as male or female.
Dummy variables can be incorporated in regression models just as easily
as quantitative variables.
As a matter of fact, a regression model may contain regressors that are all
exclusively dummy, or qualitative, in nature.
Cont’d
Note that although they are easy to incorporate in regression
models, one must use dummy variables carefully.
In particular, consider the following:
1. When we have a dummy variable for each category or group
and also an intercept in our model, we have a case of perfect
collinearity, that is, an exact linear relationship among the
independent variables: the sum of all the dummy variables equals
one, the same as the intercept column.
In this case, if a qualitative variable has m categories, introduce
only (m − 1) dummy variables (see the Python sketch after this list).
Otherwise we fall into what is known as the dummy variable
trap, that is, the situation of perfect collinearity or perfect
multicollinearity.
Cont’d
This rule also applies if we have more than one qualitative
variable in the model.
For each qualitative regressor the number of dummy variables
introduced must be one less than the categories of that variable.
2. The category for which no dummy variable is assigned is known
as the base, benchmark, control, comparison, reference, or
omitted category and all comparisons are made in relation to the
benchmark category.
This is the one that is omitted and against which the other
dummy variables are assessed.
3. The intercept value (β1) represents the mean value of the
benchmark category.
Cont’d
4. The coefficients attached to dummy variables are known as the
differential intercept coefficients because they tell by how
much the value of the intercept that receives the value of 1 differs
from the intercept coefficient of the benchmark category.
5. If a qualitative variable has more than one category, the choice of
the benchmark category is strictly up to the researcher.
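As a quick illustration of points 1 and 2, the following minimal sketch (assuming the pandas library; the region variable and its values are hypothetical) shows how keeping only m − 1 dummies avoids the dummy variable trap:

import pandas as pd

df = pd.DataFrame({"region": ["Addis Ababa", "Oromia", "Amhara",
                              "Oromia", "Addis Ababa"]})

# Full set of m dummies: every row sums to 1, so together with an
# intercept column they are perfectly collinear.
full = pd.get_dummies(df["region"])
print(full.sum(axis=1))

# m - 1 dummies: the dropped category becomes the base (benchmark) group.
safe = pd.get_dummies(df["region"], drop_first=True)
print(safe.head())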
4.2. Dummy as Independent Variables
4.2.1. A Single Dummy Independent Variable
Consider the simple model of hourly wage determination:
wage = β1 + δD + β2 edu + εi
In our model only two observed factors affect the wage rate: gender
and education.
Since D = 1 when the person is female and D = 0 when the person is
male, the parameter δ has the following interpretation: δ is the
difference in hourly wage between females and males, given the same
amount of education.
Thus, the coefficient δ determines whether there is
discrimination against women: if δ < 0, then, for the same level of
other factors, women earn less than men on average.
Cont’d
Note that :-
1. In our model the base group is male (D = 0), and hence the
interpretation of the coefficient on the dummy is made against the
base group.
If the coefficient is less than zero, the females are paid less
compared to their male counterparts for the same level of
education.
But if its coefficient is positive, females are paid more compared
to males.
2. In any application, it does not matter how we choose the base
group; only the interpretation of the coefficients changes accordingly.
Cont’d
Some researchers prefer to drop the overall intercept in the model
and to include dummy variables for each group.
The equation would then be wage = γ1M + γ2F + β2 edu + εi (with M
and F dummies for male and female), where the intercept for men is γ1
and the intercept for women is γ2.
There is no dummy variable trap in this case because we do not
have an overall intercept.
However, this formulation has little advantage, since testing for a
difference in the intercepts is more difficult, and there is no
generally agreed upon way to compute R-squared in regressions
without an intercept.
Therefore, we will always include an overall intercept for the
base group.
Cont’d
Question: is the difference between female and male earnings
statistically significant, or is it due to chance?
We need to test that!
In general, suppose the simple linear regression model takes the form:
Yi = β1 + β2Di + β3Xi + εi
When D = 1 the model becomes: Yi = (β1 + β2) + β3Xi + εi
When D = 0 the model becomes: Yi = β1 + β3Xi + εi
Thus, given the zero mean assumption (i.e., E(εi) = 0), the mean of Y
is: E(Y|D = 1, X) = (β1 + β2) + β3X when D = 1, and
E(Y|D = 0, X) = β1 + β3X when D = 0.
Note that both means have the same slope (β3) but they differ in
their intercepts.
Cont’d
Given the assumption of classical linear regression model, a model
with one or more dummy variables can be estimated using the
OLS estimation method.
Once the model is estimated, we have to test whether the
coefficients of the dummies are statistically significant or not.
Suppose we have the model with one dummy variable:
Yi = β1 + β2Di + β3Xi + εi
Now, test the significance of β2. That is, H0: β2 = 0 against H1: β2 ≠ 0.
We can test this using the usual t-test: t = β̂2/se(β̂2)
Cont’d
Decision Rule:
Reject the null hypothesis if |t| exceeds the critical value t(α/2, n − k).
Rejection means that the presence of the attribute is statistically
significant.
Example: the estimated wage equation is:
ŵage = −1.57 − 1.81 D + 0.572 educ + 0.025 exper + 0.141 tenure
where D = 1 if female.
The negative intercept (the mean wage for men with educ = exper =
tenure = 0, in this case) is meaningless.
The coefficient on D is interesting, because it measures the average
difference in hourly wage between a woman and a man, given the
same levels of educ, exper, and tenure.
Cont’d
If we take woman and man with same levels of education, experience,
and tenure, woman earns, on average, $1.81 less per hour than the man.
It is important to remember that, because we have performed multiple
regression and controlled for educ, exper, and tenure, the $1.81 wage
differential cannot be explained by different average levels of education,
experience, or tenure between men and women.
We can conclude that the differential of $1.81 is due to gender or to
factors associated with gender that we have not controlled for in the
regression.
Is this wage differential statistically significant?
The usual t-test is given by: t = δ̂/se(δ̂).
Using the rule of thumb, since |t| > 2 we reject the null hypothesis
and hence the wage differential is statistically significant.
Cont’d
Now, suppose all non-dummy explanatory variables are dropped
from our model.
Then the result becomes: ŵage = 7.10 − 2.51 D
where D = 1 implies female and D = 0 means male.
Interpretations of OLS estimates:
The intercept is the average wage for men in the sample (when D
= 0). Thus, on average, males earn $7.10 per hour.
The coefficient on D is the difference in the average wage between
females and males. Thus, the average wage for females in the
sample is 7.10 - 2.51 = 4.59, or $4.59 per hour.
Cont’d
Comparing the mean wage of males and females, the mean wage rate
of males is higher by $2.51 per hour.
Generally, simple regression on a constant and a dummy variable is a
straightforward way to compare the means of two groups.
Since t = -8.37, the difference is statistically significant.
For the usual t test to be valid, we must assume that the
homoskedasticity assumption holds, which means that the population
variance in wages for men is the same as that for women.
The estimated wage differential between men and women is larger in
simple regression model than in multiple regression model because
simple regression model does not control for differences in
education, experience, and the like.
The multiple regression model gives a more reliable estimate of the
ceteris paribus gender wage gap; it still indicates a very large differential.
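The following minimal sketch (simulated data with hypothetical variable names; the values 7.10 and -2.51 merely mimic the example above) verifies that OLS on a constant and a dummy reproduces the two group means, with the t-statistic on the dummy testing their difference:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
female = rng.integers(0, 2, n)                      # 1 = female, 0 = male
wage = 7.10 - 2.51 * female + rng.normal(0, 2, n)   # illustrative numbers

res = sm.OLS(wage, sm.add_constant(female)).fit()
print(res.params)    # const ~ mean wage for men; slope ~ difference in means
print(res.tvalues)   # t on the dummy tests whether the differential is real

# Check: OLS reproduces the raw group means.
print(wage[female == 0].mean(), wage[female == 1].mean())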
4.2.2. Multiple Dummy Variables Regression Models
Suppose we have several dummy explanatory variables.
For simplicity, let Y be the monthly salaries of public school teachers
in Addis Ababa, Amhara, and Oromia (three regions in Ethiopia).
Let D2 = 1 if the region is Oromia and 0 otherwise, and D3 = 1 if the
region is Amhara and 0 otherwise. Then Addis Ababa is the base group.
Then the multiple linear regression model (with all independent
variables being dummy variables) is given by:
Yi = β1 + β2D2i + β3D3i + ui .......... (1)
where Yi = monthly salary and ui = the error term.
Cont’d
Assuming that the error term satisfies the usual OLS assumptions,
taking expectations of (1) on both sides, we obtain:
Mean salary of public school teachers in Oromia region:
E(Y |D2 = 1, D3 = 0) = β1 + β2
Mean salary of public school teachers in the Amhara region is:
E(Y |D2 = 0, D3 = 1) = β1 + β3
mean salary of teachers in the Addis Ababa region is given by:
E(Y |D2 = 0, D3 = 0) = β1
In other words, mean salary of public school teachers in Addis
Ababa region is given by intercept β1.
In multiple regression “slope” coefficients β2 and β3 tell by how
much mean salaries of teachers in Oromia region and in Amhara
region differ from mean salary of teachers in Addis Ababa region.
Cont’d
But are these differences statistically significant?
Let the results based on our multiple regression model be as follows:
Ŷi = 26,158.62 − 1,734.473 D2i − 3,264.615 D3i
se = (1128.523) (1435.953) (1499.615)
t = (23.1759) (−1.2078) (−2.1776)
p = (0.0000)* (0.2330)* (0.0349)*
where * indicates the p-values.
As these regression results show, the mean salary of teachers in Addis
Ababa is about Birr 26,158; that of teachers in the Oromia region is
lower by about Birr 1,734; and that of teachers in the Amhara region is
lower by about Birr 3,265.
The actual mean salaries in two regions can be easily obtained by
adding these differential salaries to mean salary of teachers in
Addis Ababa region.
Cont’d
Thus, mean salary in Oromia region is Birr 24,424 (=26,158 –
1,734) and mean salary in Amhara region is Birr 22,893 (26,158 –
3,265).
However, how do we know that these mean salaries are statistically
different from the mean salary of teachers in the Addis Ababa region,
the comparison category?
All we have to do is to find out if each of “slope” coefficients is
statistically significant.
As can be seen from this regression, estimated slope coefficient for
Oromia region is not statistically significant, as its p value is
about 23%, whereas that of Amhara region is statistically
significant, as p value is only about 3.5%.
Cont’d
Therefore, the overall conclusion is that statistically mean
salaries of public school teachers in Addis Ababa region and
Oromia region are about the same but mean salary of teachers in
Amhara region is statistically significantly lower by about Birr
3,265.
Note that dummy variables will simply point out the
differences, if they exist.
However, they do not suggest reasons for differences.
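A minimal sketch of the region example (simulated salaries and hypothetical names, using statsmodels' formula interface): C(region, Treatment(...)) creates the m − 1 = 2 dummies with Addis Ababa as the base category, so the intercept estimates the base-group mean and the two slopes estimate the differentials:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
regions = rng.choice(["Addis Ababa", "Oromia", "Amhara"], size=300)
means = {"Addis Ababa": 26158, "Oromia": 24424, "Amhara": 22893}
salary = np.array([means[r] for r in regions]) + rng.normal(0, 4000, 300)
df = pd.DataFrame({"salary": salary, "region": regions})

# Treatment coding with Addis Ababa as the reference (base) level.
res = smf.ols("salary ~ C(region, Treatment('Addis Ababa'))", df).fit()
print(res.summary())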
4.2.3. Interactions among Dummy Variables
Consider the following wage model with two dummy independent
variables (D2 = 1 if sex is female and 0 otherwise, D3 = 1 if race is
nonwhite and 0 otherwise):
Yi = α1 + α2D2i + α3D3i + βedu_i + ui
where: Y = hourly wage in dollars
edu = education (years of schooling)
D2 = 1 if female, 0 otherwise
D3 = 1 if nonwhite, 0 otherwise
In this model gender and race are qualitative regressors and
education is a quantitative regressor.
Implicitly, this model assumes that the differential effect of the gender
dummy (D2) is constant across the two categories of race and that the
differential effect of the race dummy (D3) is constant across the two genders.
Cont’d
That is to say, if the mean salary is higher for males than for
females, this is so whether they are nonwhite or not.
Likewise, if, say, nonwhites have lower mean wages, this is so
whether they are female or male.
In many applications such an assumption may be unsound.
A nonwhite female may earn lower wages than a nonwhite male.
In other words, there may be interaction between two qualitative
variables D2 and D3.
Therefore, their effect on mean Y may not be simply additive as in the
equation above; they may have an interactive (multiplicative) effect,
as in the following model:
Yi = α1 + α2D2i + α3D3i + α4(D2iD3i) + βedu_i + ui
Cont’d
Assuming that the error term has zero mean (i.e., E(ui) = 0), then:
E(Y|D2 = 1, D3 = 1, edu) = (α1 + α2 + α3 + α4) + βedu
This is the mean hourly wage function for female nonwhite workers.
Note that:
α2 = differential effect of being female
α3 = differential effect of being nonwhite
α4 = differential effect of being female and nonwhite (the interaction)
The mean hourly wage of female nonwhite workers differs from that of
white males by α2 + α3 + α4, of which α4 is the part due to the interaction.
cont’d
If, for instance, all three differential dummy coefficients are
negative, this would imply that female nonwhite workers earn
much lower mean hourly wages than the base category, which in the
present example is white males.
Numerical example on average hourly earnings in relation to
education, gender, and race:
The estimated differential intercept coefficients are −2.3605 (female)
and −1.7327 (nonwhite), with education entering positively.
Now test the statistical significance of the differential intercept
coefficients.
The t-values indicate that both differential intercept coefficients are
statistically significantly different from zero.
Cont’d
Our estimation results show that, ceteris paribus, the average hourly
earnings of females are lower by about Birr 2.36 than those of their
male counterparts, and the average hourly earnings of nonwhite
workers are lower by about Birr 1.73 than those of their white
counterparts.
Now consider the case of interaction of the dummy variables: the
estimated coefficients are −2.3605 (female), −1.7327 (nonwhite), and
2.1289 (female × nonwhite).
The two additive dummies are still statistically significant, but the
interactive dummy is not at the conventional 5% level; the actual
p-value of the interaction dummy is about 8%, so it is statistically
significant only at the 10% level of significance.
Cont’d
Interpretation: holding the level of education constant, adding the
three dummy coefficients gives −1.9643 (= −2.3605 − 1.7327 + 2.1289).
That is, the mean hourly wage of nonwhite female workers is lower
than that of white males by about Birr 1.96, which lies between
−2.3605 (the gender difference alone) and −1.7327 (the race
difference alone).
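A minimal sketch of the interactive model (simulated data; the names and coefficients are illustrative only): in a statsmodels formula, female*nonwhite expands to female + nonwhite + female:nonwhite, so the coefficient on female:nonwhite plays the role of α4:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "nonwhite": rng.integers(0, 2, n),
    "edu": rng.integers(8, 18, n),
})
df["wage"] = (10 - 2.36 * df["female"] - 1.73 * df["nonwhite"]
              + 2.13 * df["female"] * df["nonwhite"]
              + 0.80 * df["edu"] + rng.normal(0, 2, n))

res = smf.ols("wage ~ female*nonwhite + edu", df).fit()
print(res.params)   # 'female:nonwhite' is the interaction (alpha4)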
4.3 Dummy as Dependent Variable
So far, we considered dummy variables as right hand side or
independent variables.
In all our models up until now, dependent variable y has had
quantitative meaning (for example, y is Birr amount).
What happens if we want to use the multiple regression model to
explain a qualitative event, such as:
a) participating in the labor force or not
b) being willing to pay for improved environmental quality or not
c) using contraceptives or not
d) voting in a given election or not, etc.?
In this case, our dependent variable takes on only two values:
zero and one (i.e., it is dummy variable).
In other words, regressand is binary, or dichotomous, variable.
Cont’d
For instance, if our dependent variable is decision to participate
in labor force, the response variable is 1= participate in labor
force and 0=not participate in labor force.
Such binary variables can be analyzed with general probability
models, which may be binomial or multinomial.
We begin our study of qualitative response models with case of
binary choice model (where dependent variable is binary which
assumes a value 1 or 0).
There are three approaches to develop model for binary
(qualitative) response regression:
1. Linear Probability Model (LPM)
2. Logit Model
3. Probit model
Linear Probability Model (LPM)
In such models we have the important equation:
P(y = 1|x) = β0 + β1x1 + β2x2 + ... + βkxk
which says that the probability of success, that is, P(y = 1|x), is a
linear function of the explanatory variables.
P(y = 1|x) is also called the response probability.
This model is an example of a binary response model, and since
probabilities must sum to one, the probability of failure,
P(y = 0|x) = 1 − P(y = 1|x), is also a linear function of the
explanatory variables.
cont’d
The above model with a binary dependent variable is called the
linear probability model (LPM).
It is linear because the response probability is linear in the
parameters βj.
In the LPM, the coefficient βj measures the change in the probability
of success, P(y = 1|x), when xj changes, holding other factors fixed:
ΔP(y = 1|x) = βjΔxj
Given this, the mechanics of OLS can be used to estimate the model
the same as before, because the model is linear.
cont’d
Given a random sample with k parameters and N observations,
consider the following linear regression model:
y = β1 + β2x2 + ... + βkxk + µ
As it is a probability model, in order to interpret the results in terms
of probability we take expectations on both sides of the equation.
If we estimate the equation, the predicted equation is:
ŷ = β̂1 + β̂2x2 + ... + β̂kxk
The slope coefficient on x2 measures the predicted change in the
probability of success when x2 increases by one unit.
However, in order to correctly interpret linear probability model,
we must know what constitutes a “success.”
Thus, it is good idea to give dependent variable a name that
describes event y = 1.
cont’d
Since E(µ) = 0, E(y|x) = β1 + β2x2 + … + βkxk.
But from probability theory,
E(y|x) = 0·P(y = 0|x) + 1·P(y = 1|x) = P(y = 1|x),
so we can write our model as
P(y = 1|x) = E(y|x) = β1 + β2x2 + … + βkxk.
So the interpretation of βj is the change in the probability of success
when xj changes by one unit.
The predicted y is the predicted probability of success.
cont’d
Now, if Pi = probability that Yi = 1 (that is, the event occurs) and
(1 − Pi) = probability that Yi = 0 (that is, the event does not occur),
the variable Yi has the following (probability) distribution:
Yi = 1 with probability Pi, and Yi = 0 with probability (1 − Pi).
cont’d
That is, Yi follows the Bernoulli probability distribution. Now the
mathematical expectation of Y is given by:
E(Y) = 0·(1 − p) + 1·p = p
which is equal to the probability of success, or the conditional
expectation of Y given X (i.e., E(y|x) = β1 + β2x2 + β3x3 + ... + βkxk),
and the variance is given by:
var(Y) = p(1 − p)
In general, the expectation of a Bernoulli random variable is the
probability that the random variable equals 1.
LPM, Numerical Example
Suppose inlf (“in the labor force”) is a binary variable indicating
labor force participation by a married woman during a given year:
inlf =1 if the woman reports working for a wage outside the
home at some point during the year, and zero otherwise.
We assume that labor force participation depends on other sources
of income, including husband’s earnings (nwifeinc), years of
education (educ), past years of labor market experience (exper),
age, number of children less than six years old (kidslt6), and
number of kids between 6 and 18 years of age (kidsge6).
The estimated linear probability model, using a sample of 753 married
women of whom 428 were in the labor force, is:
inlf̂ = 0.586 − 0.0034 nwifeinc + 0.038 educ + 0.039 exper
− 0.00060 exper² − 0.016 age − 0.262 kidslt6 + 0.013 kidsge6
Cont’d
Using the usual t statistics, all variables of this estimated model
except kidsge6 are statistically significant, and all of the significant
variables have the effects we would expect based on economic
theory.
In order to interpret the estimates, we must remember that a change
in the independent variable changes the probability that inlf =1.
Cont’d
For example, the coefficient on edu means that keeping other
factors constant, another year of education increases probability of
labor force participation by .038.
If we take this equation literally, 10 more years of education
increases probability of being in labor force by .038(10) = 0.38,
which is a large increase in a probability.
The coefficient on nwifeinc implies that, if nwifeinc increases by 10,
the probability that the woman is in the labor force falls by 0.034.
Experience has been entered as a quadratic to allow the effect of past
experience on the participation probability to diminish as experience
rises.
Holding other factors fixed, the estimated change in the probability
for one more year of experience is approximated (using the power
rule of differentiation) as:
0.039 − 2(0.0006)exper = 0.039 − 0.0012 exper
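A minimal sketch of fitting an LPM (using the Spector grade dataset shipped with statsmodels rather than the labor force data above, purely for convenience): the LPM is just OLS on a 0/1 regressand, robust standard errors guard against the heteroskedasticity discussed below, and the fitted values show that predicted "probabilities" may leave [0, 1]:

import statsmodels.api as sm

data = sm.datasets.spector.load_pandas()
y, X = data.endog, sm.add_constant(data.exog)   # GRADE on GPA, TUCE, PSI

lpm = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust SEs
print(lpm.params)                        # each slope: change in P(y = 1)
print(lpm.fittedvalues.min(), lpm.fittedvalues.max())  # may exit [0, 1]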
Problems with LPM
1. Non-normality of the error term
Although OLS point estimation does not require normality of the
disturbance term, statistical inference (interval estimation and
hypothesis testing) requires that the disturbance term be normally
distributed.
However, the assumption of normality for the error term is not
tenable in LPMs because, like Yi, the disturbances take only two
values; that is, they also follow the Bernoulli distribution.
Given our LPM (in matrix form), the error term is εi = yi − xi′β.
Thus, εi = 1 − xi′β when the event occurs and εi = −xi′β when the
event does not occur. The probability distribution of εi is therefore:
εi = 1 − xi′β with probability Pi, and εi = −xi′β with probability (1 − Pi).
Cont’d
This implies that the disturbance terms cannot be assumed to be
normally distributed; they follow the Bernoulli distribution.
The violation of the normality assumption has serious effects on
statistical inferences.
2. The error term is heteroskedastic
In the LPM, the disturbance terms are not homoscedastic.
As statistical theory shows, for Bernoulli distribution theoretical
mean and variance are, respectively, p and p(1 − p), where p is
the probability of success (i.e., something happening), showing that
the variance is a function of the mean and hence the error variance
is heteroscedastic.
The variance of the error term is given by:
var(εi) = Pi(1 − Pi) = xi′β(1 − xi′β)
That is, the variance of the error term in the LPM is heteroscedastic.
Cont’d
Since Pi = E(Yi | Xi) = β1 + β2Xi, the variance of εi ultimately depends
on the values of X and hence is not homoscedastic.
In the presence of heteroscedasticity, the OLS estimators, although
unbiased, are not efficient; that is, they do not have minimum
variance.
Since the error term is heteroskedastic, we use generalized least
squares (GLS) for estimation.
Since the variance of εi depends on E(Yi | Xi), one way to resolve the
heteroscedasticity problem is to transform the model
yi = β1 + β2x2i + µi by dividing it through by √wi, where wi = Pi(1 − Pi):
yi/√wi = β1/√wi + β2x2i/√wi + µi/√wi
The error term of this transformed model is homoscedastic.
Cont’d
In practice wi is unknown, so to estimate it we can use the following
two-step procedure:
Step 1. Run the OLS regression yi = β1 + β2x2i + µi despite the
heteroscedasticity problem and obtain Ŷi, the estimate of the true
E(Yi | Xi). Then obtain ŵi = Ŷi(1 − Ŷi), the estimate of wi.
Step 2. Use the estimated ŵi to transform the data and estimate the
transformed equation by OLS (i.e., weighted least squares).
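A minimal sketch of this two-step procedure (Spector data again; clipping the fitted probabilities into (0, 1) is an ad hoc guard, not part of the textbook recipe):

import statsmodels.api as sm

data = sm.datasets.spector.load_pandas()
y, X = data.endog, sm.add_constant(data.exog)

# Step 1: OLS despite heteroskedasticity; form w_hat = Y_hat(1 - Y_hat).
p_hat = sm.OLS(y, X).fit().fittedvalues.clip(0.01, 0.99)
w_hat = p_hat * (1 - p_hat)

# Step 2: weighted least squares with weights 1/w_hat.
wls = sm.WLS(y, X, weights=1.0 / w_hat).fit()
print(wls.params)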
Cont’d
3. The possibility of obtaining probability values < 0 or > 1:
non-fulfillment of 0 ≤ E(y|x) ≤ 1 (i.e., 0 ≤ P ≤ 1).
The problem is that the LPM implicitly assumes that an increase in x
has a constant effect on the probability of success.
That is, as x increases, P(y = 1) continues to increase at a constant
rate.
However, since 0 ≤ P ≤ 1, a constant rate of increase is impossible
over the whole range of x.
To overcome this problem we consider nonlinear models, probit and
logit, which fall under the heading of limited dependent variable models.
Limited dependent variable (LDV) models
These models refer to a dependent variable whose range of values
is substantively restricted.
In the LPM, the problem is non-fulfillment of the restriction
0 ≤ P(y = 1|x) ≤ 1.
In order to overcome this problem, we need a model that will
produce predictions consistent with the underlying probability
theory for a given vector of regressors.
The probability model has two basic features/requirements:
1. As Xi increases, Pi = E(Y = 1|X) increases but never steps outside
the interval [0, 1].
2. The relationship between Pi and Xi is nonlinear, that is, "Pi
approaches 0 at slower and slower rates as Xi gets small (X
approaches −∞) and approaches 1 at slower and slower rates as Xi
gets very large (X approaches +∞)."
Cont'd
Symbolically,
lim(x→+∞) P(y = 1|x) = 1 and lim(x→−∞) P(y = 1|x) = 0
Graphically, the relationship traces out an S-shaped (sigmoid) curve.
Cont’d
In both cases (symbolically and graphically), the probability lies
between 0 and 1 as Xi varies over (−∞, +∞), and the graph is sigmoid
or S-shaped, which is the shape of the cumulative distribution
function (CDF) of a probability density function (PDF).
However, since the CDF of any PDF is S-shaped, the question is:
which CDF should we use?
The commonly used CDFs are the cumulative logistic distribution
and the cumulative normal distribution.
The cumulative logistic distribution gives rise to the logit model,
and the cumulative normal distribution gives rise to the probit
model.
The Logit Model
To explain the basic ideas behind the logit model, let us take the
simple example of house ownership, defined as y = 1 if the individual
owns a house and zero otherwise.
Supposing that the probability of owning a house is a function of
income, we can state the LPM as:
Pi = E(y = 1|Xi) = β1 + β2Xi
Now consider instead the following representation of house ownership,
given by the logistic function:
G(z) = exp(z)/[1 + exp(z)] = L(z)
This choice of G(z) is the CDF of a standard logistic random variable.
This case is referred to as the logit model, or sometimes as logistic
regression.
Both functions (normal and logistic) have similar shapes: they are
increasing in z, most quickly around 0.
The Logit Model
For ease of exposition we can rewrite the above function as:
Pi = 1/(1 + e^(−Zi)), where Zi = β1 + β2Xi
This equation is known as the logistic distribution function.
The Logit Model
Under this specification the probability Pi ranges between 0 and 1 as
Zi ranges from −∞ to +∞.
One problem of the LPM is thus resolved, but we have created another:
Pi is non-linearly related to Zi (i.e., to the explanatory variables)
and also to the parameters (the β's).
So the model is non-linear, and thus we cannot use the OLS
procedure to estimate the parameters.
However, the problem of non-linearity may be resolved through a log
transformation, as follows: if Pi is the probability of owning a house,
then (1 − Pi) is the probability of not owning a house. Thus, we have:
1 − Pi = 1/(1 + e^(Zi))
Cont’d
Therefore, we can write:
Pi/(1 − Pi) = e^(Zi)
The ratio Pi/(1 − Pi) is termed the odds ratio in favor of owning
a house. It is simply the ratio of the probability that a family will
own a house to the probability that it will not.
Now, if we take the natural log of this equation, we obtain:
Li = ln[Pi/(1 − Pi)] = Zi = β1 + β2Xi
Cont’d
Li, the log of the odds ratio, is not only linear in X but also linear
in the parameters.
Li is called the logit, hence the name logit model for models like
this.
Estimation of logit:- Method of the Maximum Likelihood
The logistic function was introduced in the 19th century (by
Verhulst, 1804–1849) for the description of population growth
(Cramer, 2003).
Now consider the binary model where pi is the probability that
yi = 1 and (1 − pi) is the probability that yi = 0.
Cont’d
In order to construct the likelihood function, we note that the
contribution of the i-th observation can be written as:
pi^yi (1 − pi)^(1 − yi)
In the case of random sampling, where all observations are sampled
independently (the binomial setting), the likelihood function is
simply the product of the individual contributions:
L(β) = Πi pi^yi (1 − pi)^(1 − yi)
Cont’d
The technique of maximum likelihood requires that we choose those
values of the parameters which maximize the likelihood function
given above.
In practice, we maximize the logarithm of the likelihood function:
log L(β) = Σi [yi log pi + (1 − yi) log(1 − pi)]
Cont’d
But we know that: pi = 1/(1 + e^(−(β1 + β2Xi)))
Now, substituting this into the last equation, we obtain the
log-likelihood as a function of the β's alone.
The resulting expression is non-linear in the parameters and hence
requires an iterative solution.
Thus, in the MLE method our objective is to maximize the logarithm
of the likelihood function, choosing the values of the unknown
parameters in such a manner that the probability of observing the
given Y's is as high as possible.
For this purpose, we differentiate the logarithm of the likelihood
function partially with respect to each unknown parameter, set the
resulting expressions to zero, and solve.
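A minimal sketch of this maximization (Spector data; scipy's general-purpose optimizer stands in for the iterative solution described above, and the answer is checked against statsmodels' own Logit):

import numpy as np
import statsmodels.api as sm
from scipy.optimize import minimize

data = sm.datasets.spector.load_pandas()
y = data.endog.values
X = sm.add_constant(data.exog).values

def neg_loglik(beta):
    z = X @ beta
    # log L = sum_i [y_i z_i - log(1 + e^{z_i})], a numerically stable
    # rewriting of sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
    return -np.sum(y * z - np.logaddexp(0.0, z))

res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print(res.x)                              # MLE of the betas
print(sm.Logit(y, X).fit(disp=0).params)  # matches statsmodels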
Cont’d
Important features of the logit model:
1. Although the probabilities lie between 0 and 1, the logits (L) are
not so bounded: they range from −∞ to +∞.
2. Although L is linear in X, the probabilities themselves are
not. This property is in contrast with the LPM, where the
probabilities increase linearly with X.
3. It is possible to add as many explanatory variables as may
be dictated by the underlying theory.
cont’d
4. Interpretation:
The interpretation of the logit model given above is as follows:
β2, the slope coefficient, measures the change in L for a unit
change in X; that is, it tells how the log-odds in favor of owning a
house change as income changes by one unit.
The intercept β1 is the value of the log-odds in favor of owning a
house if income is zero.
5. Given a certain level of income, say X*, if we want to estimate
not the odds in favor of owning a house but the probability of
owning a house itself, this can be done directly from the logistic
distribution function once the estimates of β1 and β2 are available,
as the sketch after this list shows.
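A minimal sketch of point 5 (the estimates and the income level X* below are purely hypothetical):

import math

b1_hat, b2_hat = -4.0, 0.1   # hypothetical logit estimates
x_star = 50.0                # hypothetical income level

z = b1_hat + b2_hat * x_star
p = 1.0 / (1.0 + math.exp(-z))   # logistic distribution function
print(p)                         # estimated probability of owning a house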
The Logit Model
However, the important question is: how do we estimate β1 and β2
in the first place? The maximum likelihood procedure outlined above
provides the answer.
Note that, whereas the LPM assumes that Pi is linearly related to Xi,
the logit model assumes that the log of the odds ratio is linearly
related to Xi.
Probit Model
The probit model is very similar to the logit model, and in most
applications the two give quite similar results.
The only difference lies in the distribution they assume: the logit
model uses the logistic cumulative distribution function, whereas the
probit model assumes the normal cumulative distribution function
(CDF).
Probit Model
The cumulative standard normal curve resembles the logistic curve,
but the probit has z-scores instead of logged odds along the
horizontal axis.
The curve approaches, but does not reach, 0 as z decreases toward
negative infinity, and it approaches, but does not reach, 1 as z
increases toward positive infinity.
Despite this difference, they give essentially equivalent results,
making the choice between them one of individual preferences
and computer program availability
The Probit Model
Based on the cumulative standard normal distribution, the cumulative
probability associated with any z-score equals:
P(Z ≤ z) = ∫(−∞ to z) (2π)^(−1/2) exp(−u²/2) du
where u is a standard normal variable with mean 0 and standard
deviation 1.
The formula merely says that the probability of the event equals the
area under the standard normal curve between negative infinity and z.
The larger the value of Z, the larger the cumulative probability.
Because of complexity of formula, however, computers do the
calculations.
The Probit Model
With the probit, the estimated coefficients show the change in z-score
units of the inverse cumulative standard normal distribution rather
than the change in probabilities.
Like logistic regression, probit analysis allows calculation of
changes in probabilities for specified values of independent
variables.
Again, however, the effects of dummy and continuous variables
on predicted probabilities depend on choice of starting point.
Changes in probabilities will be larger for points near the middle of
the curve than near the floor or ceiling.
The Probit Model
Recall the linear probability model, written as P(y = 1|x) = xβ.
An alternative is to model the probability as a function G(xβ),
where 0 < G(z) < 1.
One choice for G(z) is the standard normal cumulative distribution
function (CDF):
G(z) = Φ(z) = P(Z ≤ z) ≡ ∫(−∞ to z) φ(v)dv
where φ is the standard normal density, φ(v) = (2π)^(−1/2) exp(−v²/2).
Thus, the model expresses the probability that y = 1 as
P(Z ≤ xβ) = Φ(xβ).
This case is referred to as the probit model.
Since it is nonlinear model, it cannot be estimated by our usual
methods, so we use maximum likelihood estimation
The Probit Model
[Figures: the standard normal cumulative distribution function and
the standard normal probability density function]
Probits and Logits
Both probit and logit are nonlinear and require maximum
likelihood estimation
No real reason to prefer one over the other
Traditionally one saw more of the logit, mainly because the logistic
function leads to a more easily computed model.
Today, the probit is also easy to compute with standard packages,
and it has become more popular.
If we write out the functional forms of the CDFs:
1. Logistic distribution: G(z) = e^z/(1 + e^z)
Cont'd
2. Standard normal distribution: G(z) = Φ(z) = ∫(−∞ to z) (2π)^(−1/2) exp(−v²/2) dv
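A short sketch tabulating the two CDFs side by side (assuming numpy and scipy) makes the similarity of their shapes concrete:

import numpy as np
from scipy.stats import norm

z = np.linspace(-3, 3, 7)
logit_cdf = np.exp(z) / (1 + np.exp(z))   # logistic G(z)
probit_cdf = norm.cdf(z)                  # standard normal G(z)
for zi, l, p in zip(z, logit_cdf, probit_cdf):
    print(f"z = {zi:5.1f}   logit = {l:.3f}   probit = {p:.3f}")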
Interpretation
In general we care about the effect of x on P(y = 1|x), that is, about
∂p/∂x.
For the LPM, this is easily computed as the coefficient on x.
For the nonlinear probit and logit models, it is more complicated.
Using the chain rule:
∂p/∂xj = (dG/dz)(∂z/∂xj) = G′(xβ)βj
For probit: φ(xβ)βj
For logit: {exp(xβ)/[1 + exp(xβ)]²}βj
It is incorrect to simply compare the coefficients across the three
models (the coefficients differ among the models because of the
functional form of the CDF).
Interpretation (continued)
Interpretation of marginal effects:
An increase in x increases (decreases) the probability that y = 1 by
the marginal effect, expressed in percentage points.
For dummy independent variables, marginal effect is expressed
in comparison to the base category (x=0).
For continuous independent variables, marginal effect is
expressed for a one-unit change in x.
We can compare the sign and significance (based on the standard
t/z-test) of coefficients across models; to compare the magnitude of
effects, however, we need to calculate the derivatives, say at the
means.
Stata will do this for you, e.g., in the probit case.
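A minimal sketch of these derivative-based marginal effects in Python rather than Stata (Spector data; get_margeff is statsmodels' counterpart of Stata's margins, and at='mean' evaluates the derivatives at the regressor means, as suggested above):

import statsmodels.api as sm

data = sm.datasets.spector.load_pandas()
y, X = data.endog, sm.add_constant(data.exog)

probit = sm.Probit(y, X).fit(disp=0)
logit = sm.Logit(y, X).fit(disp=0)
print(probit.get_margeff(at="mean").summary())
print(logit.get_margeff(at="mean").summary())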
Example: probit_insurance.dta
Variable              LPM       Probit    Logit
Retired               0.04*     0.11*     0.19*
Age                  -0.002    -0.008    -0.01
Good health status    0.06*     0.19*     0.31*
HH income             0.0004*   0.001*    0.002*
Education years       0.02*     0.07*     0.11*
Married               0.12*     0.36*     0.57*
Hispanic             -0.12*    -0.46*    -0.81*
Constant              0.12     -1.06*    -1.71*
R2                    0.08      0.07      0.07

Interpretation of coefficients: retired individuals (in comparison to
non-retired individuals), individuals with good health status,
individuals with higher household income, individuals with more
education, and married individuals are more likely to have health
insurance, while Hispanics are less likely to have health insurance.
The LPM, probit, and logit coefficients differ by a scale factor, and
therefore we cannot compare the magnitudes of the coefficients
across models.
Example: probit_insurance.dta
Variable              LPM       Probit    Logit
Retired               0.04*     0.04*     0.04*
Age                  -0.002    -0.003    -0.003
Good health status    0.06*     0.07*     0.07*
HH income             0.0004*   0.0005*   0.0004*
Education years       0.02*     0.02*     0.02*
Married               0.12*     0.12*     0.13*
Hispanic             -0.12*    -0.16*    -0.16*

Interpretation of marginal effects: retired individuals are 4% more
likely to have insurance (in comparison with those who are not
retired); for each additional year of education, individuals are 2%
more likely to have insurance; Hispanics are 16% less likely to have
insurance than non-Hispanics.
Note that, unlike the coefficients, which differ across models, the
marginal effects are almost identical in the three models.
Testing Hypotheses and Measures of Goodness-of-fit
Testing Statistical Significance of Each Slope Coefficient
The procedure for testing the significance of each coefficient in an
LDV model is the same as in the usual OLS case.
However, the z-statistics in the Stata output play the role of the
t-statistics in OLS.
Note that this z has nothing to do with the z-score variable of the
probit index.
Testing Overall Statistical Significance of the Model:- Likelihood
Ratio (LR) Approach
The LR test is based on the same concept as the F test in a linear
model.
The LR test is based on the difference in the log-likelihood
functions for the unrestricted and restricted models.
Cont’d
Because MLE maximizes the log-likelihood function, dropping
variables generally leads to a smaller, or at least no larger,
log-likelihood. (This is similar to the fact that the R-squared never
increases when variables are dropped from a regression.)
The question is whether the fall in the log-likelihood is large
enough to conclude that the dropped variables are important.
We can make this decision once we have a test statistic and a set
of critical values.
The likelihood ratio statistic is twice the difference in the
log-likelihoods:
LR = 2(Lur − Lr)
Under the null hypothesis, LR is asymptotically distributed as
chi-square with degrees of freedom equal to the number of restrictions.
Cont’d
where Lur is log-likelihood value for the unrestricted model, and
Lr is the log likelihood value for the restricted model.
Because Lur is greater than or equal to Lr, LR is nonnegative and
usually strictly positive.
In computing LR statistic, it is important to know that Lur and Lr
can each be negative.
This does not change the way that LR is computed; we must
preserve the negative signs.
Contrary to linear regression model, there is no single measure
for the goodness-of-fit in binary response (choice) models.
Often, goodness-of-fit measures are implicitly or explicitly based
on comparison with a model that contains only a constant as
explanatory variable.
Cont’d
Let log L1 denote the maximum log-likelihood value of the model of
interest and let log L0 denote the maximum value of the log-likelihood
function when all parameters, except the intercept, are set to
zero. Clearly, log L1 ≥ log L0.
The larger the difference between the two log likelihoods values,
the more the extended model adds to the very restrictive model.
Indeed, formal likelihood ratio(LR) test can be based on the
difference between the two values.
A first goodness-of-fit measure is defined as:
pseudo-R² = 1 − 1/[1 + 2(log L1 − log L0)/N]
where N denotes the number of observations.
McFadden (1974) suggested an alternative measure:
pseudo-R² = 1 − log L1/log L0
sometimes referred to as the likelihood ratio index.
Because the log-likelihood is the sum of log probabilities, it
follows that log L0 ≤ log L1 < 0, from which it is straightforward to
show that both measures take values in the interval [0, 1] only.
If all estimated slope coefficients are equal to 0, we have
log L1 = log L0, such that the R-squared is equal to zero.
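A minimal sketch showing where these quantities appear in statsmodels output (Spector data): llf is log L1, llnull is log L0, llr is the LR statistic 2(log L1 − log L0), and prsquared is McFadden's measure:

import statsmodels.api as sm

data = sm.datasets.spector.load_pandas()
y, X = data.endog, sm.add_constant(data.exog)
res = sm.Logit(y, X).fit(disp=0)

print(res.llf, res.llnull)       # unrestricted and intercept-only log-likelihoods
print(res.llr, res.llr_pvalue)   # LR statistic and its chi-square p-value
print(res.prsquared)             # McFadden: 1 - llf/llnull
print(1 - res.llf / res.llnull)  # the same measure computed by hand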
Example: logit regression output (Stata)
. logistic car income hhs
Logistic regression Number of obs = 40
LR chi2(2) = 30.14
Prob > chi2 = 0.0000
Log likelihood = -12.605647 Pseudo R2 = 0.5445
car Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
income 2.192885 .7278499 2.37 0.018 1.144168 4.20283
hhs .7904801 .3257098 -0.57 0.568 .3525019 1.77264
_cons .0000231 .0000847 -2.91 0.004 1.75e-08 .0305203
Note: _cons estimates baseline odds.
Probit
. probit car income hhs
Iteration 0: log likelihood = -27.675866
Iteration 1: log likelihood = -12.781611
Iteration 2: log likelihood = -12.383587
Iteration 3: log likelihood = -12.375829
Iteration 4: log likelihood = -12.375827
Iteration 5: log likelihood = -12.375827
Probit regression Number of obs = 40
LR chi2(2) = 30.60
Prob > chi2 = 0.0000
Log likelihood = -12.375827 Pseudo R2 = 0.5528
car Coef. Std. Err. z P>|z| [95% Conf. Interval]
income .4607914 .1890465 2.44 0.015 .0902671 .8313158
hhs -.1360354 .2501562 -0.54 0.587 -.6263325 .3542617
_cons -6.252617 1.99502 -3.13 0.002 -10.16279 -2.342449
Thank You!