Econometrics 4
INDEPENDENT VARIABLES
STRUCTURE
4.0 Learning Objective
4.1 Introduction
4.10 Summary
4.11 Keywords
4.14 References
4.1 INTRODUCTION
Regression analysis often involves quantitative variables, such as monetary values, years of experience, or the proportion of persons sharing a particular attribute. However, there are situations when we want to include qualitative factors. For instance, after accounting for differences in experience and education, does gender or marital status matter for people's wages? Does race affect compensation or the likelihood of finding employment? Will the USA's trade patterns be significantly altered by the implementation of NAFTA? In each of these situations the variable we are interested in is categorical or qualitative: although it is not numerical in and of itself, some form of numerical coding can be applied to it.
Using the dummy variable approach, such variables may be included in regression analysis. Although this approach is extremely broad, let us begin with the most straightforward scenario, where the qualitative variable under consideration is binary, meaning that there are only two potential values (male versus female, before NAFTA versus after NAFTA).
As a rule, binary variables are coded with the values 0 and 1. For example, we may create a gender dummy variable with a value of 1 for the men in our sample and a value of 0 for the women, or a NAFTA dummy variable with a value of 0 for years before NAFTA and a value of 1 for years after it came into effect.
If you have a continuous dependent variable and several independent variables, you may use regression analysis to make predictions about the dependent variable. Use logistic regression if your dependent variable falls into two categories. If the proportion of cases falling into each of the two categories of the dependent variable is reasonably close to 50-50, then the findings from logistic and linear regression will be comparable. Regression can be performed with either continuous or categorical independent variables. When doing a regression analysis, it is possible to employ independent variables with more than two levels if they are transformed into variables with just two levels. This is an example of dummy coding, which will be described further on. Although regression may be used with transformed variables, it is most commonly employed with naturally occurring variables. Remember that causal links among the variables cannot be identified using regression analysis: although we say that X "predicts" Y, we cannot claim that X "causes" Y.
All the statistical methods we have developed so far have been for quantitative dependent variables,
measured on more-or-less continuous scales. The assumptions of linear regression—in particular,
that the mean value of the population at any combination of the independent variables be a linear
function of the independent variables and that the variation about the plane of means be normally
distributed—required that the dependent variable be measured on a continuous scale. In contrast,
because we did not need to make any assumptions about the nature of the independent variables,
we could incorporate qualitative or categorical information (such as whether or not a Martian was
exposed to secondhand tobacco smoke) into the independent variables of the regression model.
There are, however, many times when we would like to evaluate the effects of multiple independent
variables on a qualitative dependent variable, such as the presence or absence of a disease. Because
the methods that we have developed so far depend strongly on the continuous nature of the
dependent variable, we will have to develop a new approach to deal with the problem of regression
with a qualitative dependent variable.
To meet this need, we will develop two related statistical techniques: logistic regression, in this chapter, and the Cox proportional hazards model, in Chapter 13. Logistic regression is used when we are seeking to predict a dichotomous outcome from one or more independent variables, all of which are known at a given time. The Cox proportional hazards model is used when we are following individuals for varying lengths of time to see when events occur and how the pattern of events over time is influenced by one or more additional independent variables.
We need a way to estimate the coefficients in the regression model because the ordinary least-
squares criterion we have used so far is not relevant when we have a qualitative dependent variable.
We will use maximum likelihood estimation.
We need statistical hypothesis tests for the goodness of fit of the regression model and whether or
not the individual coefficients in the model are significantly different from zero, as well as
confidence intervals for the individual coefficients.
A dummy variable is a numerical variable used in regression analysis to represent subgroups of the sample in your study. In research design, a dummy variable is often used to distinguish different treatment groups. In the simplest case, we would use a 0,1 dummy variable where a person is given a value of 0 if they are in the control group or a 1 if they are in the treated group. Dummy variables are useful because they enable us to use a single regression equation to represent multiple groups. This means that we don't need to write out separate equation models for each subgroup. The dummy variables act like 'switches' that turn various parameters on and off in an equation. Another advantage of a 0,1 dummy-coded variable is that even though it is a nominal-level variable you can treat it statistically like an interval-level variable (if this made no sense to you, you probably should refresh your memory on levels of measurement). For instance, if you take an average of a 0,1 variable, the result is the proportion of 1s in the distribution.
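As a small illustration of this last point, here is a minimal Python sketch (the sample values are invented for illustration):

# Hypothetical 0,1 gender dummy: 1 = male, 0 = female
gender = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

# The average of a 0,1 variable is the proportion of 1s in the distribution
proportion_male = sum(gender) / len(gender)
print(proportion_male)  # 0.4, i.e. 40% of this sample is coded 1 (male)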
Fig.- 4.1 Dummy Variable Technique
To illustrate dummy variables, consider the simple regression model for a posttest-only two-group randomized experiment. This model is essentially the same as conducting a t-test on the posttest means for two groups or conducting a one-way Analysis of Variance (ANOVA). The key term in the model is b1, the estimate of the difference between the groups. To see how dummy variables work, we'll use this simple model to show you how to use them to pull out the separate sub-equations for each subgroup. Then we'll show how you estimate the difference between the subgroups by subtracting their respective equations. You'll see that we can pack an enormous amount of information into a single equation using dummy variables. All I want to show you here is that b1 is the difference between the treatment and control groups.
To see this, the first step is to compute what the equation would be for each of our two groups
separately. For the control group, Z = 0. When we substitute that into the equation, and recognize
that by assumption the error term averages to 0, we find that the predicted value for the control
group is b0, the intercept. Now, to figure out the treatment group line, we substitute the value of 1
for Z, again recognizing that by assumption the error term averages to 0. The equation for the
treatment group indicates that the treatment group value is the sum of the two beta values.
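Written out explicitly (assuming the usual form of this model, yi = b0 + b1*Zi + ei, where Z is the 0/1 treatment dummy), the two substitutions and the difference are:

Control group (Z = 0):    y = b0 + b1*0 = b0
Treatment group (Z = 1):  y = b0 + b1*1 = b0 + b1
Difference:               (b0 + b1) - b0 = b1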
Now, we're ready to move on to the second step: computing the difference between the groups. How do we determine that? Well, the difference must be the difference between the equations for the two groups that we worked out above. In other words, to find the difference between the groups we just find the difference between the equations for the two groups! It should be obvious from the figure that the difference is b1. Think about what this means. The difference between the groups is b1. OK, one more time just for the sheer heck of it: the difference between the groups is b1.
Whenever you have a regression model with dummy variables, you can always see how the variables are being used to represent multiple subgroup equations by following the two steps we used above:
1. Create separate equations for each subgroup by substituting the dummy values.
2. Find the difference between groups by finding the difference between their equations.
Dummy variables assign the numbers '0' and '1' to indicate membership in any mutually exclusive and exhaustive category.
1. The number of dummy variables necessary to represent a single attribute variable is equal to the
number of levels (categories) in that variable minus one.
2. For a given attribute variable, none of the dummy variables constructed can be redundant. That is,
one dummy variable cannot be a constant multiple or a simple linear relation of another.
3. The interaction of two attribute variables (e.g., Gender and Marital Status) is represented by a third dummy variable, which is simply the product of the two individual dummy variables.
A dummy variable (aka, an indicator variable) is a numeric variable that represents categorical data,
such as gender, race, political affiliation, etc.
Technically, dummy variables are dichotomous, quantitative variables. Their range of values is small;
they can take on only two quantitative values. As a practical matter, regression results are easiest to
interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the
presence of a qualitative attribute, and 0 represents the absence.
The number of dummy variables required to represent a particular categorical variable depends on
the number of values that the categorical variable can assume. To represent a categorical variable
that can assume k different values, a researcher would need to define k - 1 dummy variables.
For example, suppose we are interested in political affiliation, a categorical variable that might
assume three values - Republican, Democrat, or Independent. We could represent political affiliation
with two dummy variables:
X1 = 1, if Republican; X1 = 0, otherwise.
X2 = 1, if Democrat; X2 = 0, otherwise.
In this example, notice that we don't have to create a dummy variable to represent the "Independent" category of political affiliation. If X1 equals zero and X2 equals zero, we know the voter is neither Republican nor Democrat; therefore, the voter must be Independent.
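A minimal Python sketch of this coding (the sample of voters is invented; X1 and X2 follow the definitions above):

import pandas as pd

# Hypothetical sample of voters
parties = pd.Series(["Republican", "Democrat", "Independent", "Democrat", "Republican"])

# X1 = 1 if Republican, X2 = 1 if Democrat; Independent is the reference category
X1 = (parties == "Republican").astype(int)
X2 = (parties == "Democrat").astype(int)
print(pd.DataFrame({"X1": X1, "X2": X2}))
# A row with X1 = 0 and X2 = 0 is an Independent

pandas' get_dummies(parties, drop_first=True) automates the same k - 1 coding.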
A series of data can often contain a structural break, due to a change in policy or a sudden shock to the economy, e.g., the 1987 stock market crash. In order to test for a structural break, we often use the Chow test; this is Chow's first test (the second test relates to predictions). The model in effect uses an F-test to determine whether a single regression is more efficient than two separate regressions involving splitting the data into two sub-samples. This could occur as follows, where in the second case we have a structural break at time t:
[Figure: Case 1 shows a single regression line (Model 1) fitting all the data; Case 2 shows two separate lines, Model 1 before the structural break at time t and Model 2 after it.]
This suggests that Model 1 applies before the break at time t, and Model 2 applies after the structural break. If the parameters (intercept and slope) in the two models are the same, then Models 1 and 2 can be expressed as a single model as in Case 1, where there is a single regression line. The Chow test basically tests whether the single regression line or the two separate regression lines fit the data best. The stages in running the Chow test are:
1. Firstly, run the regression using all the data, before and after the structural break, collect
RSSc.
2. Run two separate regressions on the data before and after the structural break, collecting the
RSS in both cases, giving RSS1 and RSS2.
3. Using these three values, calculate the test statistic from the following formula:
F = [(RSSc - (RSS1 + RSS2)) / k] / [(RSS1 + RSS2) / (n - 2k)]
where k is the number of parameters estimated in each regression and n is the total number of observations.
4. Find the critical value in the F-test tables; in this case the statistic has F(k, n - 2k) degrees of freedom.
5. Conclude; the null hypothesis is that there is no structural break.
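The steps above can be sketched in a few lines of Python; this is an illustrative implementation only (the function names, the simple-regression design with k = 2 parameters, and the breakpoint index are assumptions, not part of the text):

import numpy as np
from scipy import stats

def rss(y, X):
    # Residual sum of squares from an OLS fit of y on X (X already includes a constant)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def chow_test(y, x, break_idx):
    X = np.column_stack([np.ones_like(x), x])    # simple regression: k = 2 parameters
    n, k = len(y), X.shape[1]
    rss_c = rss(y, X)                            # step 1: pooled regression
    rss_1 = rss(y[:break_idx], X[:break_idx])    # step 2: before the break
    rss_2 = rss(y[break_idx:], X[break_idx:])    #         after the break
    f_stat = ((rss_c - (rss_1 + rss_2)) / k) / ((rss_1 + rss_2) / (n - 2 * k))  # step 3
    p_value = 1 - stats.f.cdf(f_stat, k, n - 2 * k)                             # step 4
    return f_stat, p_value

# Example usage: y and x are 1-D numpy arrays with a suspected break at observation 50
# f_stat, p = chow_test(y, x, break_idx=50)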
Multicollinearity
This occurs when there is an approximate linear relationship between the explanatory variables, which can lead to unreliable regression estimates, although the OLS estimates are still BLUE. In general, it leads to the standard errors of the parameters being too large, and therefore the t-statistics tend to be insignificant. The explanatory variables are always related to some extent, and in most cases this is not a problem; it only becomes one when the relationship is too strong. One difficulty is that multicollinearity is hard to detect and to judge as a problem. The main ways of detecting it are:
The regression has a high R2 statistic, but few if any of the t-statistics on the explanatory variables are significant.
The simple correlation coefficient between the two explanatory variables in question can be used, although the cut-off between acceptable and unacceptable correlation can be a problem (a short sketch of this check follows below).
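As a sketch of the second check, the snippet below computes the simple correlation between two explanatory variables and, as an additional diagnostic not mentioned above, their variance inflation factors (the simulated data are purely illustrative):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.05, size=200)   # x2 is almost a linear function of x1
X = pd.DataFrame({"x1": x1, "x2": x2})

# Simple correlation between the two explanatory variables
print(X.corr())

# Variance inflation factors (values far above 10 are commonly read as a warning sign)
X_const = np.column_stack([np.ones(len(X)), X.values])
for i, name in enumerate(X.columns, start=1):
    print(name, variance_inflation_factor(X_const, i))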
In statistics, an interaction may arise when considering the relationship among three or more
variables, and describes a situation in which the effect of one causal variable on an outcome
depends on the state of a second causal variable (that is, when effects of the two causes are not
additive). Although commonly thought of in terms of causal relationships, the concept of an
interaction can also describe non-causal associations. Interactions are often considered in the
context of regression analyses or factorial experiments.
The presence of interactions can have important implications for the interpretation of statistical
models. If two variables of interest interact, the relationship between each of the interacting
variables and a third "dependent variable" depends on the value of the other interacting variable. In
practice, this makes it more difficult to predict the consequences of changing the value of a variable,
particularly if the variables it interacts with are hard to measure or difficult to control.
An interaction in statistics refers to a scenario where the impact of one causal variable on an outcome relies on the state of a second causal variable, and it may occur when studying the link among three or more variables (that is, when the effects of the two causes are not additive).[1][2] Although causal linkages are frequently conceived of in terms of interactions, non-causal correlations can also
be described by interactions. Regression analysis and factorial experiments frequently take
interactions into account.
Interactions can have significant effects on how statistical models should be interpreted. When two
variables of interest interact, each interacting variable's connection with a third "dependent
variable" is based on the value of the other interacting variable. In actuality, this makes it more
challenging to forecast the effects of altering the value of a variable, especially if the factors with
which it interacts are challenging to measure or to control.
In social and health science research, the concept of "interaction" is closely connected to that of
"moderation": the interaction of an explanatory variable and an environmental variable implies that
the explanatory variable's influence has been changed or moderated by the environmental variable.
For example, a binary factor A and a quantitative variable X may interact (be non-additive) when analyzed with respect to an outcome variable Y.
Thus, for a response Y and two variables x1 and x2, an additive model would be:
Y = β0 + β1*x1 + β2*x2 + error
In contrast to this,
Y = β0 + β1*x1 + β2*x2 + β3*x1*x2 + error
is an example of a model with an interaction between variables x1 and x2 ("error" refers to the random variable whose value is that by which Y differs from the expected value of Y; see errors and residuals in statistics). Often, models are presented without the interaction term β3*x1*x2, but this confounds the main effect and interaction effect (i.e., without specifying the interaction term, it is possible that any main effect found is actually due to an interaction).
An interaction effect is the simultaneous effect of two or more independent variables on at least one
dependent variable in which their joint effect is significantly greater (or significantly less) than the
sum of the parts. It helps in understanding how two or more independent variables work in tandem
to impact the dependent variable.
It is important to understand two components first: main effects and interaction effects.
When the impact of one variable relies on the value of another, we speak of an interaction effect. In regression models, ANOVA, and well-constructed studies, interaction effects frequently arise. In this section, we'll go through what interaction effects are, how to test for them, how to interpret interaction models, and the potential pitfalls of ignoring them altogether.
Many factors can alter the results of any experiment, whether it's a taste test or production process research, and the results are highly sensitive to these factors. Changing the condiment used in a taste test, for instance, may have a significant impact on the participants' enjoyment of the meal. Analysts use models to evaluate the strength of the association between each independent variable and the dependent variable; this type of result is called a main effect. Even though identifying main effects is usually straightforward, focusing solely on them may be misleading.
The independent variables may interact with one another in more intricate research domains. When a third factor enters the relationship between a given independent and dependent variable, we say that there is an interaction effect: the connection between an independent and dependent variable shifts based on the value of a third variable, leading statisticians to conclude that these variables interact. If the real world behaves in this way, it is essential to include this kind of influence in your model. As we'll see, the link between condiments and enjoyment likely varies with the type of cuisine.
To illustrate the use of interaction effects with categorical independent variables, consider the example that follows. Interaction effects may be thought of as an "it depends" type of effect; to begin conceptualizing these effects in an interaction model, let's look at an intuitive example.
Main Effects:
A main effect is the effect of a single independent variable on the dependent variable, ignoring the effects of all other independent variables.
Interaction Effect:
As mentioned above, the simultaneous effect of two or more independent variables on at least one
dependent variable in which their joint effect is significantly greater (or significantly less) than the
sum of the parts.
This discussion is limited to interactions between two variables, which may be:
1. Categorical variables
2. Continuous variables
3. One categorical and one continuous variable
For each of these scenarios, the interpretation varies slightly.
Imagine someone is trying to lose weight. Weight loss could be a result of exercising, of following a diet plan, or of both working in tandem. Suppose we tabulate the resulting weight loss in kg for each combination of exercising and dieting (the original tables are not reproduced here).
In the first scenario, what does the result indicate?
a) It shows that exercising alone is more effective than the diet plan and results in 5 kg of weight loss.
b) Exercising alone causes more weight loss than exercising and the diet plan followed together (your diet plan is not working).
In the second scenario, what does the result indicate? It shows that the weight loss is higher when exercising and the diet plan are implemented together. So, we can say that there is an interaction effect between exercising and the diet plan.
Let us view a regression equation showing both main effect and interaction effect components:
Y = β0 + β1*X1 + β2*X2 + β3*X1X2
The above equation is interpreted as follows:
a) β1 is the effect of X1 on Y when X2 equals 0, i.e., a one unit increase in X1 causes a β1 unit increase in Y, when X2 equals 0.
b) Similarly, β2 is the effect of X2 on Y when X1 equals 0, i.e., a one unit increase in X2 causes a β2 unit increase in Y, when X1 equals 0.
c) In case neither X1 nor X2 is zero, the effect of X1 on Y depends on X2 and the effect of X2 on Y depends on X1.
To make it clearer, let us rewrite the above equation in another format:
Y = β0 + (β1 + β3*X2)X1 + β2*X2
=> Y = β0 + β1*X1 + (β2 + β3*X1)X2
=> (β1 + β3*X2) is the effect of X1 on Y, and it depends on the value of X2
=> (β2 + β3*X1) is the effect of X2 on Y, and it depends on the value of X1
Please note that this discussion has been written with respect to inputs/variables used for Market Mix Modeling (MMM). The above concept is a likely scenario for MMM, where the inputs can take a zero value.
For a scenario where input variables cannot be zero, other measures are taken. An example could be a model where a person's weight is considered as one of the regressors; a person's weight cannot be zero.
The interaction between one categorical variable and one continuous variable is similar to two
continuous variables.
Let's go back to our regression equation:
Y = β0 + β1*X1 + β2*X2 + β3*X1X2
where X1 is a categorical variable, say (Female = 1, Male = 0), and X2 is a continuous variable.
When X1 = 0, Y = β0 + β2*X2
=> a one unit increase in X2 will cause a β2 unit increase in Y for males.
When X1 = 1, Y = β0 + β1 + (β2 + β3)*X2
=> a one unit increase in X2 will cause a (β2 + β3) unit increase in Y for females.
The effect of X2 on Y is therefore higher for females than for males.
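A minimal Python sketch of fitting such an interaction with statsmodels' formula interface (the data are simulated so that the slope on X2 is steeper for females, purely to illustrate the interpretation above):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
female = rng.integers(0, 2, size=n)            # X1: 1 = female, 0 = male
x2 = rng.normal(size=n)                        # X2: continuous predictor
# True coefficients: slope of x2 is 1 for males and 1 + 0.5 for females
y = 2 + 0.3 * female + (1 + 0.5 * female) * x2 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"y": y, "female": female, "x2": x2})
model = smf.ols("y ~ female * x2", data=df).fit()   # expands to female + x2 + female:x2
print(model.params)   # estimates of β0, β1 (female), β2 (x2), and β3 (female:x2)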
Let's take two categorical variables: seasonality and a product launch.
Assume that both seasonality and launch of product have a positive relationship with sales.
Seasonality and product launch in their individual capacity will lead to sales. If there is an
interaction effect between them, this might lead to incremental sales.
Y = β0 + β1* Seasonality + β2*Product launch + β3* Seasonality * Product Launch
=> Y = β0 + β1 + β2 + β3
where Seasonality and Product Launch = 1
In case there is no interaction, Y = β0 + β1 + β2
Y = β0 + β1* TV Ad + β2*Digital Ad - β3* TV Ad * Digital Ad -> Negative interaction term
If the interaction term is negative, the interaction component takes away part of the sales, thus reducing overall sales. In this scenario, it is advisable not to run both campaigns simultaneously, as doing so takes away sales (your campaigns may be creating confusion among customers).
Note that the main effects of these two inputs are positive, but the combined effect has a negative beta value, resulting in a reduction in total sales.
In statistics, linear regression is a linear approach for modelling the relationship between a scalar
response and one or more explanatory variables (also known as dependent and independent
variables). The case of one explanatory variable is called simple linear regression; for more than one,
the process is called multiple linear regression. This term is distinct from multivariate linear
regression, where multiple correlated dependent variables are predicted, rather than a single scalar
variable.
In linear regression, the relationships are modeled using linear predictor functions whose unknown
model parameters are estimated from the data. Such models are called linear models. Most
commonly, the conditional mean of the response given the values of the explanatory variables (or
predictors) is assumed to be an affine function of those values; less commonly, the conditional
median or some other quantile is used. Like all forms of regression analysis, linear regression focuses
on the conditional probability distribution of the response given the values of the predictors, rather
than on the joint probability distribution of all of these variables, which is the domain of multivariate
analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and to be used
extensively in practical applications. This is because models which depend linearly on their unknown
parameters are easier to fit than models which are non-linearly related to their parameters and
because the statistical properties of the resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the following two broad
categories:
If the goal is prediction, forecasting, or error reduction, linear regression can be
used to fit a predictive model to an observed data set of values of the response and explanatory
variables. After developing such a model, if additional values of the explanatory variables are
collected without an accompanying response value, the fitted model can be used to make a
prediction of the response.
If the goal is to explain variation in the response variable that can be attributed to variation in the
explanatory variables, linear regression analysis can be applied to quantify the strength of the
relationship between the response and the explanatory variables, and in particular to determine
whether some explanatory variables may have no linear relationship with the response at all, or to
identify which subsets of explanatory variables may contain redundant information about the
response.
Linear regression models are often fitted using the least squares approach, but they may also be
fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least
absolute deviations regression), or by minimizing a penalized version of the least squares cost
function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least
squares approach can be used to fit models that are not linear models. Thus, although the terms
"least squares" and "linear model" are closely linked, they are not synonymous.
For a relationship between a response variable (Y) and an explanatory variable (X), different
linear relationships may apply for different ranges of X. A single linear model will not provide an
adequate description of the relationship. Often a non-linear model will be most appropriate in this
situation, but sometimes there is a clear break point demarcating two different linear relationships.
Piecewise linear regression is a form of regression that allows multiple linear models to be fitted to
the data for different ranges of X.
The regression function at the breakpoint may be discontinuous, but it is possible to specify the
model such that the model is continuous at all points. For such a model the two equations for Y
need to be equal at the breakpoint. Non-linear least squares regression techniques can be used to
fit the model to the data.
Linear regression is a method for modelling the correlation between a scalar response and a set of
predictors that are all assumed to be linearly related (also known as dependent and independent
variables). A basic linear regression is performed when there is only one explanatory variable,
whereas a multiple linear regression analysis is performed when there are numerous explanatory
variables. This word is used in contrast to multivariate linear regression, which predicts several
dependent variables that are all interrelated.
Linear regression is a method for modelling relationships by estimating the unknown model parameters from the available data. Linear models are one kind of statistical representation. The conditional mean of the response is most often employed, since it is assumed to be an affine function of the explanatory factors (or predictors), while the conditional median or another quantile is used less frequently. Linear regression, like other types of regression analysis, is concerned with the conditional probability distribution of the response given the values of the predictors, as opposed to the joint probability distribution of all of these variables, which is the province of multivariate analysis.
Among the many types of regression analysis, linear regression was the first to be researched in
depth and put to widespread use in the real world. This is because it is simpler to fit linear models
to data and to assess the statistical features of the resulting estimators than it is to fit non-linear
models to data.
The field of linear regression has several applications. Most of them may be classified as one of these two types:
To fit a predictive model to an observed data set of values of the response and explanatory variables, linear regression is commonly employed when the objective is prediction, forecasting, or error reduction. Once such a model has been developed, it may be used to predict the response if new values of the explanatory variables are gathered without a corresponding response value.
To determine whether some explanatory variables may have no linear relationship with the
response at all, or to identify which subsets of explanatory variables may contain redundant
information, linear regression analysis can be applied if the goal is to explain variation in the
response variable that can be attributed to variation in the explanatory variables.
Although least squares is the most used method for fitting linear regression models, there are alternative methods that may be used instead. For example, least absolute deviations regression, ridge regression (L2-norm penalty), or lasso regression (L1-norm penalty) can be used to build linear regression models. Alternatively, non-linear models can be fitted using the least squares method. Therefore, although "least squares" and "linear model" are often used interchangeably, they are not the same thing.
There is often no linearity in real-world data. Fitting a line and obtaining a perfect model on nonlinear and non-monotonic datasets is notoriously challenging. Though sophisticated models such as SVMs, trees, and neural networks are available, they often come at the expense of easy explanation and interpretation.
When the decision boundaries are not convoluted, is there a compromise that can be made?
As the name suggests, piecewise regression divides a data set into smaller segments, analyzed independently, and then applies a linear regression to each subset. The points at which two parts separate are known as breakpoints.
Using a small dataset for demonstration, one can plot and compare the results of a linear and a piecewise linear regression analysis.
To solve the problem, piecewise regression searches for the set of breakpoints that results in the smallest sum of squared errors. The minimum sum of squared errors is achieved by least squares fitting within each segment. When a problem spans several segments, finding the best places to make the cuts quickly can be expedited through the use of a multi-start gradient-based search.
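A minimal Python sketch of this idea, using a brute-force search over candidate breakpoints rather than the gradient-based search mentioned above (the data are simulated; a real application would typically use a dedicated library):

import numpy as np

def fit_piecewise(x, y, candidates):
    # Fit two straight lines split at each candidate breakpoint; keep the split with the lowest SSE
    best = None
    for b in candidates:
        left, right = x <= b, x > b
        if left.sum() < 2 or right.sum() < 2:
            continue
        sse = 0.0
        for mask in (left, right):
            X = np.column_stack([np.ones(mask.sum()), x[mask]])
            beta, _, _, _ = np.linalg.lstsq(X, y[mask], rcond=None)
            resid = y[mask] - X @ beta
            sse += float(resid @ resid)
        if best is None or sse < best[1]:
            best = (b, sse)
    return best   # (breakpoint, total sum of squared errors)

# Simulated data with a true break at x = 5
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 200))
y = np.where(x <= 5, 1 + 2 * x, 11 + 0.2 * (x - 5)) + rng.normal(scale=0.5, size=200)
print(fit_piecewise(x, y, candidates=np.linspace(1, 9, 81)))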
In highly regulated business situations such as credit decisions and risk-based simulation, where model explainability is necessary, a piecewise linear function is utilized to eliminate model bias by segmenting on critical decision factors.
Piecewise Function: How to Use It?
The assumption of linearity between independent and dependent variables is crucial to the linear
regression model. To partition nonlinear variables into linear decision boundaries, a piecewise
model can be used inside a final linear model.
Piecewise independent nonlinear variables are segmented into intervals, and the properties of
these intervals are then included independently into linear regression models.
Non-linearity may be handled in a number of ways, including the use of polynomial functions; however, in order to describe variables with complex structure, one is often left needing high-degree polynomial terms, which can make the models unstable.
Assumptions
Standard linear regression models with standard estimation techniques make a number of
assumptions about the predictor variables, the response variables and their relationship. Numerous
extensions have been developed that allow each of these assumptions to be relaxed (i.e., reduced
to a weaker form), and in some cases eliminated entirely. Generally, these extensions make the
estimation procedure more complex and time-consuming, and may also require more data in order
to produce an equally precise model.
The following are the major assumptions made by standard linear regression models with standard
estimation techniques (e.g., ordinary least squares):
Weak exogeneity. This essentially means that the predictor variables x can be treated as fixed
values, rather than random variables. This means, for example, that the predictor variables are
assumed to be error-free—that is, not contaminated with measurement errors. Although this
assumption is not realistic in many settings, dropping it leads to significantly more difficult errors-
in-variables models.
Linearity. This means that the mean of the response variable is a linear combination of the
parameters (regression coefficients) and the predictor variables. Note that this assumption is much
less restrictive than it may at first seem. Because the predictor variables are treated as fixed values
(see above), linearity is really only a restriction on the parameters. The predictor variables
themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying
predictor variable can be added, each one transformed differently. This technique is used, for
example, in polynomial regression, which uses linear regression to fit the response variable as an
arbitrary polynomial function (up to a given degree) of a predictor variable. With this much
flexibility, models such as polynomial regression often have "too much power", in that they tend to
overfit the data. As a result, some kind of regularization must typically be used to prevent
unreasonable solutions coming out of the estimation process. Common examples are ridge
regression and lasso regression. Bayesian linear regression can also be used, which by its nature is
more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression
can both be viewed as special cases of Bayesian linear regression, with particular types of prior
distributions placed on the regression coefficients.)
Constant variance (a.k.a. homoscedasticity). This means that the variance of the errors does not
depend on the values of the predictor variables. Thus, the variability of the responses for given
fixed values of the predictors is the same regardless of how large or small the responses are. This is
often not the case, as a variable whose mean is large will typically have a greater variance than one
whose mean is small. For example, a person whose income is predicted to be $100,000 may easily
have an actual income of $80,000 or $120,000—i.e., a standard deviation of around $20,000—
while another person with a predicted income of $10,000 is unlikely to have the same $20,000
standard deviation, since that would imply their actual income could vary anywhere between
−$10,000 and $30,000. (In fact, as this shows, in many cases—often the same cases where the
assumption of normally distributed errors fails—the variance or standard deviation should be
predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity
is called heteroscedasticity. In order to check this assumption, a plot of residuals versus predicted
values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e.,
increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the
absolute or squared residuals versus the predicted values (or each predictor) can also be examined
for a trend or curvature. Formal tests can also be used; see Heteroscedasticity. The presence of
heteroscedasticity will result in an overall "average" estimate of variance being used instead of one
that takes into account the true variance structure. This leads to less precise (but in the case of
ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in
misleading tests and interval estimates. The mean squared error for the model will also be wrong.
Various estimation techniques including weighted least squares and the use of heteroscedasticity-
consistent standard errors can handle heteroscedasticity in a quite general way. Bayesian linear
regression techniques can also be used when the variance is assumed to be a function of the mean.
It is also possible in some cases to fix the problem by applying a transformation to the response
variable (e.g., fitting the logarithm of the response variable using a linear regression model, which
implies that the response variable itself has a log-normal distribution rather than a normal
distribution).
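As a sketch of these checks (the residual-versus-fitted inspection and one of the formal tests, here the Breusch-Pagan test via statsmodels), with data simulated to be deliberately heteroscedastic:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 300)
y = 2 + 3 * x + rng.normal(scale=x, size=300)    # error spread grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Informal check: plot fit.resid against fit.fittedvalues and look for a "fanning" pattern
residuals, fitted = fit.resid, fit.fittedvalues

# Formal check: Breusch-Pagan test (a small p-value points towards heteroscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)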
Independence of errors. This assumes that the errors of the response variables are uncorrelated
with each other. (Actual statistical independence is a stronger condition than mere lack of
correlation and is often not needed, although it can be exploited if it is known to hold.) Some
methods such as generalized least squares are capable of handling correlated errors, although they
typically require significantly more data unless some sort of regularization is used to bias the model
towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this
issue.
Lack of perfect multicollinearity in the predictors. For standard least squares estimation methods,
the design matrix X must have full column rank p; otherwise, perfect multicollinearity exists in the
predictor variables, meaning a linear relationship exists between two or more predictor variables.
This can be caused by accidentally duplicating a variable in the data, using a linear transformation
of a variable along with the original (e.g., the same temperature measurements expressed in
Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such
as their mean. It can also happen if there is too little data available compared to the number of
parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of
this assumption, where predictors are highly but not perfectly correlated, can reduce the precision
of parameter estimates (see Variance inflation factor). In the case of perfect multicollinearity, the
parameter vector β will be non-identifiable—it has no unique solution. In such a case, only some of
the parameters can be identified (i.e., their values can only be estimated within some linear
subspace of the full parameter space Rp). See partial least squares regression. Methods for fitting
linear models with multicollinearity have been developed,[5][6][7][8] some of which require
additional assumptions such as "effect sparsity"—that a large fraction of the effects are exactly
zero. Note that the more computationally expensive iterated algorithms for parameter estimation,
such as those used in generalized linear models, do not suffer from this problem.
Beyond these assumptions, several other statistical properties of the data strongly influence the
performance of different estimation methods:
The statistical relationship between the error terms and the regressors plays an important role in
determining whether an estimation procedure has desirable sampling properties such as being
unbiased and consistent.
The arrangement, or probability distribution of the predictor variables x has a major influence on
the precision of estimates of β. Sampling and design of experiments are highly developed subfields
of statistics that provide guidance for collecting data in such a way to achieve a precise estimate of
β.
In statistics and econometrics, particularly in regression analysis, a dummy variable is one that
takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may
be expected to shift the outcome. They can be thought of as numeric stand-ins for qualitative facts
in a regression model, sorting data into mutually exclusive categories (such as smoker and non-
smoker).
A dummy independent variable (also called a dummy explanatory variable) which for some
observation has a value of 0 will cause that variable's coefficient to have no role in influencing the
dependent variable, while when the dummy takes on a value 1 its coefficient acts to alter the
intercept. For example, suppose membership in a group is one of the qualitative variables relevant
to a regression. If group membership is arbitrarily assigned the value of 1, then all others would get
the value 0. Then the intercept would be the constant term for non-members but would be the
constant term plus the coefficient of the membership dummy in the case of group members.
Dummy variables are used frequently in time series analysis with regime switching, seasonal
analysis and qualitative data applications.
All of the independent (X) variables in a regression analysis are interpreted numerically. These are interval- or ratio-scale variables, for which statements such as "10 is twice as much as 5" or "3 minus 1 = 2" are meaningful. On the other hand, you may wish to incorporate an attribute or nominal-scale variable, such as "Brand Name" or "Defect Type", in your research. Suppose you have identified three distinct defect types and have labelled them '1', '2', and '3'. In this context, the expression "three minus one" has no meaning: Defect 1 cannot be subtracted from Defect 3. The numerical values used to label each "Defect Type" are only descriptive and have no bearing on the nature of the defects themselves. In such cases, dummy variables are introduced to "trick" the regression algorithm into producing accurate results.
Dummy variables are dichotomous variables derived from a more complex variable.
A dichotomous variable is the simplest form of data. For example, color (e.g., Black = 0; White = 1).
It may be necessary to dummy code variables in order to meet the assumptions of some analyses.
A common scenario is to dummy code a categorical variable for use as a predictor in multiple linear
regression (MLR).
For example, we may have data about participants' religion, with each participant coded as follows:
A categorical or nominal variable with three categories:
Religion    Code
Christian   1
Muslim      2
Atheist     3
This is a categorical variable which would be inappropriate to use in this format as a predictor in
MLR. However, this variable could be represented using a series of three dichotomous variables
(coded as 0 or 1), as follows:
Religion    Christian   Muslim   Atheist
Christian   1           0        0
Muslim      0           1        0
Atheist     0           0        1
There is some redundancy in this dummy coding. For instance, if we know that someone is not
Christian and not Muslim, then they are Atheist.
So, we only need to use two of the three dummy-coded variables as predictors. More generally, the
number of dummy-coded variables needed is one less than the number of categories (k - 1, where
k is the original number of categories). If all dummy variables were used, there would be
multicollinearity.
Choosing which dummy variables to use is arbitrary, but depends on the researcher's logic. The dummy variable not used becomes the reference category. Then, and this is the tricky part conceptually, all other dummy variables will predict the outcome variable in relation to that reference category.
For example, if I'm particularly interested in whether atheism is associated with higher rates of
depression, then use the dummy coded variables for:
Christian (0 = Not Christian or 1 = Christian)
Muslim (0 = Not Muslim or 1 = Muslim)
If the regression coefficient for the Christian dummy coded variable is:
not significant, then whether someone is Christian vs. Atheist isn't related to their depression
significant and positive, then Christian people tend to be more depressed than Atheists
significant and negative, then Christian people tend to be less depressed than Atheists
If the regression coefficient for the Muslim dummy coded variable is:
not significant, then whether someone is Muslim vs. Atheist isn't related to their depression
significant and positive, then Muslim people tend to be more depressed than Atheists
significant and negative, then Muslim people tend to be less depressed than Atheists
Alternatively, I may simply be interested to recode the data into a single dichotomous variable to
indicate, for example, whether a participant is Atheist (0) or Religious (1), where Religious category
consists of those who are either Christian or Muslim. The coding would be as follows:
Atheism 0
Religious 1
4.7 DUMMY VARIABLE TRAP
The dummy variable trap is a scenario in which attributes are highly correlated (multicollinear) and one variable predicts the value of the others. When we use one-hot encoding to handle categorical data, one dummy variable (attribute) can be predicted with the help of the other dummy variables; hence, each dummy variable is highly correlated with the others. Using all of the dummy variables in a regression model leads to the dummy variable trap, so regression models should be designed excluding one dummy variable.
For example, consider the case of gender, coded with two dummy variables: male (1 or 0) and female (1 or 0). Including both dummy variables causes redundancy, because if a person is not male then that person is female; hence, we don't need to use both variables in a regression model. Excluding one of them protects us from the dummy variable trap.
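A minimal Python sketch showing the trap and the standard fix of excluding one dummy (the data are invented):

import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male", "female"]})

# Full one-hot encoding: the "female" and "male" indicator columns are mutually
# exclusive and exhaustive (exactly one is always "on"), so they are perfectly collinear
full = pd.get_dummies(df["gender"])
print(full)

# Dropping one category removes the redundancy and avoids the dummy variable trap
safe = pd.get_dummies(df["gender"], drop_first=True)
print(safe)   # a single indicator column for "male", with "female" as the reference category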
In statistical terms, a dummy variable can be used for qualitative data analysis, time series analysis, and other purposes. This section introduces the idea of the dummy variable trap and provides a basic grasp of it in the context of regression analysis, where such variables are simply called dummy variables.
Sub-sample analysis and investigation are reflected by dummy variables in the regression model.
Categories such as gender, age, height, and weight can be represented by "Dummy Variables," which
are assigned numeric values to serve as stand-ins. These dummy variables take only the values 0 and 1 and are used in the regression model alongside quantitative variables. In a collection of categorical data, they show whether something is absent or present, with zero indicating absence and one indicating presence.
Let's start with the definition of the dummy variable trap. It arises when the dummy variables are correlated with one another (multicollinear), so that one variable can be used as a predictor of the others. When one-hot encoding is applied to a categorical data set, any one dummy variable can be forecast from the remaining dummy variables. The situation that results when all of these dummy variables are employed together in a regression model is known as the dummy variable trap.
This is a common issue in straightforward linear regression, where the dependent variable is treated as continuous while the independent variables may be either continuous or categorical. The next thing to do is go through the differences between how dummy variables in a regression analysis are interpreted and how continuous variables in a linear model are interpreted.
When two or more dummy variables generated using one-hot encoding are highly correlated (multicollinear), the phenomenon known as the dummy variable trap arises. As a result, it is challenging to interpret the estimated coefficients in regression models, as one variable can be inferred from the others. Simply put, due to multicollinearity, it is difficult to draw conclusions about the influence of each dummy variable on the prediction model in isolation.
With one-hot encoding, a separate dummy variable is generated for each category of a categorical variable to indicate its presence (1) or absence (0). In this way, a categorical variable such as "tree species," which may take the values "pine" and "oak," can be represented by translating each value into a one-hot vector. This results in two columns, the first indicating whether or not the tree is a pine and the second whether or not it is an oak. If the tree in question is of the species represented in a given column, that column will contain a 1; otherwise, it will contain a 0. These two columns are multicollinear, because knowing one of them determines the other.
There are many types of data used in statistics, and regression models in particular must be able to handle them all. The information collected may be quantitative (numerical) or qualitative (categorical). Regression models work well with numerical data, but we can't use categorical data without first transforming it.
The label encoding process allows us to convert categorical features into numeric ones (label encoding assigns a unique integer to each category of data). However, this technique alone is not ideal, so regression models typically employ one-hot encoding after label encoding. This allows us to generate as many new attributes as there are classes in the corresponding categorical attribute: if the latter has n classes, then n new attributes are produced. These newly created attributes are the dummy variables; in regression models, dummy variables stand in for actual categories of information.
Each attribute constructed with one-hot encoding is assigned a value of 0 or 1 to indicate the attribute's presence or absence.
When attributes are strongly linked (multicollinear) and one variable predicts the value of the others, the situation known as the dummy variable trap arises. We can anticipate the value of one dummy variable (attribute) using the values of the other dummy variables when we utilize one-hot encoding to deal with categorical data. It follows that there is a strong relationship between the dummy variables, and the dummy variable trap occurs when regression models use all of them. As a result, when developing regression models, one dummy variable should be left out.
Just one example: consider the scenario where gender can take on the values male (coded 1 or 0) or female (coded 1 or 0). Since a person who is not male must be female, including both the male dummy and the female dummy in a regression model is unnecessary; dropping one of them keeps us safe from the dummy variable trap.
In linear regression models, to create a model that can infer relationship between features (having
categorical data) and the outcome, we use the dummy variable technique.
A “Dummy Variable” or “Indicator Variable” is an artificial variable created to represent an attribute
with two or more distinct categories/levels.
The dummy variable trap is a scenario in which the independent variables become multicollinear
after addition of dummy variables.
Multicollinearity is a phenomenon in which two or more variables are highly correlated. In simple
words, it means value of one variable can be predicted from the values of other variable(s).
In statistics, especially in regression models, we deal with various kinds of data. The data may be quantitative (numerical) or qualitative (categorical). Numerical data can be easily handled in regression models, but we can't use categorical data directly; it needs to be transformed in some way.
For transforming a categorical attribute into numerical attributes, we can use the label encoding procedure (label encoding assigns a unique integer to each category of data). But this procedure alone is not that suitable; hence, one-hot encoding is used in regression models following label encoding. This enables us to create new attributes according to the number of classes present in the categorical attribute, i.e., if there are n categories in the categorical attribute, n new attributes will be created. These attributes are called dummy variables. Hence, dummy variables are "proxy" variables for categorical data in regression models.
These dummy variables will be created with one-hot encoding, and each attribute will have a value of either 0 or 1, representing the presence or absence of that attribute.
Dummy independent variables in regressions have been introduced and understood in previous
chapters. We have seen how testing for gender-based pay disparities may be done with 0/1
variables like Female (1 if female, 0 if male). These variables can only take on two possible values,
true or false. However, Y has functioned as a continuous variable throughout the analysis. That is,
the Y variable has always taken on many different values in all the regressions we have seen thus far,
from the first SAT score regression through the several earnings function regressions.
In this section, we'll look at models in which the dependent variable is a dummy or dichotomous
variable. Binary response, dichotomous choice, and qualitative response models are all names for
this type of structure.
Dummy dependent variable models necessitate quite advanced econometrics and are challenging to handle with our standard regression methods. We present the subject with a strong focus on intuition and graphical analysis, in line with our pedagogical approach. The box model and its associated error term are the main points of attention, and we continue to use Monte Carlo simulation to explain the role played by chance. Although the subject matter is still challenging, we are confident that this approach significantly improves comprehension.
Specifically, What Does a Model With a Dummy Dependent Variable Mean?
That's a simple question to answer. A dummy dependent variable model has a qualitative rather than quantitative Y variable (sometimes called the response, left-hand side, or dependent variable).
One's yearly income is a quantitative value that might be anything from zero to many millions of dollars. Similarly, the unemployment rate is a quantitative statistic, calculated by dividing the total number of jobless individuals by the total number of people in the labor force in a certain area (county, state, or nation). This fraction is expressed as a percentage (e.g., 4.3 or 6.7 percent). The relationship between unemployment and income can be represented by a cloud of dots in a scatter diagram.
On the other hand, the decision to emigrate is qualitative, taking on the values 0 (do not emigrate) or 1 (do emigrate). If we were to plot Emigrate against the unemployment rate in each county as a scatter diagram, we wouldn't see a cloud. One horizontal strip would show unemployment rates in different counties for those who did not emigrate, and the other would show the same data for people who did leave the country.
As a qualitative variable, your political affiliation can take on values such as 0 for Democrat, 1 for Republican, 2 for Libertarian, 3 for Green, 4 for Other, and 5 for Independent. The numbers themselves are arbitrary: the mean and standard deviation of the values 0, 1, 2, 3, 4, and 5 have no significance. In a scatter plot of political affiliation and annual income, each political party's numerical value would be represented by a horizontal strip.
Binary choice models are commonly used when the qualitative dependent variable has precisely two values (like Emigrate). An appropriate representation of the dependent variable here is a dummy variable with values of 0 and 1. A multiresponse, multinomial, or polychotomous model is one in which the qualitative dependent variable can take on more than two values (such as Political Party). Models with a qualitative dependent variable that can take on more than two values present extra challenges in terms of interpretation and estimation and are outside the scope of this chapter.
Dummy variables, which only contain 1s and 0s, can likewise serve as the dependent variable. When the variable assumes the value 1, it is considered a success. Consider the case of home ownership or mortgage approval, where the dummy variable would be assigned the value 1 if the individual was a homeowner and 0 otherwise. This variable may then be regressed on a number of different factors, including both the typical continuous ones and additional dummy variables. A scatterplot for such a model might look like this:
Fig. -4.9 LPM
Although this method, known as the Linear Probability Model (LPM), is commonly used for
estimation, it has a number of drawbacks when estimated by ordinary least squares (OLS). Since the
regression line does not provide a good fit to the data, the usual measures of fit, such as the R2
statistic, cannot be relied upon. The technique has additional flaws as well:
First, any model estimated with the LPM method will suffer from heteroskedasticity.
Second, since the LPM estimates probabilities, and a probability cannot exceed 1 or fall below 0, the
LPM may nevertheless produce fitted values that are greater than 1 or less than 0.
Third, the error term in such a model is highly unlikely to be normally distributed.
Fourth, and most significantly, the model's variables are probably not related in a linear
fashion. This indicates that a different sort of regression line, such as an S-shaped curve, is required to fit
the data more precisely.
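To illustrate the second drawback, the following sketch fits both a linear probability model (OLS on a 0/1 outcome) and a logit model to the same simulated data. It is a minimal sketch, assuming the numpy and statsmodels libraries are available; the variable names (x, owns_home) are hypothetical and purely illustrative. The OLS fitted values can fall outside the [0, 1] interval, whereas the logit fitted probabilities cannot.

# A minimal sketch, assuming numpy and statsmodels are installed.
# Variable names (x, owns_home) are hypothetical, for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                        # a continuous regressor (e.g., standardized income)
p = 1 / (1 + np.exp(-(-0.5 + 2.0 * x)))       # true probability of "success"
owns_home = rng.binomial(1, p)                # dummy dependent variable: 1 = success, 0 = failure

X = sm.add_constant(x)

# Linear Probability Model: OLS applied directly to the 0/1 outcome
lpm = sm.OLS(owns_home, X).fit()
print("LPM fitted values outside [0, 1]:",
      np.sum((lpm.fittedvalues < 0) | (lpm.fittedvalues > 1)))

# Logit model: fitted probabilities always lie between 0 and 1
logit = sm.Logit(owns_home, X).fit(disp=False)
print("Logit fitted probabilities outside [0, 1]:",
      np.sum((logit.predict(X) < 0) | (logit.predict(X) > 1)))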
In earlier chapters we created and interpreted dummy independent variables in regressions. We have seen how
0/1 variables such as Female (1 if female, 0 if male) can be used to test for wage discrimination.
These variables have either/or values with nothing in between. Up to this point, however, the
dependent variable Y has always been essentially a continuous variable. That is, in all the regressions
we have seen thus far, from our first regression using SAT scores to the many earnings function
regressions, the Y variable has always taken on many possible values.
This chapter discusses models in which the dependent variable (i.e., the variable on the left-hand
side of the regression equation, which is the variable being predicted) is a dummy or dichotomous
variable. This kind of model is often called a dummy dependent variable (DDV), binary response,
dichotomous choice, or qualitative response model.
Dummy dependent variable models are difficult to handle with our usual regression techniques and
require some rather sophisticated econometrics. In keeping with our teaching philosophy, we
present the material with a heavy emphasis on intuition and graphical analysis. In addition, we focus
on the box model and the source of the error term. Finally, we continue to rely on Monte Carlo
simulation in explaining the role of chance. Although the material remains difficult, we believe our
approach greatly increases understanding.
What exactly is a dummy dependent variable model? That question is easy to answer. In a dummy
dependent variable model, the dependent variable (also known as the response, left-hand side, or Y
variable) is qualitative, not quantitative.
Yearly Income is a quantitative variable; it might range from zero dollars per year to millions of
dollars per year. Similarly, the Unemployment Rate is a quantitative variable; it is defined as the
number of people unemployed divided by the number of people in the labor force in a given location
(county, state, or nation). This fraction is expressed as a percentage (e.g., 4.3 or 6.7 percent). A
scatter diagram of unemployment rate and income is a cloud of points with each point representing
a combination of the two variables.
On the other hand, whether you choose to emigrate is a qualitative variable; it is 0 (do not emigrate)
or 1 (do emigrate). A scatter diagram of Emigrate and the county Unemployment Rate would not be
a cloud. It would be simply two strips: one horizontal strip for various county unemployment rates
for individuals who did not emigrate and another horizontal strip for individuals who did emigrate.
The political party to which you belong is a qualitative variable; it might be 0 if Democrat, 1 if
Republican, 2 if Libertarian, 3 if Green Party, 4 if any other party, and 5 if independent. The numbers
are arbitrary. The average and SD of the 0, 1, 2, 3, 4, and 5 are meaningless. A scatter diagram of
Political Party and Yearly Income would have a horizontal strip for each value of political party.
When the qualitative dependent variable has exactly two values (like Emigrate), we often speak of
binary choice models. In this case, the dependent variable can be conveniently represented by a
dummy variable that takes on the value 0 or 1. If the qualitative dependent variable can take on
more than two values (such as Political Party), the model is said to be multiresponse or multinomial
or polychotomous. Qualitative dependent variable models with more than two values are more
difficult to understand and estimate. They are beyond the scope of this book.
Figure 4.10 gives more examples of applications of dummy dependent variables in economics. Notice that
many variables are dummy variables at the individual level (like Emigrate or Unemployed), although
their aggregated counterparts are continuous variables (like emigration rate or unemployment rate).
Fig. 4.10 Dummy Dependent Variables
The careful student might point out that some variables commonly considered to be continuous, like
income, are not truly continuous because fractions of pennies are not possible. Although technically
correct, this criticism could be leveled at any observed variable and for practical purposes is
generally ignored. There are some examples, however, like educational attainment (in years of
schooling), in which determining whether the variable is continuous or qualitative is not so clear.
The definition of a dummy dependent variable model is quite simple: If the dependent, response,
left-hand side, or Y variable is a dummy variable, you have a dummy dependent variable model. The
reason dummy dependent variable models are important is that they are everywhere. Many
individual decisions of how much to do something require a prior decision to do or not do at all.
Although dummy dependent variable models are difficult to understand and estimate, they are
worth the effort needed to grasp them.
Dependent and Independent variables are variables in mathematical modeling, statistical modeling
and experimental sciences. Dependent variables receive this name because, in an experiment, their
values are studied under the supposition or demand that they depend, by some law or rule (e.g., by
a mathematical function), on the values of other variables. Independent variables, in turn, are not
seen as depending on any other variable in the scope of the experiment in question. In this sense,
some common independent variables are time, space, density, mass, fluid flow rate, and previous
values of some observed value of interest (e.g. human population size) to predict future values (the
dependent variable).
Of the two, it is always the dependent variable whose variation is being studied, by altering inputs,
also known as regressors in a statistical context. In an experiment, any variable that can be
attributed a value without attributing a value to any other variable is called an independent variable.
Models and experiments test the effects that the independent variables have on the dependent
variables. Sometimes, even if their influence is not of direct interest, independent variables may be
included for other reasons, such as to account for their potential confounding effect.
LOGIT MODEL
A link function is simply a function of the mean of the response variable Y that we use as the
response instead of Y itself.
All that means is that when Y is categorical, we use the logit of Y as the response in our regression
equation instead of Y itself.
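In standard notation (a generic textbook formulation, not taken from the monograph discussed below), the logit link writes the log-odds of the outcome as a linear function of the regressors:

\[ \operatorname{logit}(p_i) = \ln\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki}, \qquad p_i = \Pr(Y_i = 1 \mid X_i), \]

which is equivalent to

\[ p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})}} . \]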
In this monograph, Dr. DeMaris begins by describing the logit model in the context of the general
loglinear model, moving its application from two-way to multidimensional tables. In the first half of
the book, contingency table analysis is developed, aided by effective use of data from the General
Social Survey for 1989.
As long as the variables are measured at the nominal or ordinal levels, the cross-tab format for logit
modeling works well. However, if independent variables are continuous, then the more
disaggregated logistic regression technique is favored. . . . A data example explores the relationship
of three continuous explanatory variables—population size, population growth, and literacy—to
the log odds of a high murder rate (in a sample of 54 cities). . . . Besides a comparative discussion of
the substantive interpretation of coefficients (odds versus probabilities), DeMaris describes
significance testing and goodness-of-fit measures for logistic regressions, not to mention the
modeling of nonlinearity and interaction effects.
In the final chapter, logistic regression is extended to dependent variables with more than two
categories, categories that may be either nominal or ordinal. The extension to polytomous logistic
regression allows researchers to forsake the inefficiency of ordinary regression in such a case, as
well as to avoid turning to discriminant analysis, with its unrealistic multivariate normal
assumption.
In sum, logit modeling achieves a general purpose, serving whenever the measurement
assumptions for classical multiple regression fail to be met, for either independent or dependent
variables. (PsycINFO Database Record (c) 2016 APA, all rights reserved)
Applications
In medicine, logit models make it possible to identify the factors that characterize a group of sick
subjects as compared to healthy subjects.
In the field of insurance, they make it possible to target the fraction of customers who will
be sensitive to an insurance policy on a particular risk.
In banking, they help detect risk groups when granting a loan.
In econometrics, they are used to explain a discrete variable, for example voting intentions in elections.
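To make the "odds versus probabilities" interpretation mentioned above concrete, here is a minimal sketch, assuming numpy and statsmodels are available; the data are simulated and the variable names (risk_factor, sick) are hypothetical. Exponentiating a logit coefficient gives an odds ratio.

# A minimal sketch, assuming numpy and statsmodels; data and names are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
risk_factor = rng.normal(size=n)                      # continuous explanatory variable
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * risk_factor)))     # true probability of being a "case"
sick = rng.binomial(1, p)                             # 1 = sick, 0 = healthy

X = sm.add_constant(risk_factor)
fit = sm.Logit(sick, X).fit(disp=False)

# exp(coefficient) is the multiplicative change in the odds of Y = 1
# associated with a one-unit increase in the regressor
print(np.exp(fit.params))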
Probit Model
In statistics, a probit model is a type of regression where the dependent variable can take only two
values, for example married or not married. The word is a portmanteau, coming from probability +
unit. The purpose of the model is to estimate the probability that an observation with particular
characteristics will fall into a specific one of the categories; moreover, classifying observations
based on their predicted probabilities is a type of binary classification model.
A probit model is a popular specification for a binary response model. As such it treats the same set
of problems as does logistic regression, using similar techniques. When viewed in the generalized
linear model framework, the probit model employs a probit link function. It is most often estimated
using the maximum likelihood procedure, such an estimation being called a probit regression.
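In the generalized linear model notation used above, the probit specification and the log-likelihood maximized in probit regression can be written in the standard textbook form (symbols are generic):

\[ \Pr(Y_i = 1 \mid x_i) = \Phi(x_i'\beta), \]

where \( \Phi(\cdot) \) is the standard normal cumulative distribution function, and

\[ \ln L(\beta) = \sum_{i=1}^{n} \Big[\, y_i \ln \Phi(x_i'\beta) + (1 - y_i)\ln\big(1 - \Phi(x_i'\beta)\big) \Big]. \]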
Tobit Model
In statistics, a Tobit model is any of a class of regression models in which the observed range of the
dependent variable is censored in some way. The term was coined by Arthur Goldberger in
reference to James Tobin, who developed the model in 1958 to mitigate the problem of zero-
inflated data for observations of household expenditure on durable goods. Because Tobin's method
can be easily extended to handle truncated and other non-randomly selected samples, some
authors adopt a broader definition of the tobit model that includes these cases.
Tobin's idea was to modify the likelihood function so that it reflects the unequal sampling
probability for each observation, depending on whether the latent dependent variable fell above or
below the determined threshold. For a sample that, as in Tobin's original case, was censored from
below at zero, the sampling probability for each non-limit observation is simply the height of the
appropriate density function. For any limit observation, it is the cumulative distribution, i.e.
the integral below zero of the appropriate density function. The Tobit likelihood function is thus a
mixture of densities and cumulative distribution functions.
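For a sample censored from below at zero, this mixture can be written explicitly in the standard form (a generic formulation, with latent model \( y_i^* = x_i'\beta + \varepsilon_i \), \( \varepsilon_i \sim N(0,\sigma^2) \), and observed \( y_i = \max(0, y_i^*) \)):

\[ L(\beta,\sigma) = \prod_{y_i > 0} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) \; \prod_{y_i = 0} \Phi\!\left(-\frac{x_i'\beta}{\sigma}\right), \]

where \( \phi \) and \( \Phi \) are the standard normal density and cumulative distribution functions: the first product covers the non-limit observations (densities) and the second covers the limit observations (cumulative probabilities).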
Applications
Tobit models have, for example, been applied to estimate factors that impact grant receipt,
including financial transfers distributed to sub-national governments who may apply for these
grants. In these cases, grant recipients cannot receive negative amounts, and the data is thus left-
censored. For instance, Dahlberg and Johansson (2002) analyze a sample of 115 municipalities (42
of which received a grant). Dubois and Fattore (2011) use a tobit model to investigate the role
of various factors in European Union fund receipt by Polish sub-national governments applying for
these funds. The data may, however, be left-censored at a point higher than zero, with the risk of
mis-specification. Both studies apply Probit and other models to check for robustness. Tobit models have also been
applied in demand analysis to accommodate observations with zero expenditures on some goods.
In a related application of Tobit models, a system of nonlinear Tobit regression models has been
used to jointly estimate a brand demand system with homoscedastic, heteroscedastic and
generalized heteroscedastic variants.
Quantal responses and limited responses are two useful categories for grouping together variables
that cannot be treated by ordinary regression analysis, the primary instrument of econometrics. The
methods of analysis known as probit and logit are applicable to dichotomous, qualitative, and
categorical outcomes, which fall within the quantal response (all or nothing) group. Decisions
such as buying a home versus renting one, selecting a mode of transportation, and choosing a career
path are all examples. Variables with both discrete and continuous outcomes fall under the
limited response category, with tobit being the standard model and analysis tool for this kind of
data. Durable-goods spending samples contain both limit (zero) and positive observations, and price
data in markets with price ceilings contain both limit and non-limit observations. While the tobit model and the
limited and quantal response approaches all have their roots in the probit model, they are
distinct enough from one another to warrant individual consideration.
4.10 SUMMARY
At times it is desirable to have independent variables in the model that are qualitative rather
than quantitative. This is easily handled in a regression framework. Regression uses qualitative
variables to distinguish between populations. There are two main advantages of fitting both
populations in one model. You gain the ability to test for different slopes or intercepts in the
populations, and more degrees of freedom are available for the analysis.
Regression with qualitative variables is different from analysis of variance and analysis of
covariance. Analysis of variance uses qualitative independent variables only. Analysis of covariance
uses quantitative variables in addition to the qualitative variables in order to account for
correlation in the data and reduce MSE; however, the quantitative variables are not of primary
interest and merely improve the precision of the analysis.
For modelling discrete outcomes, the logit model is a popular choice. It can be used either for a
binary outcome (a value of 0 or 1) or for an outcome with three or more possible values
(multinomial logit). The logit model is based on the logistic distribution (which can be derived from
Gumbel-distributed errors in a random-utility framework) and is often favored for large sample sizes.
In the binary (0 and 1) case, probit models are largely equivalent to logit models. However, their
behaviour changes considerably when dealing with three or more outcomes (usually a ranking or ordering).
With only one regression equation to work with, only the "extreme" (top and bottom)
categories may be used to draw firm conclusions about marginal impacts.
Tobit models are quite different: the outcome is neither binary nor discrete. The Tobit model takes
the form of a linear regression and is employed to regress a continuous dependent variable that has
a unimodal distribution. In particular, it permits regression on a censored continuous dependent
variable. With this method, the analyst keeps the linear assumptions needed for linear regression
but specifies a lower (or upper) threshold at which the dependent variable is censored.
When the dependent variable in a regression model is a dichotomous event, the logit or probit
model is typically utilized.
The probit model is based on the normal distribution, while the logistic distribution is used in the
logit model.
The logistic distribution has fatter tails than the normal distribution.
Since stock returns typically have fat tails, the logistic distribution is frequently used to analyse
their behaviour.
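A quick numerical check of this tail comparison is sketched below, assuming the scipy library is available; the logistic distribution is rescaled to unit variance so that its tails are directly comparable to the standard normal.

# A minimal sketch, assuming scipy and numpy are installed.
import numpy as np
from scipy.stats import norm, logistic

scale = np.sqrt(3) / np.pi          # rescales the logistic distribution to variance 1
for z in (2, 3, 4):
    # upper-tail probabilities: the logistic value exceeds the normal one at each z
    print(z, norm.sf(z), logistic.sf(z, scale=scale))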
Probit theory is grounded in utility theory or the rational choice perspective on human behaviour.
Basic Econometrics by Gujarati is a good resource for learning more about these types of models.
For example, in adoption models (with a dichotomous dependent variable), the logit and probit models are
typically employed in the first hurdle of a double hurdle model, whereas the Tobit model is typically used
in the second hurdle, where "actual" values are the dependent variable instead of a simple
yes/no choice. Farmers in a certain area may be polled to determine whether they will switch to
hybrid maize seeds (the answers are yes or no, and a logit or probit model is used depending on the
assumed error distribution); this is the first hurdle. Those who answer yes are then asked the exact
amount they are willing to pay for this seed, and the amount they will pay serves as the dependent
variable in a Tobit model; this is the second hurdle. More information about adoption models is
available in the literature.
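Under these assumptions, the double-hurdle setup can be summarized as follows (a standard generic formulation; the symbols are illustrative, not taken from a particular study):

First hurdle (adoption decision, probit): \( D_i = 1\big[z_i'\gamma + u_i > 0\big], \quad u_i \sim N(0,1). \)

Second hurdle (amount, Tobit): \( y_i^* = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0,\sigma^2), \) with the observed amount \( y_i = y_i^* \) only if \( D_i = 1 \) and \( y_i^* > 0 \), and \( y_i = 0 \) otherwise.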
4.11 KEYWORDS
Binary Variable - A binary variable is the same concept as a "truth value" in mathematical logic
or a "bit" in computer science. Just as statisticians refer to the bell curve as a
"normal distribution" while physicists refer to it as a "Gaussian distribution," these are
different names for the same thing.
_________________________________________________________________________________
___________________________________________________________________
3. Explain Logit Model in detail
_________________________________________________________________________________
___________________________________________________________________
A. Descriptive Questions
Short Questions
Long Questions
c. The coefficient of correlation is not dependent on both the change of scale and change of
origin
d. None of these.
d. None of these
a. regressor
b. regressed
c. predictand
d. estimated
a. correlation problem
b. association problem
c. regression problem
d. Qualitative problem
a. 0
b. 0.5
c. 1
d. 1.5
Answers
4.14 REFERENCES
Gujarati, D., Porter, D. C. and Gunasekhar, C. (2012). Basic Econometrics (Fifth Edition).
McGraw Hill Education.
Anderson, D. R., Sweeney, D. J. and Williams, T. A. (2011). Statistics for Business and
Economics (12th Edition). Cengage Learning India Pvt. Ltd.
Wooldridge, Jeffrey M. (2007). Introductory Econometrics: A Modern Approach (Third Edition).
Thomson South-Western.
Johnston, J. (1994). Econometric Methods (3rd Edition). McGraw Hill, New York.
Ramanathan, Ramu (2002). Introductory Econometrics with Applications. Harcourt Academic
Press (IGM Library Call No. 330.0182 R14I).
Koutsoyiannis, A. The Theory of Econometrics (2nd Edition). ESLB.