Logit and Probit Models Explained

This handout covers the Logit and Probit models, which are classification algorithms used for binary response variables. It explains the differences between these models and linear regression, emphasizing the use of Maximum Likelihood Estimation (MLE) to uncover unknown parameters. Additionally, the document provides examples of applying these models in data mining and machine learning, including steps for model building and evaluation.


Handout #6

Title: Applied Data Mining & Machine Learning Spring/2025


Course: 220:422/219:531 Instructor: Dr. I-Ming Chiu

Reading: PMLR, Chapter 5, pp. 165~219

In this handout, we will learn another classification algorithm, the "Logit (or
Logistic) model". We will also cover its cousin, the "Probit model", which differs
from the Logit model in only one assumption. Both models are traditionally used as statistical
models in addition to their classification applications. The structure of the Probit/Logit
model is very similar to the Linear Regression model; the difference is that the target
variable in the Probit/Logit model is categorical (i.e., binary) instead of continuous. The
problem of uncovering the unknown parameters in the Probit/Logit model is twofold: first,
the target variable is categorical with two levels (0 and 1); second, we can no longer assume
the error term is normally distributed. To uncover the unknown parameters in these models,
we rely on the Maximum Likelihood Estimation (abbreviated as MLE) method (see
Appendix 01 at the end of this document for more information).

Linear Probability, Probit and Logit Model


Before exploring the application of logit/probit models in machine learning, it is important
to understand what they are. Logit and Probit models are statistical models used to analyze
the association between a binary response variable (i.e., a variable that can take one of two
values) and one or more independent variables (features). These models are similar to linear
regression models, but they are specifically designed for binary response variables. For
example, logit/probit models can be used to analyze whether a passenger on the Titanic
survived or not, or whether an applicant was admitted into the MSDS program at RU-
Camden.

The challenge when using a traditional linear regression model with a binary response
variable is that the model assumes that the response variable is continuous and normally
distributed, which is not the case for a binary target variable. Logit and Probit models
overcome this challenge by modeling the probability of the response variable taking a
particular value (usually encoded as 1) as a function of the independent variables. In other
words, logit/probit models estimate the likelihood of the binary response variable being 1
based on the values of the independent variables. This is done by transforming the binary
response variable into a continuous variable using a “LINK” function, such as the logistic or
probit function.

By using these models, we can determine the relationship between the independent variables
and the likelihood of the binary response variable taking the value of 1. This allows us to
make predictions about the binary response variable based on the values of the independent
variables, and to understand the factors that influence the outcome of interest.

Generalized Linear Models (GLM)

E(Y) = μ; if Y is a continuous random variable, then we can denote its mean as "μ"
Xβ = β0 + β1*X (the linear predictor, which a link function connects to E(Y))
Recall the linear regression model: Y = E(Y) + ε = β0 + β1*X + ε, where ε ~ NIID(0, σ²)

Can we equate the mean of a "categorical" variable to Xβ (μ = Xβ) to form the model,
μ = E(Y) = β0 + β1*X, just like what we did in the linear regression model?

Y = β0 + β1*X … Linear Probability model (1)

Y = 1 with probability P
Y = 0 with probability 1 – P

E(Y) = 1*P + 0*(1 – P) = P; Var(Y) = P * (1 – P)

The above equation states that the mean of Y is interpreted as the chance to observe "1".
Problems with the linear link:
(1) 0 ≤ E(Y) ≤ 1, therefore the value of the linear predictor must fall in this range
(2) The error term, ε, cannot be assumed to have a normal distribution

Therefore, a transformation on Xβ is needed:

E(Y) (= P) = g(Xβ) … we want to restrict the outcome of Xβ to between 0 and 1

P = Φ(Z)¹ = ∫₋∞^Z (1/√(2π)) * exp(−t²/2) dt … Probit model (2)

P = Λ(Z) = exp(Z) / (1 + exp(Z)) … Logit model (3)

Where Z = β0 + β1*X
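To make the two link functions concrete, here is a short R sketch showing that pnorm() implements the Probit link in equation (2) and plogis() the Logit link in equation (3). The coefficient and feature values are made-up illustrations, not estimates:

```r
# Hypothetical coefficients for illustration only (not estimated from any data)
b0 <- -1.5
b1 <- 0.002
x  <- 650                 # a feature value, e.g. a GRE score
z  <- b0 + b1 * x         # the linear predictor Z = b0 + b1*X

p.probit <- pnorm(z)      # Probit: P = Phi(Z), the standard normal CDF
p.logit  <- plogis(z)     # Logit:  P = exp(Z) / (1 + exp(Z))

# plogis() matches the explicit formula in equation (3)
p.check <- exp(z) / (1 + exp(z))
```

Both links map any real value of Z into (0, 1); they differ only in how quickly the probability approaches 0 and 1 in the tails.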

We’ll focus on how to interpret the results in both the Probit and Logit model.

Probit Model:

∂P/∂X = Φ′(β0 + β1*X) * β1 = φ(β0 + β1*X) * β1 (marginal effect) (4)

According to equation (4), the effect of a one-unit change in X on P depends on two things:

(1) The estimates of β0 and β1
(2) The level of X (which gives rise to the probability density φ(β0 + β1*X))²

1 The symbol "Φ" represents the cumulative distribution function (CDF); it is the same as the pnorm() command.

We will do the following exercise to get a better understanding on the concept of using
Probit model.

Logit Model:
The response of the Logit model requires some transformation, so the interpretation of the
outcome is different from the Probit model. Here is the transformation procedure:

P = exp(Z) / (1 + exp(Z))  ⇒  P / (1 – P) = exp(Z)  ⇒  log(P / (1 – P)) = Z

log(P / (1 – P)) = β0 + β1*X (3')

Where the term P / (1 – P) in equation (3') is called the "Odds"; meaning if X increases by
one unit (ΔX = 1), then the log of odds increases by β1 units (or the Odds Ratio equals e^β1).
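The chain of identities above can be verified numerically in R (the value of Z is arbitrary):

```r
z <- 0.7                       # an arbitrary value of the linear predictor
p <- exp(z) / (1 + exp(z))     # equation (3): the logistic transformation

odds <- p / (1 - p)            # the "Odds"; algebraically this equals exp(z)
log(odds)                      # recovers z, as in equation (3')
```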

Figure 1

Q: what does dnorm(0) (= 39.89%) mean exactly³? Answer: it is the slope of Φ at Z = 0; for
example, what is the effect on Φ if Z increases from 0 to 0.0001? (pnorm(0.0001) – pnorm(0)) / 0.0001

2 φ = Φ′(β0 + β1*X); the derivative of the CDF, which is the probability density function (PDF).

3 Recall the "likelihood" interpretation. Here is another interpretation.
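The question above can be checked directly in R: the finite-difference slope of pnorm() at 0 is approximately dnorm(0), because the PDF is the derivative of the CDF:

```r
h <- 0.0001
slope <- (pnorm(0 + h) - pnorm(0)) / h  # finite-difference slope of the CDF at 0
dnorm(0)                                # the PDF at 0, about 0.3989 (i.e., 39.89%)
# slope and dnorm(0) agree to several decimal places
```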

Figure 2 (refer to equations (2) & (3) above)

Example: A researcher is interested in how variables such as GRE (Graduate Record Exam
scores), GPA (grade point average) and prestige of the undergraduate institution affect
admission into graduate school. The response variable, admit/don't admit, is a binary
variable. In this example we only use one independent variable, GRE.
(Source: [Link])

(A) Linear Probability Model (LPM)


> da = read.csv("[Link]", header = T) # read the data directly from the web site by
# indicating the url
> attach(da)
> names(da)
[1] "admit" "gre" "gpa" "rank"
> dim(da)
[1] 400 4
> head(da, 3)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1

> da20 = da[1:20, ]
> [Link] = da20[da20$gre < 750, ] # remove some observations with large GRE scores

Figure 3

Removing some observations where the students have high GRE scores results in a steeper
regression line (the red one). The fitted probability, as the green points reveal, can be
smaller than zero. On the contrary, the fitted probability can be greater than one for
large GRE scores.

> summary(da$gre)
Min. 1st Qu. Median Mean 3rd Qu. Max.
220.0 520.0 580.0 587.7 660.0 800.0
> LPM = lm(admit ~ gre) # Linear Probability Model
> summary(LPM)

Call:
lm(formula = admit ~ gre)

Residuals:
Min 1Q Median 3Q Max
-0.4755 -0.3415 -0.2522 0.5989 0.8966

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1198   0.1191    -1.007  0.3147
gre          0.0007   0.0002     3.7440 0.0002 ***
# 0.074% better chance of getting accepted if the GRE score is one point higher
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4587 on 398 degrees of freedom


Multiple R-squared: 0.03402, Adjusted R-squared: 0.03159
F-statistic: 14.02 on 1 and 398 DF, p-value: 0.0002081

(B) Probit Model

> myprobit <- glm(admit ~ gre, family = binomial(link = "probit"), data = da) # "glm" stands
# for generalized linear model
> summary(myprobit)

Call:
glm(formula = admit ~ gre, family = binomial(link = "probit"),
data = da)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.1583 -0.9072 -0.7551 1.3483 2.0114

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7682 0.3579 -4.940 7.82e-07 ***
gre 0.0022 0.0006 3.6980 0.0002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)

Null deviance: 499.98 on 399 degrees of freedom


Residual deviance: 485.99 on 398 degrees of freedom
AIC: 489.99
Number of Fisher Scoring iterations: 4

> (accept.p = pnorm(-1.7682 + 0.0022*650)) # refer to equation (2)
[1] 0.3676062
# the probability of getting accepted is 36.76% if the applicant has a GRE score of 650

> (marginal.p = dnorm(-1.7682 + 0.0022*650)*0.0022) # refer to equation (4)
[1] 0.0008288875
# the probability of getting accepted increases by 0.083% if the GRE score increases by one
# point, given that the original GRE = 650

> mylogit = glm(admit ~ gre, family = binomial(link = "logit"), data = da)
> summary(mylogit)

Call:
glm(formula = admit ~ gre, family = binomial(link = "logit"),
data = da)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.1623 -0.9052 -0.7547 1.3486 1.9879
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
gre 0.003582 0.000986 3.633 0.00028 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 499.98 on 399 degrees of freedom


Residual deviance: 486.06 on 398 degrees of freedom
AIC: 490.06

Number of Fisher Scoring iterations: 4

> (accept.p = plogis(-2.901344 + 0.003582*650)) # refer to equation (3)
[1] 0.3605347
# the probability of getting accepted is 36.05% if the applicant has a GRE score of 650

> (marginal.p = dlogis(-2.901344 + 0.003582*650)*0.003582) # refer to equation (4)
[1] 0.0008258281
# the probability of getting accepted increases by 0.083% if the GRE score increases by one
# point, given that the original GRE = 650

How large is the odds ratio?

Answer: e^0.003582 = 1.003588 … refer to equation (3')

Apply Probit/Logit Model in Data Mining/Machine Learning

Step 1: Split the data into training/testing


Step 2: Build the Logit or Probit model using the training data.
Step 3: Plug the feature values from the training/testing data into the estimated model to
compute the probability of success ("1").
Step 4: If the probability is greater than 0.5 (the cut-off), classify the observation as "1".
You can "tune" (i.e., adjust) the cut-off value based on some criterion, e.g. a larger recall
value.
Step 5: Create the “Confusion Matrix” to examine the accuracy of the model in both the
training and test data. This matrix is used to compare the predicted outcome and the actual
outcome of the response variable in both data sets. You have to find a balance of model
performance in both training and test data.
Step 6: Improve the model by tuning the parameter in step 4 (or build a new model).
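The six steps above can be sketched in R. This is a minimal illustration using simulated data in place of the admission data set; the variable names, the simulated coefficients, and the 70/30 split are all assumptions made for the example:

```r
set.seed(1)

# Illustration only: simulate a binary outcome driven by one feature
n  <- 400
x  <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
da <- data.frame(admit = y, score = x)

# Step 1: split the data into training/testing (70/30)
idx   <- sample(n, size = 0.7 * n)
train <- da[idx, ]
test  <- da[-idx, ]

# Step 2: build the Logit model on the training data
fit <- glm(admit ~ score, family = binomial(link = "logit"), data = train)

# Step 3: compute the probability of success ("1") for the test data
p.hat <- predict(fit, newdata = test, type = "response")

# Step 4: classify with a 0.5 cut-off (tune this, e.g. for a larger recall)
y.hat <- ifelse(p.hat > 0.5, 1, 0)

# Step 5: confusion matrix comparing predicted vs. actual outcomes
cm <- table(predicted = y.hat, actual = test$admit)
accuracy <- mean(y.hat == test$admit)
```

Running the same Steps 3–5 on the training data and comparing the two confusion matrices is how you check the balance of model performance mentioned in Step 5.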

Appendix 01: Maximum Likelihood Estimation (MLE)

Likelihood function = L(X1, X2, … , Xn | θ); there can be more than one θ

e.g. Toss a coin ten times and obtain 2 heads. What is the estimate of P?

Maximize over P:  L = C(10, 2) * P² * (1 – P)⁸

Take the log on both sides; since the logarithm is a monotonic function, maximizing log(L)
also maximizes L:

Maximize over P:  log(L) = 2*log(P) + 8*log(1 – P)  [ignore log C(10, 2) because it is a constant]


We can generalize the above likelihood function as follows:

Maximize over P:  L = C(n, k) * P^k * (1 – P)^(n – k); drop the C(n, k) term because it is a constant.

Take the log of the above likelihood function and obtain:

Maximize over P:  log(L) = k*log(P) + (n – k)*log(1 – P)

The necessary condition to achieve optimization is:

dlog(L)/dP = k*(1/P) – (n – k)*(1/(1 – P)) = 0  ⇒  solving for P gives P̂ = k/n

The above formula, P̂ = k/n, is termed the Maximum Likelihood Estimator.
[Recall that the estimator used to uncover β in the simple linear regression model is derived
using the Least Squares Estimation method.]

Plugging in k = 2 and n = 10 from the numerical example, we obtain P̂ = 2/10 = 0.2

Think: How do we estimate the Logit model using the MLE method?
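One way to think about the question above: each observation contributes P_i^y_i * (1 – P_i)^(1 – y_i) to the likelihood, where P_i = plogis(β0 + β1*x_i), and MLE maximizes the summed log-likelihood over (β0, β1). The sketch below, on simulated data with made-up true coefficients, uses optim() as a generic stand-in for glm()'s Fisher scoring and checks that both routes reach (nearly) the same estimates:

```r
set.seed(42)

# Simulate data from a known Logit model (b0 = -1, b1 = 0.8 are illustrative)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

# Negative log-likelihood of the Logit model
negloglik <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Maximize the likelihood (i.e., minimize its negative) over (b0, b1)
mle <- optim(c(0, 0), negloglik)$par

# glm() solves the same MLE problem; the two estimates nearly agree
fit <- glm(y ~ x, family = binomial(link = "logit"))
```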

Appendix 02: How Does the Logistic Classifier Work⁴?

A random sample of 30 observations was selected to illustrate how the Logistic Classifier
works.

[Figure: P = 0.2996]

Note: the code to generate the above figure can be found in the "422_Appendix2.R" file.

4 The same concept can be applied to the Probit classifier.
