Logistic Regression:
An Important Tool for Decision-Making
PGP DSE BANGALORE
July 2018
What is Logistic Regression?
Regression relates a response to a set of predictors
When the response is continuous and there is more than one predictor, the technique is Multiple Regression
In Multiple Regression, the predictors may be discrete or continuous
What happens when the response is binary?
Among a group of loan applicants, is a person a good credit risk or a bad one?
Given an income level, will a person buy an iPhone or not?
More Examples?
• Software project completion on time: Yes/No
• Marketing: Given a price point, will an item be sold?
• Finance, banking: Will a stock gain? Should I give a loan to the applicant?
• CRM:
• Retail:
• Healthcare:
• Elsewhere?
German Credit Data
Description
Creditability: whether a loan is Good (1) or Bad (0) (Y: response)
Credit Amount: amount asked for in the loan (in DM) (predictor)
Plot: Creditability versus Credit Amount
Scatterplot
What to model?
How to model?
Logistic Regression
• Logistic regression is a statistical method for analyzing a dataset in
which there are one or more independent variables that determine an
outcome. The outcome is a dichotomous variable (in which there are
only two possible outcomes).
• The outcome/response only contains data coded as 1 (TRUE,
success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).
• The goal of logistic regression is to find the best fitting model to
describe the relationship between the dichotomous characteristic of
interest (dependent variable = response or outcome variable) and a
set of independent (predictor or explanatory) variables.
Response: Probability
Binary response: Y = Success or Failure (1 or 0)
Model:
π = Pr(Y = Success | X) as a linear function of the predictor X?
A probability must satisfy 0 ≤ π ≤ 1
π = α + βX
Prob(Good Credit) = α + β (Credit Amount)
Will it work?
What is an obvious drawback?
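The drawback is easy to demonstrate numerically: any non-flat straight line eventually leaves the interval [0, 1]. A minimal sketch in Python (the coefficients below are illustrative, not fitted from the German credit data):

```python
# A straight line used as a probability model: alpha and beta are
# illustrative values only, not estimates from the credit data.
alpha, beta = 0.8, -0.0001

def linear_prob(amount):
    """'Probability' from a linear model: p = alpha + beta * amount."""
    return alpha + beta * amount

print(linear_prob(1000))    # 0.7  -- a legal probability
print(linear_prob(20000))   # -1.2 -- impossible: probabilities cannot be negative
```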
Response: Logit(Probability)
logit(π) = log(π / (1 − π)), 0 < π < 1
logit(π) is a continuous function on the real line
π: success probability
π / (1 − π): odds of success
logit(π): log odds of success
Logistic regression:
logit(π) = α + βX
Logit(Probability of Y = 1) is modeled as a linear function of the predictors
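The mapping between probabilities and log odds can be sketched in a few lines (Python used here for illustration, although the slides work in R):

```python
import math

def logit(p):
    """Log odds of a success probability p, 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of logit: maps any real z back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# logit is unbounded, so alpha + beta * X can take any real value,
# while inv_logit guarantees the fitted probability stays in (0, 1).
print(logit(0.5))                 # 0.0: even odds
print(inv_logit(logit(0.73)))     # recovers 0.73: the two are inverses
```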
Response: Logit (Probability)
Non-normal errors
Non-constant error variance
No explicit error term associated with the regression equation
Rationale
• Credit-worthiness depends on the suggested predictors, e.g., the amount of credit
• At different level combinations of the predictors, a randomly chosen credit applicant has a different probability of being a defaulter (i.e., being non-credit-worthy)
• Hence π is a function of X
Shape of the Logistic Curve
For a single predictor:
X-axis: predictor (x)
Y-axis: π(x)
π(x), being a probability, is bounded between 0 and 1
The regression coefficients (α, β) change the location and slope of the curve
Logit Functions
Credit-worthiness on Credit Amount
Does Credit-worthiness depend on Credit Amount?
To check empirically whether creditworthiness depends on credit amount
Useful R Commands for Tabulation
cutpoint <- c(0, 500, 1000, 1500, 2000, 2500, 5000, 7500, 10000, 15000, 20000)
Credit_cat <- with(German_Credit, cut(`Credit Amount`, cutpoint, right = TRUE))
table(Credit_cat)
Table1 <- with(German_Credit, table(Credit_cat, Creditability))
Table2 <- prop.table(Table1, 1)   # row-wise proportions
Table3 <- cbind(Table2, table(Credit_cat))
round(Table3, 2)
Compare Creditworthiness
Counts of Creditability (0 = bad, 1 = good) by credit-amount category:

Credit_cat        0    1
(0, 500]          3   15
(500, 1000]      34   64
(1000, 1500]     51  139
(1500, 2000]     33   93
(2000, 2500]     26   79
(2500, 5000]     75  200
(5000, 7500]     34   68
(7500, 10000]    20   26
(10000, 15000]   21   14
(15000, 20000]    3    2

Margins of a table:
margin.table(): total
margin.table(, 1): row margins
margin.table(, 2): column margins

Proportions of a table:
prop.table(, 1): row proportions
prop.table(, 2): column proportions
Credit-worthiness across Amount (Cat)
Credit_cat       Prop(0)  Prop(1)    n
(0, 500]          0.17     0.83     18
(500, 1000]       0.35     0.65     98
(1000, 1500]      0.27     0.73    190
(1500, 2000]      0.26     0.74    126
(2000, 2500]      0.25     0.75    105
(2500, 5000]      0.27     0.73    275
(5000, 7500]      0.33     0.67    102
(7500, 10000]     0.43     0.57     46
(10000, 15000]    0.60     0.40     35
(15000, 20000]    0.60     0.40      5
Credit-worthiness across Amount (Cat)
[Chart: Prop(Y = 1) across categorized loan amount (categories 1-10; y-axis from 0.40 to 0.85). The proportion of good credits falls off for the largest loan amounts.]
Credit-worthiness across Amount (Cat)
Take a look
What are the odds of given credit at various levels of
asking amount?
How does it vary?
How does log odds vary?
Regressing Creditworthiness on Amount
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.229e+00 1.083e-01 11.348 < 2e-16 ***
CreditAmt -1.119e-04 2.355e-05 -4.751 2.02e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1221.7 on 999 degrees of freedom
Residual deviance: 1199.1 on 998 degrees of freedom
AIC: 1203.1
Number of Fisher Scoring iterations: 4
Regressing Creditworthiness on Amount
logit(π̂) = logit(Pr(Y = 1)) = 1.229 − 0.00012 × CreditAmt
π̂ = exp(1.229 − 0.00012 × Amt) / [1 + exp(1.229 − 0.00012 × Amt)]
Regressing Creditworthiness on Amount
For every unit increase in the predictor (Amount), the log odds of success (being credit-worthy) decrease by 0.00012
log(odds of success) is linear in the predictor
Convert back to original scale
Compute
Pr(credit-worthiness | Amount = 500, 600, 700)
Pr(credit-worthiness | Amount = 4000, 6000, 8000)
Pr(credit-worthiness | Amount = 15000, 20000, 25000)
What do you observe?
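Using the rounded coefficients from the slide (1.229 and −0.00012), the requested probabilities can be computed directly; a Python sketch:

```python
import math

alpha, beta = 1.229, -0.00012   # rounded estimates from the fitted model

def prob_creditworthy(amount):
    """Back-transform the fitted logit to a probability."""
    z = alpha + beta * amount
    return math.exp(z) / (1 + math.exp(z))

for amt in (500, 4000, 15000):
    print(amt, round(prob_creditworthy(amt), 3))
# The probability of being credit-worthy falls steadily as the
# requested amount grows.
```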
Odds Ratio
What are the odds of getting a loan if, instead of applying for an amount of 8000, you apply for an amount of 6000?
Odds of getting the loan: [π̂ / (1 − π̂) | amount = 8000] = 1.30
Odds of getting the loan: [π̂ / (1 − π̂) | amount = 6000] = 1.66
1.27 times improvement
Odds Ratio
What are the odds of getting a loan if, instead of applying for an amount of 6000, you apply for an amount of 4000?
Odds of getting the loan: [π̂ / (1 − π̂) | amount = 6000] = 1.66
Odds of getting the loan: [π̂ / (1 − π̂) | amount = 4000] = 2.11
1.27 times improvement
Logistic Regression Parameter (Slope)
From the previous computations:
For a decrease of 2000 units in amount, the odds improve by a factor of exp(2000 × 0.00012) = 1.27
For an increase of 2000 units in amount, the odds shrink by a factor of exp(−2000 × 0.00012) = 0.7866
log(odds) decreases by 0.00012 for every unit increase in loan amount
⇒ for each additional unit of loan amount, the odds of being credit-worthy are multiplied by exp(β) = exp(−0.00012) = 0.99988
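The factor-of-1.27 improvement seen on the two previous slides is exactly exp(2000 |β|); a quick check in Python:

```python
import math

beta = -0.00012   # rounded slope from the fitted model

# Odds ratios for a 2000-unit change in the requested amount
or_increase = math.exp(2000 * beta)    # increase of 2000: odds shrink
or_decrease = math.exp(-2000 * beta)   # decrease of 2000: odds grow

print(round(or_increase, 4))   # 0.7866
print(round(or_decrease, 4))   # 1.2712
```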
Creditworthiness on Duration
Fit a logistic regression on Duration
What are the insights you get from the model?
What are the odds of the loan being sanctioned when Duration = 12 months?
What are the odds of the loan being sanctioned when Duration = 30 months?
What are the corresponding probabilities?
Multiple Logistic Regression
Call:
glm(formula = Creditability ~ ., family = binomial(link = logit),
data = German_Credit)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8249 -1.2734 0.7164 0.8533 1.5020
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.670e+00 1.466e-01 11.390 < 2e-16 ***
CreditAmt -2.300e-05 3.059e-05 -0.752 0.452
DurCredit -3.412e-02 7.282e-03 -4.685 2.8e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Accuracy Measures
• True Positive TP: Correctly classified as Positive
• True Negative TN: Correctly classified as Negative
• Misclassification Probability = (FP + FN) / (TP + TN + FP + FN)
• Sensitivity TP Rate = TP/P = TP/(TP + FN)
• Specificity TN Rate = TN/N = TN/(TN + FP)
Accuracy Measures
Confusion matrix (rows: actual Creditability, columns: predicted):

               Predicted
               FALSE  TRUE
Creditability 0   40   260
Creditability 1   30   670

TP = 670, TN = 40, FP = 260, FN = 30
Misclassification Probability = (30 + 260)/1000 = 0.29
Accuracy = 1 − 0.29 = 0.71
Sensitivity = 670/700 = 0.95
Specificity = 40/300 = 0.13
Of the 300 actual negatives, only 40 are correctly predicted
Practical Implication?
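The slide's figures follow directly from the four cell counts; a Python sketch:

```python
# Cell counts from the confusion matrix above (rows: actual, cols: predicted)
TN, FP = 40, 260    # actual Creditability = 0
FN, TP = 30, 670    # actual Creditability = 1

total = TN + FP + FN + TP
misclassification = (FP + FN) / total   # 0.29
accuracy = (TP + TN) / total            # 0.71
sensitivity = TP / (TP + FN)            # ~0.96: most good credits are caught
specificity = TN / (TN + FP)            # ~0.13: most bad credits are missed
print(misclassification, accuracy, round(sensitivity, 2), round(specificity, 2))
```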
ROC Curve
All possible combinations of sensitivity and specificity achievable by changing the cutoff value are summarized by the ROC curve.
The ROC curve plots Sensitivity (TPR) against 1 − Specificity (FPR).
It summarizes the model's predictive power over all possible cutoff values.
The area under the curve (AUC), also called the index of accuracy or concordance index, is a performance metric for the ROC curve.
The higher the area under the curve, the better the predictive power of the model.
ROC Curve
The higher the AUC, the more accurate the test
An AUC of 1.0 means the test is perfectly accurate (the curve hugs the top left corner)
An AUC of 0.5 means the ROC curve is the straight diagonal line: the test is no better than chance
When comparing two tests, the more accurate one has an ROC curve closer to the top left corner of the graph, and hence a higher AUC
The best cutoff point for a test (which separates positive from negative values) is the point on the ROC curve closest to the top left corner of the graph
Cutoff values can be selected according to whether one wants more sensitivity or more specificity
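The cutoff sweep behind the ROC curve can be sketched with plain Python on a toy set of fitted probabilities (the scores and labels below are made up for illustration):

```python
def roc_points(scores, labels):
    """(FPR, TPR) at every distinct cutoff, highest cutoff first."""
    p = sum(labels)              # number of positives
    n = len(labels) - p          # number of negatives
    pts = []
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        pts.append((fp / n, tp / p))
    return pts

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]   # hypothetical fitted probabilities
labels = [1,   1,   0,   1,   0,   0]     # true classes
print(roc_points(scores, labels))
# Lowering the cutoff trades specificity (FPR rises) for sensitivity (TPR rises).
```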
ROC & AUC: Credit Data
AUC = 62% AUC = 64%
ROC & AUC: Credit Data
AUC = 80%
Deviance
• A measure of goodness of fit of a generalized linear model
• The higher the deviance, the poorer the model fit
• When the model includes only an intercept term, its deviance is the null deviance: the maximum deviance possible for a given set of data
Null and Residual Deviance
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.666351 0.146615 11.365 < 2e-16 ***
DurCredit -0.037538 0.005703 -6.582 4.63e-11 ***
Null deviance: 1221.7 on 999 degrees of freedom
Residual deviance: 1177.1 on 998 degrees of freedom
At the cost of 1 degree of freedom, the residual deviance of the one-predictor model is 44.6 lower than the null deviance
The reduction in deviance is approximately χ²-distributed
The reduction is highly significant
Deviance for Model Comparison
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.670e+00 1.466e-01 11.390 < 2e-16 ***
CreditAmt -2.300e-05 3.059e-05 -0.752 0.452
DurCredit -3.412e-02 7.282e-03 -4.685 2.8e-06 ***
Null deviance: 1221.7 on 999 degrees of freedom
Residual deviance: 1176.6 on 997 degrees of freedom
With DurCredit only: residual deviance 1177.1 on 998 degrees of freedom
Inclusion of CreditAmt reduces the residual deviance by only 0.5, while the degrees of freedom drop by 1
Pr(χ²(1) > 0.5) = 0.48: highly non-significant
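The p-value can be checked without R: for one degree of freedom the χ² survival function reduces to the complementary error function. A Python sketch:

```python
import math

def chi2_sf_1df(x):
    """P(chi-squared with 1 df > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2))

drop = 1177.1 - 1176.6                # deviance reduction from adding CreditAmt
print(round(chi2_sf_1df(drop), 2))    # 0.48 -- far from significant

big_drop = 1221.7 - 1177.1            # reduction from adding DurCredit alone
print(chi2_sf_1df(big_drop) < 0.001)  # True -- highly significant
```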
Deviance for Model Comparison
• Deviance comparison helps identify a parsimonious model in a hierarchical comparison
• As more predictors are added, the residual deviance decreases
• If the reduction is significant, the predictor may remain in the model
• If the reduction is not significant, the predictor need not be included in the model
Deviance for Model Comparison
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.27788 0.32183 3.971 7.17e-05 ***
`Duration of Credit (month)` -0.04037 0.00586 -6.890 5.58e-12 ***
`Sex & Marital Status`2 0.13981 0.31948 0.438 0.6617
`Sex & Marital Status`3 0.68835 0.31199 2.206 0.0274 *
`Sex & Marital Status`4 0.45121 0.38021 1.187 0.2353
Null deviance: 1221.7 on 999 degrees of freedom
Residual deviance: 1162.9 on 995 degrees of freedom
Is there any need for including Sex & Marital Status?
Pseudo R2
• McFadden’s pseudo R2 compares the log likelihood of the null model (intercept-only) with that of the current model
• It does not have the total-variance-partitioning interpretation of R2 in linear regression
• Technically its value lies between 0 and 1, but in practice it does not reach 1
• A McFadden R2 between 0.2 and 0.4 typically indicates a good (acceptable) model
Pseudo R2
library(DescTools)   ## a package for descriptive statistics
PseudoR2(model, which = "McFadden")

For the models considered:

Predictors                    McFadden R2
Credit Amount                 1.8%
Duration                      3.6%
Credit Amount, Duration       3.7%
Duration, Sex/Marital Status  4.8%
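These percentages can be reproduced from the deviances reported earlier, since deviance is −2 × log-likelihood; a Python check:

```python
# McFadden's pseudo R^2 = 1 - logLik(model) / logLik(null)
#                       = 1 - residual_deviance / null_deviance
# Deviances are the values reported on the earlier slides.
def mcfadden_r2(residual_deviance, null_deviance):
    return 1 - residual_deviance / null_deviance

print(round(mcfadden_r2(1199.1, 1221.7), 3))   # ~0.018: Credit Amount only
print(round(mcfadden_r2(1177.1, 1221.7), 3))   # ~0.037: Duration only
print(round(mcfadden_r2(1176.6, 1221.7), 3))   # ~0.037: both predictors
```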
Hosmer-Lemeshow Test
Goodness-of-fit approach to model fitting: how well a model fits depends on the difference between the model and the observed data

Hosmer-Lemeshow test (library ResourceSelection):

Predictors                    Statistic  P-value
Credit Amount                  8.90      0.35
Duration                      10.60      0.22
Credit Amount, Duration       11.14      0.19
Duration, Sex/Marital Status   7.17      0.52
Model Selection
Backward Elimination:
glm(formula = Creditability ~ AcctBalance + DurCredit +
`Paymnt Status` + CreditAmt + Value + LengthEmpl + Instalment
+ SexMS + Guarantors + MVAA + ConcurrentCredits + Apt +
NoCredit + Telephone + ForeignWorker, family = binomial(link =
logit), data = German_Credit)
Do you think there is any scope for improvement?
Model Selection
Forward Selection:
glm(formula = Creditability ~ AcctBalance + DurCredit +
`Paymnt Status` + Value + Guarantors + Instalment + SexMS +
LengthEmpl + ConcurrentCredits + CreditAmt + ForeignWorker
+ Telephone + Apt + MVAA + NoCredit, family = binomial(link =
logit), data = German_Credit)
Do you think there is any scope for improvement?
Model Selection
Bothways:
glm(formula = Creditability ~ AcctBalance + DurCredit +
`Paymnt Status` + Value + Guarantors + Instalment + SexMS +
LengthEmpl + ConcurrentCredits + CreditAmt + ForeignWorker
+ Telephone + Apt + MVAA + NoCredit, family = binomial(link =
logit), data = German_Credit)
Do you think there is any scope for improvement?
Model Selection
What final model will you recommend?
Compare based on all the criteria considered
Indian Liver Patients Data
Recommend a model to understand which variables contribute significantly to the identification of a liver patient
• Train and test sample
• Cross-validation
• SMOTE
Data Split
• The data is split into 3 parts at random
• Training data: several models are developed
• Validation data: prediction error is estimated for model selection
• Test data: the generalization error of the final model is assessed
Data Split Proportion
• No gold standard exists
• The training proportion should depend on model complexity
• Often the data is not large enough for a 3-part split
• Train : Test is commonly 70:30 or 80:20
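A 70:30 split can be sketched with nothing but the standard library (the helper below is illustrative, not from the slides):

```python
import random

def split_train_test(rows, test_frac=0.3, seed=42):
    """Randomly partition rows into train and test sets."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(rows) * test_frac)
    test_set = set(idx[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_set]
    test = [r for i, r in enumerate(rows) if i in test_set]
    return train, test

data = list(range(1000))               # stand-in for 1000 credit records
train, test = split_train_test(data)
print(len(train), len(test))           # 700 300
```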
Cross-Validation
• Leave-one-out cross-validation (LOOCV)
  Train the model on all observations except one
  Find the test error on the left-out observation
  The final error rate is the average of all n errors
• k-fold cross-validation
  Split the data into k folds
  Train the model on all folds except the k-th; test on the k-th fold
  Repeat over all folds and average the error rates
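The fold construction can be sketched in a few lines; LOOCV is just the special case k = n (Python, illustrative):

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous, nearly equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)   # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
print(folds)                     # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(len(kfold_indices(5, 5)))  # 5: LOOCV leaves out one point per fold
```

In practice the indices would be shuffled before folding; contiguous folds keep the sketch deterministic.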
Cross-Validation
• The k-fold CV estimate depends on the particular split
• In k-fold CV, we train the model on less data than is available; this introduces bias into the estimates of test error
• In LOOCV, the training samples highly resemble each other; this increases the variance of the test-error estimate
• The training error rate sometimes increases because logistic regression does not directly minimize the 0-1 error rate but maximizes the likelihood
Rule of thumb: Choose the simplest model whose CV
error is no more than one standard error above the
model with the lowest CV error.
Cross-Validation
Every aspect of the learning method that involves using
the data — variable selection, for example — must be
cross-validated.
• Divide the data into k folds.
• For i = 1, . . . , k:
• Using every fold except i, perform the variable selection and fit the
model with the selected variables.
• Compute the error on fold i.
• Average the k test errors obtained.
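The loop above can be sketched directly; `select_variables` and `fit_and_error` are hypothetical stand-ins for whatever selection and fitting routine is used (Python, illustrative):

```python
def cross_validate(folds, data, select_variables, fit_and_error):
    """Average test error when selection is redone inside every fold."""
    errors = []
    for test_fold in folds:
        # Train on every fold except the held-out one
        train = [data[j] for f in folds if f is not test_fold for j in f]
        test = [data[j] for j in test_fold]
        chosen = select_variables(train)   # selection sees training data only
        errors.append(fit_and_error(train, test, chosen))
    return sum(errors) / len(errors)
```

Doing the variable selection once on the full data and then cross-validating only the fit leaks information from the test folds and biases the error estimate downward.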