
DS535 Note 4 (With Marks)

Logistic regression extends linear regression to model categorical outcome variables. It relates predictor variables to the logit or log-odds of the outcome. The logit is a linear function of the predictors that can be mapped back to a probability between 0 and 1. An example uses demographic and bank customer data to predict the probability of accepting a personal loan using a single logistic regression model.

Uploaded by

nuthan manideep

DS 535: ADVANCED DATA MINING FOR BUSINESS

Lecture Notes #4: Logistic Regression

(Textbook reading: Chapter 10)

Logistic Regression
o Extends the idea of linear regression to situations where the outcome variable is categorical
o Widely used, particularly where a structured model is useful to explain (=profiling) or to predict
o We focus on binary classification, i.e. Y = 0 or Y = 1

The Logit
Goal: Find a function of the predictor variables that relates them to a 0/1 outcome

o Instead of Y as the outcome variable (as in linear regression), we use a function of Y called the logit
o The logit can be modeled as a linear function of the predictors
o The logit can be mapped back to a probability, which, in turn, can be mapped to a class

Step 1: Logistic Response Function

p = probability of belonging to class 1

o Need to relate p to the predictors with a function that guarantees 0 ≤ p ≤ 1
o A standard linear function (as shown below) does not guarantee 0 ≤ p ≤ 1:

$$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

The Fix: use the logistic response function

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}}$$
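As a quick numeric check of why the logistic response function is used, the sketch below evaluates it on a few values of the linear predictor. The values in `eta` are hypothetical and the helper name `logistic` is introduced only for this illustration; base R's `plogis()` computes the same function.

```r
# Hypothetical values of the linear predictor b0 + b1*x1 + ... (illustration only)
eta <- c(-5, -1, 0, 1, 5)

# Logistic response function: p = 1 / (1 + exp(-eta))
logistic <- function(eta) 1 / (1 + exp(-eta))

p <- logistic(eta)
print(round(p, 4))

# The linear predictor itself ranges outside [0, 1]; the mapped values do not
print(range(eta))
print(range(p))

# Base R's plogis() is the same function
print(isTRUE(all.equal(p, plogis(eta))))
```

Whatever value the linear part takes, the mapped value stays strictly between 0 and 1, which is the property the linear function lacks.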
Step 2: The Odds
The odds of an event are defined as:

$$\text{Odds} = \frac{p}{1 - p} \qquad \text{(eq. 10.3)}$$

Or, given the odds of an event, the probability of the event can be computed by:

$$p = \frac{\text{Odds}}{1 + \text{Odds}} \qquad \text{(eq. 10.4)}$$

We can also relate the Odds to the predictors:

$$\text{Odds} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q}$$
Step 3: Take log on both sides
This gives us the logit:

$$\log(\text{Odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

log(Odds) = logit (eq. 10.6)

So, the logit is a linear function of the predictors x1, x2, ...
It takes values from -infinity to +infinity
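The three steps can be traced numerically. The sketch below converts a few hypothetical probabilities to odds and logits and back again; the variable names are chosen only for this illustration.

```r
# Hypothetical probabilities of belonging to class 1
p <- c(0.1, 0.5, 0.9)

odds   <- p / (1 - p)        # odds from probability (eq. 10.3)
logit  <- log(odds)          # logit = log(odds) (eq. 10.6)
p_back <- odds / (1 + odds)  # probability recovered from odds (eq. 10.4)

print(data.frame(p, odds, logit, p_back))

# Base R's qlogis() computes the logit directly
print(isTRUE(all.equal(logit, qlogis(p))))
```

Note that p = 0.5 corresponds to odds of 1 and a logit of 0, which is why a probability cutoff of 0.5 is equivalent to a logit cutoff of 0.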

[Figure: Odds (a) and Logit (b) as a function of p; x-axis: probability of success]

Example
Personal Loan Offer (UniversalBank.csv)

Outcome variable: accept bank loan (0/1)

Predictors: Demographic info, and info about their bank relationship

Description of Predictors for Acceptance of Personal Loan Example

Age                  Customer's age in completed years
Experience           Number of years of professional experience
Income               Annual income of the customer ($000s)
Family Size          Family size of the customer
CCAvg                Average spending on credit cards per month ($000s)
Education            Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage             Value of house mortgage if any ($000s)
Securities Account   Coded as 1 if customer has securities account with bank
CD Account           Coded as 1 if customer has certificate of deposit (CD) account with bank
Online Banking       Coded as 1 if customer uses Internet banking facilities
Credit Card          Coded as 1 if customer uses credit card issued by Universal Bank

Single Model

Data Prep:
#### Table 10.2
bank.df <- read.csv("UniversalBank.csv")
head(bank.df)
> head(bank.df)
  ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1  1  25          1     49    91107      4   1.6         1        0             0                  1          0      0          0
2  2  45         19     34    90089      3   1.5         1        0             0                  1          0      0          0
3  3  39         15     11    94720      1   1.0         1        0             0                  0          0      0          0
4  4  35          9    100    94112      1   2.7         2        0             0                  0          0      0          0
5  5  35          8     45    91330      4   1.0         2        0             0                  0          0      0          1
6  6  37         13     29    92121      4   0.4         2      155             0                  0          0      1          0

bank.df <- bank.df[ , -c(1, 5)]  # Drop ID and zip code in columns 1 and 5
# Education is 1, 2, 3.
# want to convert it to categorical (R will create dummy variables)
bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3),
                            labels = c("Undergrad", "Graduate",
                                       "Advanced/Professional"))
head(bank.df)
  Age Experience Income Family CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online CreditCard
1  25          1     49      4   1.6 Undergrad        0             0                  1          0      0          0
2  45         19     34      3   1.5 Undergrad        0             0                  1          0      0          0
3  39         15     11      1   1.0 Undergrad        0             0                  0          0      0          0
4  35          9    100      1   2.7  Graduate        0             0                  0          0      0          0
5  35          8     45      4   1.0  Graduate        0             0                  0          0      0          1
6  37         13     29      4   0.4  Graduate      155             0                  0          0      1          0

# partition data
set.seed(2)  # random sample can be reproduced by setting a value for seed
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)  # 60% of the rows
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]

Single Predictor Model: Modeling loan acceptance on income (x)

$$\text{Prob}(\text{Personal Loan} = \text{Yes} \mid \text{Income} = x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

Fitted coefficients (more later): b0 = -6.3525, b1 = 0.0392

$$P(\text{Personal Loan} = \text{Yes} \mid \text{Income} = x) = \frac{1}{1 + e^{6.3525 - 0.0392x}}$$
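To see what the single-predictor model implies, the sketch below evaluates the fitted curve at a few income levels. It assumes a positive income coefficient, b1 = 0.0392 (the sign under which acceptance probability rises with income); the income values are hypothetical.

```r
b0 <- -6.3525
b1 <- 0.0392   # assumed positive: probability of acceptance rises with income

income <- c(50, 100, 150, 200, 250)   # hypothetical incomes in $000s
p <- 1 / (1 + exp(-(b0 + b1 * income)))
print(data.frame(income, p = round(p, 4)))

# Income at which the estimated probability crosses 0.5 (logit = 0)
income_at_half <- -b0 / b1
print(income_at_half)
```

Under these coefficients the estimated probability passes 0.5 at an income of roughly $162,000, so with a 0.5 cutoff only customers above that income would be classified as acceptors by this one-predictor model.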
Seeing the Relationship

[Figure: estimated probability of loan acceptance (0 to 1) plotted against Income (in $000s, 0 to 250)]

Last step - classify

o Model produces an estimated probability of being a "1"
o Convert to a classification by establishing a cutoff level
o If estimated prob. > cutoff, classify as "1"

Ways to Determine Cutoff
o 0.50 is a popular initial choice

Additional considerations (see Chapter 5)

o Maximize classification accuracy
o Maximize sensitivity (subject to min. level of specificity)
o Minimize false positives (subject to max. false negative rate)
o Minimize expected cost of misclassification (need to specify costs)
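The effect of the cutoff can be sketched on a toy example. The actual classes, predicted probabilities, and misclassification costs below are all hypothetical, and the helper names `classify` and `misclass_cost` are introduced only for this illustration.

```r
# Hypothetical actual classes and estimated probabilities (illustration only)
actual <- c(0, 0, 0, 0, 1, 0, 1, 1, 0, 1)
prob   <- c(0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.15, 0.95)

# Classify as "1" when the estimated probability exceeds the cutoff
classify <- function(prob, cutoff) ifelse(prob > cutoff, 1, 0)

# Total misclassification cost for given (assumed) unit costs
misclass_cost <- function(actual, pred, cost_fn, cost_fp) {
  fn <- sum(actual == 1 & pred == 0)  # false negatives
  fp <- sum(actual == 0 & pred == 1)  # false positives
  fn * cost_fn + fp * cost_fp
}

for (cutoff in c(0.3, 0.5, 0.7)) {
  pred <- classify(prob, cutoff)
  cat("cutoff:", cutoff,
      " accuracy:", mean(pred == actual),
      " cost:", misclass_cost(actual, pred, cost_fn = 5, cost_fp = 1), "\n")
}
```

When a false negative is assumed to cost more than a false positive, a cutoff below 0.5 can give a lower total cost even if accuracy is unchanged.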

Estimation: Estimates of the b's are derived through an iterative process called maximum likelihood estimation

Let's include all 12 predictors in the model now

In R use function glm (for generalized linear model) and family = "binomial"
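To make "maximum likelihood estimation" concrete, the sketch below fits a one-predictor logistic model on simulated data by minimizing the negative log-likelihood with a general-purpose optimizer, then compares the result with glm. The data, the true coefficients, and the use of `optim()` are assumptions made for illustration; glm itself uses Fisher scoring (iteratively reweighted least squares), not this optimizer.

```r
set.seed(1)

# Simulated data (illustration only): true intercept -0.5, true slope 1.2
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of the logistic model for coefficients beta
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

# Generic iterative optimizer (a stand-in, not the algorithm glm uses)
fit_optim <- optim(c(0, 0), neg_loglik, method = "BFGS")

# Reference fit
fit_glm <- glm(y ~ x, family = "binomial")

print(rbind(optim = fit_optim$par, glm = unname(coef(fit_glm))))
```

Both approaches search for the coefficients that make the observed 0/1 outcomes most probable, so the two rows should agree closely.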

Fitting the model

> # run logistic regression
> # use glm() (generalized linear model) with family = "binomial" to fit a logistic
> # regression.
> logit.reg <- glm(Personal.Loan ~ ., data = train.df, family = "binomial")
> options(scipen=999)
> summary(logit.reg)

Call:
glm(formula = Personal.Loan ~ ., family = "binomial", data = train.df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0380  -0.1847  -0.0621  -0.0183   3.9810

Coefficients:
                                   Estimate  Std. Error z value             Pr(>|z|)
(Intercept)                    -12.6805628   2.2903310  -5.537         0.0000000308 ***
Age                             -0.0369346   0.0848937  -0.435              0.66351
Experience                       0.0490645   0.0844410   0.581              0.56121
Income                           0.0612953   0.0039762  15.416 < 0.0000000000000002 ***
Family                           0.5434651   0.0994936   5.462         0.0000000470 ***
CCAvg                            0.2165942   0.0601900   3.599              0.00032 ***
EducationGraduate                4.2681068   0.3703378  11.525 < 0.0000000000000002 ***
EducationAdvanced/Professional   4.4408154   0.3723360  11.927 < 0.0000000000000002 ***
Mortgage                         0.0015499   0.0007926   1.955              0.05052 .
Securities.Account              -1.1457476   0.3955796  -2.896              0.00377 **
CD.Account                       4.5855656   0.4777696   9.598 < 0.0000000000000002 ***
Online                          -0.8588074   0.2191217  -3.919         0.0000888005 ***
CreditCard                      -1.2514213   0.2944767  -4.250         0.0000214111 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1901.71  on 2999  degrees of freedom
Residual deviance:  682.19  on 2987  degrees of freedom
AIC: 708.19

Number of Fisher Scoring iterations: 8

(Note: smaller deviance is better.)
belkr
# To assess the relative importance of individual predictors in the model,
# we can also look at the absolute value of the t-statistic (for a glm fit,
# the z value) for each model parameter. This technique is utilized by the
# varImp function in the caret package

library(caret)  # varImp() comes from caret
varImp(logit.reg)

names(logit.reg)

# the names available in logit.reg can be referred to by "logit.reg$"

Evaluating Classification Performance
Performance measures: Confusion matrix and % of misclassifications

# some libraries
install.packages("e1071")
library(e1071)  # misc functions in stat, prob
install.packages("Rcpp")
library(Rcpp)  # R and C++ integration
install.packages("caret")
library(caret)  # Classification And REgression Training

# evaluate
# predict(logit.reg, valid.df) without a type argument returns predictions on the
# logit (log-odds) scale, i.e. the linear part of the model;
# adding type = "response" converts them to probabilities through the
# logistic response function
pred <- predict(logit.reg, valid.df, type = "response")
# the same as
# pred <- predict(logit.reg, valid.df[, -8], type = "response")

# evaluate (column 8 of valid.df is Personal.Loan)
> # get the confusion matrix using table()
> # rows: predicted class (prob > 0.5); columns: actual class
> c.mat <- table(ifelse(pred > 0.5, 1, 0), valid.df[, 8])
> c.mat

       0    1
  0 1779   58
  1   30  133

> # this gives accuracy
> sum(diag(c.mat))/sum(c.mat)
[1] 0.956
> # number of 0 and 1 in valid.df
> table(valid.df[, 8])

   0    1
1809  191

sort(pred, decreasing = TRUE)
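The other performance measures mentioned earlier follow from the same four counts. The sketch below rebuilds the confusion matrix from the counts reported above (rows taken as predicted class, columns as actual class) and derives accuracy, sensitivity, and specificity; the object names are chosen for this illustration.

```r
# Confusion matrix counts from the validation set at cutoff 0.5
# (rows = predicted class, columns = actual class)
c.mat <- matrix(c(1779, 30, 58, 133), nrow = 2,
                dimnames = list(predicted = c("0", "1"),
                                actual    = c("0", "1")))
print(c.mat)

accuracy    <- sum(diag(c.mat)) / sum(c.mat)
sensitivity <- c.mat["1", "1"] / sum(c.mat[, "1"])  # true positive rate
specificity <- c.mat["0", "0"] / sum(c.mat[, "0"])  # true negative rate
misclass    <- 1 - accuracy

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, misclass = misclass), 4))
```

Overall accuracy is high mainly because non-acceptors dominate the validation set; the sensitivity shows that a sizable share of actual acceptors is still missed at this cutoff.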

# ROC Curve and other capabilities of "InformationValue"
install.packages("InformationValue")
library(InformationValue)
plotROC(valid.df[, 8], pred)

[Figure: ROC Curve; x-axis: 1-Specificity (FPR)]
" taltc rt&nrt,^t :" F.7
> confusionMatrix(valid.df[, 8], pred, threshold = 0.5)
     0   1
0 1779  58
1   30 133

> misClassError(valid.df[, 8], pred, threshold = 0.5)
[1] 0.044
> # Compute the optimal probability cutoff score, based on a user defined objective
> optimalCutoff(valid.df[, 8], pred, optimiseFor = "misclasserror",
+               returnDiagnostics = TRUE)

$optimalCutoff
[1] 0.4799484

$sensitivityTable
         CUTOFF          FPR         TPR YOUDENSINDEX SPECIFICITY MISCLASSERROR
1   0.999948397 0.0000000000 0.005235602  0.005235602   1.0000000        0.0950
2   0.989948397 0.0005527916 0.2041885    0.203635690   0.9994472        0.0765
3   0.979948397 0.0005527916 0.3036649    0.303112130   0.9994472        0.0670
4   0.969948397 0.0005527916 0.3350785    0.334525742   0.9994472        0.0640
... output skipped
48  0.529948397 0.01326700   0.6701571    0.656890070   0.9867330        0.0435
49  0.519948397 0.01437258   0.6806283    0.666255691   0.9856274        0.0435
50  0.509948397 0.01547816   0.6963351    0.680856914   0.9845218        0.0430
51  0.499948397 0.01658375   0.6963351    0.679751331   0.9834163        0.0440
52  0.489948397 0.01658375   0.7015707    0.684986933   0.9834163        0.0435
53  0.479948397 0.01713654   0.7120419    0.694905345   0.9828635        0.0430
54  0.469948397 0.01768933   0.7120419    0.694352554   0.9823107        0.0435
55  0.459948397 0.01824212   0.7172775    0.699035364   0.9817579        0.0435
... output skipped
95  0.059948397 0.1299060    0.9005236    0.770617535   0.8700940        0.1270
96  0.049948397 0.1448314    0.9162304    0.771398968   0.8551686        0.1390
97  0.039948397 0.1658375    0.9214660    0.755628489   0.8341625        0.1575
98  0.029948397 0.2012161    0.9267016    0.725485429   0.7987839        0.1890
99  0.019948397 0.2399116    0.9424084    0.702496824   0.7600884        0.2225
100 0.009948397 0.3228303    0.9685864    0.645756094   0.6771697        0.2950

$misclassificationError
[1] 0.043

$TPR
[1] 0.7120419

$FPR
[1] 0.01713654

$Specificity
[1] 0.9828635

> # optimiseFor: the maximization criterion for which the probability cutoff score needs to be optimised.
> # Can take either of the following values: "Ones" or "Zeros" or "Both" or
> # "misclasserror" (default). If "Ones" is used, 'optimalCutoff' will be chosen to
> # maximise detection of "One's". If 'Both' is specified, the probability cut-off
> # that gives maximum Youden's Index is chosen. If 'misclasserror' is specified,
> # the probability cut-off that gives minimum mis-classification error is chosen.
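The idea behind a cutoff search can be sketched in base R without the package: scan a grid of cutoffs, compute the diagnostics at each, and pick the cutoff that optimizes the chosen criterion. The data below are hypothetical and this is only an illustration of the principle, not the package's implementation.

```r
# Hypothetical actual classes and predicted probabilities (illustration only)
actual <- c(0, 0, 0, 0, 1, 0, 1, 1, 0, 1)
prob   <- c(0.04, 0.12, 0.21, 0.33, 0.42, 0.56, 0.61, 0.83, 0.17, 0.96)

cutoffs <- seq(0.05, 0.95, by = 0.05)

# One row of diagnostics per cutoff
scan <- t(sapply(cutoffs, function(k) {
  pred <- ifelse(prob > k, 1, 0)
  tpr <- sum(pred == 1 & actual == 1) / sum(actual == 1)
  fpr <- sum(pred == 1 & actual == 0) / sum(actual == 0)
  c(cutoff = k, TPR = tpr, FPR = fpr,
    youden = tpr - fpr, misclass = mean(pred != actual))
}))
print(round(scan, 3))

# First cutoff achieving the minimum misclassification error
best_misclass <- scan[which.min(scan[, "misclass"]), "cutoff"]
# First cutoff achieving the maximum Youden's index (TPR - FPR)
best_youden <- scan[which.max(scan[, "youden"]), "cutoff"]
print(c(best_misclass = unname(best_misclass),
        best_youden = unname(best_youden)))
```

Several cutoffs can tie on a criterion; this sketch simply reports the first one on the grid.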

Converting to Probability

$$p = \frac{\text{Odds}}{1 + \text{Odds}}$$

Function predict does the conversion from logit to probabilities

> #### Table 10.3

> # use predict() with type = "response" to compute predicted probabilities.
> logit.reg.pred <- predict(logit.reg, valid.df[, -8], type = "response")

> # first 5 actual and predicted records
> data.frame(actual = valid.df$Personal.Loan[1:5], predicted = logit.reg.pred[1:5])
   actual     predicted
2       0 0.00002707663
6       0 0.00326343313
9       0 0.03966293189
10      1 0.98846040514
11      0 0.59933974791
> # same as
> # data.frame(actual = valid.df$Personal.Loan[1:5], predicted = pred[1:5])

> sort(pred, decreasing = TRUE)
(output abbreviated: the predicted probabilities are listed from the largest, 0.9999483968208, down to the smallest, 0.0000007090812)

Interpreting Odds, Probability
For predictive classification, we typically use probability with a cutoff value

For explanatory purposes, odds have a useful interpretation:

If we increase x1 by one unit, holding x2, x3, ..., xq constant, then e^{b1} is the factor by which the odds of belonging to class 1 are multiplied:

$$\text{Odds}(x_1 + 1) = e^{b_0 + b_1 (x_1 + 1) + b_2 x_2 + \dots + b_q x_q} = e^{b_1} \cdot \text{Odds}(x_1)$$

So the odds increase when b1 > 0 and decrease when b1 < 0.
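As an illustration of this interpretation, the sketch below exponentiates three of the fitted coefficients reported in the glm output (Income, CD.Account, Online) to obtain the multiplicative change in odds per unit increase. The helper `odds` and the intercept value passed to it are hypothetical and serve only to check the multiplicative property.

```r
# Fitted coefficients taken from the glm summary output in these notes
b <- c(Income = 0.0612953, CD.Account = 4.5855656, Online = -0.8588074)

# Multiplicative change in the odds for a one-unit increase in each predictor
odds_factor <- exp(b)
print(round(odds_factor, 4))

# Check of the multiplicative property with a hypothetical intercept of -5:
# odds at Income = 101 divided by odds at Income = 100 equals exp(b_Income)
odds <- function(x, b0, b1) exp(b0 + b1 * x)
ratio <- odds(101, -5, b["Income"]) / odds(100, -5, b["Income"])
print(unname(ratio))
```

A factor above 1 (Income, CD.Account) means the odds of accepting the loan rise with the predictor; a factor below 1 (Online) means they fall, holding the other predictors constant.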
Loan Example: Evaluating Classification Performance

Performance measures: Confusion matrix and % of misclassifications

More useful in this example: lift (or gains) chart

The "lift" over the base curve indicates, for a given number of cases (read on the x-axis), the
additional responders that you can identify by using the model.
The same information is portrayed in the decile-wise lift chart: taking the 10% of the records
that are ranked by the model as "most probable 1's" yields 7.9 times as many 1's as would simply
selecting 10% of the records at random.
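The 7.9 figure can be reproduced by hand. The sketch below assumes the per-decile counts of actual 1's in the validation set that correspond to the gains output reported later in these notes (200 records per decile, 191 acceptors in total); the variable names are chosen for this illustration.

```r
# Number of actual 1's in each decile of 200 validation records,
# ordered from highest to lowest predicted probability
ones_per_decile <- c(151, 21, 8, 5, 2, 0, 2, 1, 1, 0)
n_per_decile <- 200

# Overall proportion of 1's in the validation set
overall_rate <- sum(ones_per_decile) / (n_per_decile * 10)

# Decile-wise lift: response rate in the decile relative to the overall rate
lift <- (ones_per_decile / n_per_decile) / overall_rate
print(round(lift, 1))

# Cumulative number of 1's captured (the y-values of the lift chart)
print(cumsum(ones_per_decile))
```

The first decile contains 151 of the 191 acceptors, a response rate of 75.5% against an overall rate of 9.55%, which gives the lift of about 7.9.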

[Figure: Lift chart (validation dataset); x-axis: # cases; y-axis: cumulative Personal Loan; curves: cumulative Personal Loan when sorted using predicted values, and cumulative Personal Loan using average]

[Figure: Decile-wise lift chart (validation dataset); x-axis: Deciles (1 to 10)]

#### Figure 10.3
> install.packages("gains")  # to get lift chart
> library(gains)
> gain <- gains(valid.df$Personal.Loan, logit.reg.pred,
+               groups=10)

> class(gain)
[1] "gains"

> names(gain)
 [1] "depth"             "obs"               "cume.obs"          "mean.resp"
 [5] "cume.mean.resp"    "cume.pct.of.total" "lift"              "cume.lift"
 [9] "mean.prediction"   "min.prediction"    "max.prediction"    "conf"
[13] "optimal"           "num.groups"        "percents"

> data.frame(c(0, gain$cume.pct.of.total*sum(valid.df$Personal.Loan)),
+            c(0, gain$cume.obs))
   c.0..gain.cume.pct.of.total...sum.valid.df.Personal.Loan.. c.0..gain.cume.obs.
1                                                            0                   0
2                                                          151                 200
3                                                          172                 400
4                                                          180                 600
5                                                          185                 800
6                                                          187                1000
7                                                          187                1200
8                                                          189                1400
9                                                          190                1600
10                                                         191                1800
11                                                         191                2000

> data.frame(c(0, sum(valid.df$Personal.Loan)), c(0, dim(valid.df)[1]))
  c.0..sum.valid.df.Personal.Loan.. c.0..dim.valid.df..1..
1                                 0                      0
2                               191                   2000

# plot lift chart
plot(c(0, gain$cume.pct.of.total*sum(valid.df$Personal.Loan)) ~ c(0, gain$cume.obs),
     xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0, sum(valid.df$Personal.Loan)) ~ c(0, dim(valid.df)[1]), lty=2)

[Figure: Lift chart produced by the code above; x-axis: # cases (0 to 2000); y-axis: Cumulative; the dashed line is the benchmark with no model]

> # compute deciles and plot decile-wise chart
> heights <- gain$mean.resp/mean(valid.df$Personal.Loan)

> gain$mean.resp
 [1] 0.755 0.105 0.040 0.025 0.010 0.000 0.010 0.005 0.005 0.000
> gain$mean.resp*200
 [1] 151  21   8   5   2   0   2   1   1   0
> mean(valid.df$Personal.Loan)
[1] 0.0955
> heights
 [1] 7.90575916 1.09947644 0.41884817 0.26178010 0.10471204 0.00000000 0.10471204
 [8] 0.05235602 0.05235602 0.00000000

midpoints <- barplot(heights, names.arg = gain$depth, ylim = c(0,9),
                     xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise lift chart")

# add labels to columns
text(midpoints, heights+0.5, labels=round(heights, 1), cex = 0.8)

[Figure: Decile-wise lift chart; x-axis: Percentile (10 to 100); y-axis: Mean Response; bar labels: 7.9, 1.1, 0.4, 0.3, 0.1, 0, 0.1, 0.1, 0.1, 0]

In gains package:
Actual: a numeric vector of actual response values
Predicted: a numeric vector of predicted response values. This vector must have the same length
as actual, and the ith value of this vector needs to be the model score for the subject with the ith
value of the actual vector as its actual response.
Groups: an integer containing the number of rows in the gains table. The default value is 10.

Multicollinearity

Problem: As in linear regression, if one predictor is a linear combination of other predictor(s),
model estimation will fail

Note that in such a case, we have at least one redundant predictor

Solution: Remove extreme redundancies (by dropping predictors via variable selection, or by
data reduction methods such as PCA)
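The redundancy problem can be reproduced on simulated data. In the sketch below, `x3` is constructed as an exact linear combination of `x1` and `x2`; all names and values are hypothetical. In R, glm does not stop with an error in this situation: it cannot estimate the redundant coefficient and reports it as NA.

```r
set.seed(1)

# Simulated predictors (illustration only); x3 is exactly redundant
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + 2 * x2
y  <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))

fit <- glm(y ~ x1 + x2 + x3, family = "binomial")

# The coefficient of the redundant predictor is not estimable
print(coef(fit))

# Dropping the redundant predictor gives a fully estimable model
fit_reduced <- glm(y ~ x1 + x2, family = "binomial")
print(coef(fit_reduced))
```

The summary output of such a fit also notes that coefficients are "not defined because of singularities", which is a signal to remove or combine predictors.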

Variable Selection
This is the same issue as in linear regression
1. The number of correlated predictors can grow when we create derived variables such as
interaction terms (e.g. Income x Family), to capture more complex relationships
2. Problem: Overly complex models have the danger of overfitting
3. Solution: Reduce variables via automated selection of variable subsets (as with linear
regression)
4. See Chapter 6

P-values for Predictors

1. Test null hypothesis that coefficient = 0
2. Useful for review to determine whether to include variable in model
3. Important in profiling tasks, but less important in predictive classification
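The p-values in the glm summary come from a two-sided normal test on z = estimate / standard error. The sketch below recomputes them for two rows of the output in these notes (Age and Experience); the object names are chosen for this illustration.

```r
# Estimates and standard errors taken from the glm summary in these notes
est <- c(Age = -0.0369346, Experience = 0.0490645)
se  <- c(Age =  0.0848937, Experience = 0.0844410)

# Wald z statistic and two-sided p-value for H0: coefficient = 0
z <- est / se
p_value <- 2 * pnorm(-abs(z))

print(round(cbind(estimate = est, std.error = se, z = z, p = p_value), 5))
```

Both p-values are well above common significance levels, so neither Age nor Experience shows evidence of a nonzero coefficient once the other predictors are in the model.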

Summary
1. Logistic regression is similar to linear regression, except that it is used with a categorical
response
2. It can be used for explanatory tasks (=profiling) or predictive tasks (=classification)
3. The predictors are related to the response Y via a nonlinear function called the logit
4. As in linear regression, reducing predictors can be done via variable selection
5. Logistic regression can be generalized to more than two classes

Problems
1. Financial Condition of Banks. The file Banks.csv includes data on a sample of
20 banks. The "Financial Condition" column records the judgment of an expert on the
financial condition of each bank. This outcome variable takes one of two possible
values (weak or strong) according to the financial condition of the bank. The predictors
are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of
total loans and leases to total assets and TotExp/Assets is the ratio of total expenses to
total assets. The target is to use the two ratios for classifying the financial condition of a
new bank.
Run a logistic regression model (on the entire dataset) that models the status of a bank as
a function of the two financial measures provided. Specify the success class as weak (this
is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and
use the default cutoff value of 0.5.

a. Write the estimated equation that associates the financial condition of a bank with
its two predictors in three formats:
i. The logit as a function of the predictors
ii. The odds as a function of the predictors
iii. The probability as a function of the predictors
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and
total expenses/assets ratio = 0.11. From your logistic regression model, estimate
the following four quantities for this bank (use R to do all the intermediate
calculations; show your final answers to four decimal places): the logit, the odds,
the probability of being financially weak, and the classification of the bank (use
cutoff = 0.5).
c. The cutoff value of 0.5 is used in conjunction with the probability of being
financially weak. Compute the threshold that should be used if we want to make a
classification based on the odds of being financially weak, and the threshold for
the corresponding logit.
d. Interpret the estimated coefficient for the total loans & leases to total
assets ratio (TotLns&Lses/Assets) in terms of the odds of being financially weak.
e. When a bank that is in poor financial condition is misclassified as
financially strong, the misclassification cost is much higher than when a
financially strong bank is misclassified as weak. To minimize the expected cost of
misclassification, should the cutoff value for classification (which is currently at
0.5) be increased or decreased?
2. Identifying Good System Administrators. A management consultant is studying
the roles played by experience and training in a system administrator's ability to
complete a set of tasks in a specified amount of time. In particular, she is interested in
discriminating between administrators who are able to complete given tasks within a
specified time and those who are not. Data are collected on the performance of 75
randomly selected administrators. They are stored in the file SystemAdministrators.csv.
The variable Experience measures months of full-time system administrator experience,
while Training measures the number of relevant training credits. The outcome variable
Completed is either Yes or No, according to whether or not the administrator completed
the tasks.
a. Create a scatter plot of Experience vs. Training using color or symbol to
distinguish programmers who completed the task from those who did not
complete it. Which predictor(s) appear(s) potentially useful for classifying task
completion?
b. Run a logistic regression model with both predictors using the entire dataset as
training data. Among those who completed the task, what is the percentage of
programmers incorrectly classified as failing to complete the task?
c. To decrease the percentage in part (b), should the cutoff probability be increased
or decreased?
d. How much experience must be accumulated by a programmer with 4 years of
training before his or her estimated probability of completing the task exceeds
0.5?
3. Sales of Riding Mowers. A company that manufactures riding mowers wants to
identify the best sales prospects for an intensive sales campaign. In particular, the
manufacturer is interested in classifying households as prospective owners or nonowners
on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The marketing expert
looked at a random sample of 24 households, given in the file RidingMowers.csv. Use all
the data to fit a logistic regression of ownership on the two predictors.
a. What percentage of households in the study were owners of a riding mower?
b. Create a scatter plot of Income vs. Lot Size using color or symbol to distinguish
owners from nonowners. From the scatter plot, which class seems to have a
higher average income, owners or nonowners?
c. Among nonowners, what is the percentage of households classified correctly?
d. To increase the percentage of correctly classified nonowners, should the cutoff
probability be increased or decreased?
e. What are the odds that a household with a $60K income and a lot size of 20,000
ft² is an owner?
f. What is the classification of a household with a $60K income and a lot size of
20,000 ft²? Use cutoff = 0.5.
g. What is the minimum income that a household with 16,000 ft² lot size should
have before it is classified as an owner?
