Multiple logistic regression models:
an example
Introduction
ML estimation
  Dummy variable coding schemes
  disease vs. age, area & status - R summary output
  disease vs. age, area & status - Estimated odds ratios
Hypothesis testing
  $H_0 : \beta_{\text{age}} = \beta_{\text{Sect2}} = \beta_{\text{Middle}} = \beta_{\text{Upper}} = 0$
  Likelihood ratio test
  $H_0 : \beta_{\text{Middle}} = \beta_{\text{Upper}} = 0$
  Likelihood ratio test
  $H_0 : \beta_{\text{age}} = 0$
  Likelihood ratio test
Choice of the dummy coding schemes
  Change in the coding scheme for the dependent variable
  disease vs. age, area & status - R summary output
  Change in the coding scheme for a regressor
  disease vs. age, area & status - R summary output
Introduction
In a health study investigating an epidemic outbreak of a mosquito-borne disease in a city, 98 individuals were
randomly sampled (see also Kutner et al., 2005, Chapter 14). For each individual, the following variables were
recorded:
■ disease: absence/presence of specific symptoms associated with the disease,
■ age: age of the individual (years),
■ area: sector of the city in which the individual lives (two categories: sector 1/sector 2),
■ status: socio-economic status of the household to which the individual belongs (three categories: lower/middle/upper).
Is there a significant association between the presence of the disease symptoms and any of the regressors?
Stat. Mod. Giuliano Galimberti – 2
ML estimation
Dummy variable coding schemes
■ disease:
              y
    absence   0
    presence  1
  ⇒ $\pi_i = P(y_i = 1) = P(\text{disease}_i = \text{present}), \quad i = 1, \dots, n$
■ area:
              areaSect2
    Sector 1  0
    Sector 2  1
■ status:
            statusMiddle  statusUpper
    Lower   0             0
    Middle  1             0
    Upper   0             1
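The coding scheme above can be sketched in a few lines of Python. The helper `dummy_code` and its arguments are hypothetical names introduced only for illustration; they are not part of the original analysis, which was carried out in R.

```python
# Hypothetical helper (not from the original analysis): builds the 0/1 dummy
# variables for one categorical value, with `reference` as the baseline level
# (the level for which every dummy equals 0).
def dummy_code(prefix, value, levels, reference):
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {
        f"{prefix}{level}": int(value == level)
        for level in levels
        if level != reference
    }

status_levels = ["Lower", "Middle", "Upper"]

# Print the coding of each status category with Lower as the reference.
for status in status_levels:
    print(status, dummy_code("status", status, status_levels, reference="Lower"))
```

A variable with k categories needs only k - 1 dummies: the reference category is identified by all dummies being 0.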
disease vs. age, area & status - R summary output
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.618 0.613 -4.270 0.000
age 0.030 0.014 2.203 0.028
areaSect2 1.575 0.502 3.139 0.002
statusMiddle 0.714 0.654 1.092 0.275
statusUpper 0.305 0.604 0.505 0.613
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 101.05 on 93 degrees of freedom
AIC: 111.05
Note that:
■ the null deviance corresponds, up to a constant, to minus twice the maximized log-likelihood for a logistic regression model
that contains only the intercept (without regressors)
■ the residual deviance corresponds, up to a constant, to minus twice the maximized log-likelihood of the fitted model
■ for a multiple logistic regression model, AIC is given, up to a constant, by the residual deviance plus twice the number of model
parameters
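As a quick check of the last note, the reported AIC values can be reproduced from the residual deviance and the number of parameters. This is a minimal sketch: the function name `aic_from_deviance` is an assumption made for illustration, and the numbers are consistent with the additive constant being zero in this output.

```python
# Illustrative helper (hypothetical name): AIC as residual deviance plus
# twice the number of estimated parameters.
def aic_from_deviance(residual_deviance, n_parameters):
    return residual_deviance + 2 * n_parameters

# Fitted model: intercept + age + areaSect2 + statusMiddle + statusUpper.
full_aic = aic_from_deviance(101.05, 5)
# Intercept-only model: its residual deviance equals the null deviance.
null_aic = aic_from_deviance(122.32, 1)

print(f"full model AIC: {full_aic:.2f}")
print(f"null model AIC: {null_aic:.2f}")
```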
disease vs. age, area & status - Estimated odds ratios
              $b_k$   $\exp(b_k)$
age           0.030   1.0302
areaSect2     1.575   4.8295
statusMiddle  0.714   2.0422
statusUpper   0.305   1.3570
■ the odds of an individual having contracted the disease increase by about 3.0 percent with each additional year of age, for given
city sector location and socio-economic status
■ the odds of an individual from sector 2 having contracted the disease are almost five times as great as for an individual from
sector 1, for given age and socio-economic status
■ the odds of an individual with middle socio-economic status having contracted the disease are about twice as great as
for an individual with lower socio-economic status, for given age and city sector location
■ the odds of an individual with upper socio-economic status having contracted the disease are about 36 percent larger than the
odds of an individual with lower socio-economic status, for given age and city sector location
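The odds ratios can be recomputed by exponentiating the estimated coefficients. The sketch below uses the coefficients as printed (rounded to three decimals), so the results may differ from the table in the third or fourth decimal place; the table values appear to come from unrounded estimates.

```python
import math

# Estimated coefficients as reported in the R summary output (rounded).
coefficients = {
    "age": 0.030,
    "areaSect2": 1.575,
    "statusMiddle": 0.714,
    "statusUpper": 0.305,
}

# Odds ratio for a one-unit increase in each regressor, other regressors fixed.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    percent_change = 100 * (ratio - 1)
    print(f"{name:<13} OR = {ratio:.4f}  change in odds = {percent_change:+.1f}%")
```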
Hypothesis testing
$H_0 : \beta_{\text{age}} = \beta_{\text{Sect2}} = \beta_{\text{Middle}} = \beta_{\text{Upper}} = 0$
■ Full model:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.618 0.613 -4.270 0.000
age 0.030 0.014 2.203 0.028
areaSect2 1.575 0.502 3.139 0.002
statusMiddle 0.714 0.654 1.092 0.275
statusUpper 0.305 0.604 0.505 0.613
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 101.05 on 93 degrees of freedom
AIC: 111.05
■ Reduced model:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.771 0.217 -3.548 0.000
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 122.32 on 97 degrees of freedom
AIC: 124.32
Likelihood ratio test
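Model                      Resid. Df  Resid. Dev  Df  Deviance  Pr(>Chi)
disease~1                  97         122.32
disease~age+sector+status  93         101.05      4   21.26     0.0003
$$2 \ln \frac{L(F)}{L(R)} = -2 \left[ \ln L(R) - \ln L(F) \right] = 122.32 - 101.05 \approx 21.26$$
where $L(F)$ and $L(R)$ denote the maximized likelihoods of the full and reduced models, respectively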
■ At least one of the three regressors is significantly associated with the presence of the disease (at a significance level α = 0.01)
■ Note that the degrees of freedom for this test statistic are equal to 4, since 4 regression coefficients are set equal to 0,
according to H0
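The p-value of this test can be checked without statistical software, because the upper-tail probability of a chi-square distribution with an even number of degrees of freedom has a closed form. The function name `chi2_sf_even_df` is an assumption made for this sketch; the deviances are the rounded values from the output above.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

# Likelihood ratio statistic: difference between the residual deviances
# of the reduced (intercept-only) and full models.
lr_statistic = 122.32 - 101.05
# Degrees of freedom: difference between the residual degrees of freedom.
df = 97 - 93

p_value = chi2_sf_even_df(lr_statistic, df)
print(f"LR statistic = {lr_statistic:.2f}, df = {df}, p-value = {p_value:.4f}")
```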
$H_0 : \beta_{\text{Middle}} = \beta_{\text{Upper}} = 0$
■ Full model:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.618 0.613 -4.270 0.000
age 0.030 0.014 2.203 0.028
areaSect2 1.575 0.502 3.139 0.002
statusMiddle 0.714 0.654 1.092 0.275
statusUpper 0.305 0.604 0.505 0.613
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 101.05 on 93 degrees of freedom
AIC: 111.05
■ Reduced model:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.335 0.511 -4.569 0.000
age 0.029 0.013 2.224 0.026
areaSect2 1.673 0.487 3.434 0.001
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 102.26 on 95 degrees of freedom
AIC: 108.26
Likelihood ratio test
Model                      Resid. Df  Resid. Dev  Df  Deviance  Pr(>Chi)
disease~age+sector         95         102.26
disease~age+sector+status  93         101.05      2   1.21      0.5474
$$2 \ln \frac{L(F)}{L(R)} = -2 \left[ \ln L(R) - \ln L(F) \right] = 102.26 - 101.05 = 1.21$$
■ There are no significant differences in the probability of having the disease among the three categories of socio-economic
status, for given age and city sector location
■ Note that the degrees of freedom for this test statistic are equal to 2, since 2 regression coefficients are set equal to 0, in order
to exclude the socio-economic status from the full model
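With 2 degrees of freedom the chi-square upper-tail probability reduces to exp(-x/2), so this p-value is easy to verify by hand. The sketch below uses the rounded deviances from the output, which is why the result may differ slightly from the value printed by R.

```python
import math

# Likelihood ratio statistic for dropping status (2 dummy coefficients).
lr_statistic = 102.26 - 101.05
df = 95 - 93

# For a chi-square variable with 2 degrees of freedom, P(X > x) = exp(-x / 2).
p_value = math.exp(-lr_statistic / 2.0)

print(f"LR statistic = {lr_statistic:.2f}, df = {df}, p-value = {p_value:.3f}")
```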
$H_0 : \beta_{\text{age}} = 0$
■ Full model:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.618 0.613 -4.270 0.000
age 0.030 0.014 2.203 0.028
areaSect2 1.575 0.502 3.139 0.002
statusMiddle 0.714 0.654 1.092 0.275
statusUpper 0.305 0.604 0.505 0.613
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 101.05 on 93 degrees of freedom
AIC: 111.05
■ Reduced model:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.917 0.481 -3.984 0.000
areaSect2 1.620 0.486 3.336 0.001
statusMiddle 0.713 0.636 1.120 0.263
statusUpper 0.478 0.583 0.820 0.412
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 106.20 on 94 degrees of freedom
AIC: 114.2
Likelihood ratio test
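Model                      Resid. Df  Resid. Dev  Df  Deviance  Pr(>Chi)
disease~sector+status      94         106.20
disease~age+sector+status  93         101.05      1   5.15      0.0233
$$2 \ln \frac{L(F)}{L(R)} = -2 \left[ \ln L(R) - \ln L(F) \right] = 106.20 - 101.05 = 5.15$$
This value is close to the squared Wald statistic for age:
$$\frac{b_{\text{age}}^2}{s^2[b_{\text{age}}]} = \frac{0.03^2}{0.014^2} = 4.854$$
(the value 4.854 is obtained from the unrounded estimates; it equals the square of the z value 2.203 reported for age in the full model)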
The age of an individual has a significant effect on the probability of having the disease, for given city sector location and
socio-economic status, at a significance level of α = 0.05 but not at α = 0.01
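The comparison between the likelihood ratio test and the Wald test for age can be sketched as follows. With 1 degree of freedom, the chi-square upper-tail probability can be written with the complementary error function; the helper name `chi2_sf_1df` is an assumption made for illustration, and the inputs are the rounded values from the output.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

# Likelihood ratio statistic: reduced model (no age) vs. full model.
lr_statistic = 106.20 - 101.05
# Wald statistic: square of the z value reported for age in the full model.
wald_statistic = 2.203 ** 2

lr_p = chi2_sf_1df(lr_statistic)
wald_p = chi2_sf_1df(wald_statistic)

print(f"LR:   statistic = {lr_statistic:.3f}, p-value = {lr_p:.4f}")
print(f"Wald: statistic = {wald_statistic:.3f}, p-value = {wald_p:.4f}")
```

The Wald p-value obtained this way is the two-sided p-value of the z test, which is the quantity reported in the Pr(>|z|) column for age.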
Choice of the dummy coding schemes
Change in the coding scheme for the dependent variable
■ Original coding scheme:
              y
    absence   0
    presence  1
■ Alternative coding scheme:
              y
    absence   1
    presence  0
  ⇒ $\pi_i = P(y_i = 1) = P(\text{disease}_i = \text{absent}), \quad i = 1, \dots, n$
disease vs. age, area & status - R summary output
■ Original coding scheme:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.618 0.613 -4.270 0.000
age 0.030 0.014 2.203 0.028
areaSect2 1.575 0.502 3.139 0.002
statusMiddle 0.714 0.654 1.092 0.275
statusUpper 0.305 0.604 0.505 0.613
Residual deviance: 101.05 on 93 degrees of freedom
■ Alternative coding scheme:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.618 0.613 4.270 0.000
age -0.030 0.014 -2.203 0.028
areaSect2 -1.575 0.502 -3.139 0.002
statusMiddle -0.714 0.654 -1.092 0.275
statusUpper -0.305 0.604 -0.505 0.613
Residual deviance: 101.05 on 93 degrees of freedom
The two models are equivalent (they have the same residual deviance): the change in the coding scheme affects only the signs of
the regression coefficients
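The sign change follows from the identity logit(1 - π) = -logit(π): modelling the probability of absence instead of presence negates the linear predictor. The sketch below illustrates this numerically with the reported coefficients; the individual used (40 years old, sector 2, middle status) is a hypothetical example, and the function names are assumptions.

```python
import math

def logistic(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(coefs, age, sect2, middle, upper):
    return (coefs["(Intercept)"] + coefs["age"] * age
            + coefs["areaSect2"] * sect2
            + coefs["statusMiddle"] * middle
            + coefs["statusUpper"] * upper)

# Coefficients under the original coding (y = 1 for presence).
presence_coefs = {"(Intercept)": -2.618, "age": 0.030, "areaSect2": 1.575,
                  "statusMiddle": 0.714, "statusUpper": 0.305}
# Coefficients under the alternative coding (y = 1 for absence): signs flipped.
absence_coefs = {name: -b for name, b in presence_coefs.items()}

# Hypothetical individual: 40 years old, sector 2, middle socio-economic status.
eta_presence = linear_predictor(presence_coefs, 40, 1, 1, 0)
eta_absence = linear_predictor(absence_coefs, 40, 1, 1, 0)

p_presence = logistic(eta_presence)
p_absence = logistic(eta_absence)
print(f"P(presence) = {p_presence:.3f}, P(absence) = {p_absence:.3f}")
```

Both codings give the same fitted probabilities for each outcome, which is why the residual deviance is unchanged.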
Change in the coding scheme for a regressor
status
■ Original coding scheme:
            statusMiddle  statusUpper
    Lower   0             0
    Middle  1             0
    Upper   0             1
  Reference category: Lower
■ Alternative coding scheme:
            status1  status2
    Lower   1        0
    Middle  0        1
    Upper   0        0
  Reference category: Upper
disease vs. age, area & status - R summary output
■ Original coding scheme:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.618 0.613 -4.270 0.000
age 0.030 0.014 2.203 0.028
areaSect2 1.575 0.502 3.139 0.002
statusMiddle 0.714 0.654 1.092 0.275
statusUpper 0.305 0.604 0.505 0.613
Residual deviance: 101.05 on 93 degrees of freedom
■ Alternative coding scheme:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.313 0.643 -3.599 0.000
age 0.030 0.014 2.203 0.028
areaSect2 1.575 0.502 3.139 0.002
status1 -0.305 0.604 -0.505 0.613
status2 0.409 0.599 0.682 0.495
Residual deviance: 101.05 on 93 degrees of freedom
The two models are equivalent: the change in the coding scheme for status affects only the intercept and the coefficients
associated with the corresponding dummy variables
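The coefficients under the alternative coding can be derived from the original ones by requiring that both parameterizations give the same linear predictor for each status category. A minimal sketch of this arithmetic, using the rounded estimates from the output (the variable names are assumptions):

```python
# Original coding (reference category: Lower).
b_intercept = -2.618
b_middle = 0.714   # Middle vs. Lower
b_upper = 0.305    # Upper vs. Lower

# Alternative coding (reference category: Upper).
# Upper:  b_intercept + b_upper  = new_intercept
# Lower:  b_intercept            = new_intercept + new_status1
# Middle: b_intercept + b_middle = new_intercept + new_status2
new_intercept = b_intercept + b_upper
new_status1 = -b_upper             # Lower vs. Upper
new_status2 = b_middle - b_upper   # Middle vs. Upper

print(f"(Intercept) = {new_intercept:.3f}")
print(f"status1     = {new_status1:.3f}")
print(f"status2     = {new_status2:.3f}")
```

The coefficients of age and areaSect2 are not involved in these identities, so they stay the same under both coding schemes.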