15-GLM

This lecture covers two GLM models: logistic regression and Poisson regression. We start with the exponential family, since here the response is not normally distributed.
Recall: Multivariate Exponential Family

Let Y = (Y₁, …, Yₙ) be a random sample with joint pdf

f(y; θ) = c(θ) h(y) exp{ Σ_{j=1}^{k} t_j(y) q_j(θ) }

where θ is a k-parameter vector. GLMs assume the response distribution belongs to this type of family; the function g() that connects the mean of the response to the linear predictor is called the link function.
GLM

More specifically, GLM assumes that the response variable has a pdf from the exponential dispersion model (EDM, aka generalized linear model) family:

f(yᵢ; θ, φ) = a(yᵢ, φ) exp{ (yᵢθ − κ(θ)) / φ },  i = 1, …, n

Match with the normal distribution: θ = μ, κ(θ) = θ²/2, φ = σ², and

a(y, φ) = exp{ −y²/(2σ²) } / √(2πσ²)

is the normalizing function.
Other examples are: Exponential, Gamma, Binomial, …
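As a sanity check on the decomposition above, the normal pdf can be rebuilt from its EDM components. A minimal base-R sketch (the function name edm_normal and the test point are illustrative):

```r
# EDM building blocks for the normal distribution:
# theta = mu, kappa(theta) = theta^2 / 2, phi = sigma^2
edm_normal <- function(y, mu, sigma2) {
  theta <- mu
  kappa <- theta^2 / 2
  a <- exp(-y^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)  # normalizing function a(y, phi)
  a * exp((y * theta - kappa) / sigma2)
}

# Matches the usual normal density dnorm(y, mu, sqrt(sigma2))
edm_normal(1.3, mu = 0.5, sigma2 = 2)
```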
GLM: Moment Generating and Cumulant Functions

Theorem: For an EDM, the moment generating function is

M(t) = exp{ [κ(θ + tφ) − κ(θ)] / φ }

and the cumulant generating function is

K(t) = log M(t) = [κ(θ + tφ) − κ(θ)] / φ

Moments are easier to obtain with the cumulant function: its derivatives evaluated at 0 give the cumulants, the first being the mean and the second the variance (the third and fourth follow from the mgf in the same way).

Example: Normal distribution, with κ(μ) = μ²/2. Then

K(t) = (μ + tσ²)²/(2σ²) − μ²/(2σ²) = μt + σ²t²/2

The first derivative at 0 gives the mean μ; the second derivative gives the variance σ². Substituting into the formula thus recovers exactly the cumulant generating function of the normal distribution.
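The claim that the first two derivatives of K(t) at 0 give the mean and variance can be checked numerically. A small sketch with illustrative values of μ and σ², using central differences:

```r
# Cumulant generating function of N(mu, sigma^2): K(t) = mu*t + sigma^2*t^2/2
mu <- 2; sigma2 <- 3
K <- function(t) mu * t + sigma2 * t^2 / 2

h  <- 1e-4
K1 <- (K(h) - K(-h)) / (2 * h)          # first derivative at 0 -> mean
K2 <- (K(h) - 2 * K(0) + K(-h)) / h^2   # second derivative at 0 -> variance
c(K1, K2)                               # approximately mu and sigma2
```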
GLM: Mean and Variance

Theorem: For an EDM with cumulant function κ(θ),

E(Y) = μ = dκ(θ)/dθ

Var(Y) = φ d²κ(θ)/dθ²

where

d²κ(θ)/dθ² = (d/dθ)(dκ(θ)/dθ) = dμ/dθ

Terminology: the second derivative can also be written as a function of the mean, V(μ), called the variance function, so that Var(Y) = φ V(μ).

Example: Normal distribution, κ(θ) = θ²/2, so

E(Y) = dκ(θ)/dθ = θ = μ,  V(μ) = 1  ⇒  Var(Y) = σ²
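The theorem can also be illustrated with the Poisson family (not derived on the slide, but a standard EDM: θ = log μ, κ(θ) = eᶿ, φ = 1), where differentiating κ twice shows V(μ) = μ. A numerical check with an illustrative μ = 4:

```r
# Poisson as an EDM: theta = log(mu), kappa(theta) = exp(theta), phi = 1
kappa <- function(theta) exp(theta)
theta <- log(4)   # so mu = 4
h     <- 1e-4

mu_hat  <- (kappa(theta + h) - kappa(theta - h)) / (2 * h)                 # dkappa/dtheta = E(Y)
var_hat <- (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h^2  # d^2 kappa/dtheta^2 = Var(Y)
c(mu_hat, var_hat)   # both approximately 4: for the Poisson, V(mu) = mu
```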
GLM: Unit Deviance

Suppose we want to write the EDM f(y; θ, φ) = a(y, φ) exp{ (yθ − κ(θ)) / φ } as a function of the mean μ (θ depends on μ through the transformation μ = dκ(θ)/dθ). Denote

t(y, μ) = yθ − κ(θ)

Then the unit deviance is

d(y, μ) = 2[ t(y, y) − t(y, μ) ]

It measures how far the predicted mean μ (in the natural parameter space) is from the observed value y (in the data space). Typical response types handled by EDMs: continuous data (normal), 0/1 outcomes (binomial), counts (Poisson), and skewed positive data (gamma).
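For the normal EDM the unit deviance reduces to the familiar squared error. A short sketch, plugging θ = μ and κ(θ) = θ²/2 into the definitions above (the test values are illustrative):

```r
# Unit deviance for the normal EDM:
# t(y, mu) = y*theta - kappa(theta), with theta = mu, kappa(theta) = theta^2/2
t_fun <- function(y, mu) y * mu - mu^2 / 2
d_fun <- function(y, mu) 2 * (t_fun(y, y) - t_fun(y, mu))

# Algebra: d(y, mu) = 2*(y^2/2 - y*mu + mu^2/2) = (y - mu)^2
d_fun(3.2, 1.7)   # equals (3.2 - 1.7)^2 = 2.25
```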
Classification and Logistic Regression

The linear regression model assumes the response variable Y is quantitative (numerical) and the error terms are normally distributed. If, instead, the response variable is qualitative (or categorical), the task of predicting responses is also known as classification. In such cases the error terms are not normally distributed. Classification is a broad topic that also includes many other methods (SVMs, etc.).
Linear Regression Approach for Binary Response

Yᵢ = β₀ + β₁Xᵢ₁ + β₂Xᵢ₂ + ⋯ + β_p Xᵢ,ₚ + εᵢ,  εᵢ ~ N(0, σ²)

• The expected values for the linear probability model from the RHS are unbounded: we cannot restrict the linear combination of the predictors to lie between 0 and 1.
• But for a Bernoulli outcome from the LHS we have:

E(Yᵢ | Xᵢ = xᵢ) = πᵢ = P(Yᵢ = 1 | Xᵢ = xᵢ) ∈ [0, 1]

The normality assumption on the errors fails as well.
Problems with Linear Regression (Continued)

For a binary response the error term can take only two values, which differ with x:

εᵢ = −p(xᵢ)  if Yᵢ = 0
εᵢ = 1 − p(xᵢ)  if Yᵢ = 1

• The model above, however, has a range equal to (−∞, ∞). To restrict our range to (0, 1), we need a function f: (−∞, ∞) → (0, 1). Ideally, the function will be simple to write, continuous, and monotonic. The standard logistic function is a prime candidate:

f(t) = exp(t) / (1 + exp(t))
Logistic/Sigmoid Function

The sigmoid function resembles an S-shaped curve. It takes real-numbered input values and converts them to values between 0 and 1 (by shrinking from both sides, i.e., the very negative values toward 0 and the very high positive ones toward 1). Note that a sigmoidal response function can be decreasing as well.
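A quick sketch of this squeezing behavior in base R (plogis is R's built-in standard logistic function; sigmoid below just spells it out):

```r
# The standard logistic (sigmoid) function maps (-Inf, Inf) into (0, 1)
sigmoid <- function(t) exp(t) / (1 + exp(t))   # identical to plogis(t)

sigmoid(c(-10, 0, 10))
# very negative inputs are squeezed toward 0, very positive ones toward 1,
# and sigmoid(0) = 0.5
```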
Sigmoidal Response Functions

Two common choices (both are also used in neural networks):

f(t) = exp(t) / (1 + exp(t))   (logistic)

f(t) = Φ(t)   (probit)

where Φ is the cdf of the standard normal distribution. We use the logistic model when we want to interpret the results. The logistic model can be written in three equivalent ways:

1. Probability: the probability depends on the predictors through the logistic function,

p(X) = exp(β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p) / [1 + exp(β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p)]

2. Odds:

p(X) / (1 − p(X)) = exp(β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p)

The odds of the event are easier to interpret than the probability, but this is no longer an additive model: a one-unit increase in predictor X_j has a multiplicative effect of e^{β_j} on the odds.

3. Log-odds: taking logs of the odds recovers the linear predictor itself.

All three forms are equivalent.
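The equivalence of the three scales is easy to verify numerically. A sketch with illustrative coefficient values (b0, b1 and x below are made up):

```r
# Three equivalent scales of the logistic model for one predictor
b0 <- -1.5; b1 <- 0.8; x <- 2

eta  <- b0 + b1 * x        # 3. log-odds: linear (additive) in x
odds <- exp(eta)           # 2. odds: multiplicative in x
p    <- odds / (1 + odds)  # 1. probability, via the logistic function

# Round trip back to the log-odds confirms the equivalence
log(p / (1 - p))           # equals eta
```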
Use the “glm” function to fit GLM models. The “family” option determines the distribution to be used, and hence the likelihood. Choices include:

• binomial(link = "logit")
• Gamma(link = "inverse")
• inverse.gaussian(link = "1/mu^2")
• poisson(link = "log")
• quasipoisson(link = "log")

See R code.
Logistic Regression Example: Some Results

> logit0 = glm(y ~ balance, data = bank, family = "binomial")
> summary(logit0)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.081e+00  1.595e-02 -130.50   <2e-16 ***
balance      3.958e-05  3.840e-06   10.31   <2e-16 ***

The fitted model on the log-odds scale is

log[ p̂(Xᵢ) / (1 − p̂(Xᵢ)) ] = β̂₀ + β̂₁Xᵢ
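The balance slope is easiest to read on the odds scale. A quick computation with the estimate from the output above (the choice of a 1000-euro increment is illustrative):

```r
# Estimated slope for balance from the summary output above
b_balance <- 3.958e-05

# Multiplicative change in the odds of subscribing per extra 1000 euros of balance
exp(1000 * b_balance)   # about 1.04, i.e. roughly a 4% increase in the odds
```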
Logistic Regression Example: Dichotomous Covariate

• Let’s use only the housing variable:

> logit0 = glm(y ~ housing, data = bank, family = "binomial")
> summary(logit0)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.60687    0.01892  -84.93   <2e-16 ***
housingyes  -0.87696    0.03030  -28.95   <2e-16 ***

log[ p̂ / (1 − p̂) ] = −1.60687 − 0.87696 × Housing

The negative coefficient on housingyes means that customers who have a mortgage are less likely to subscribe.
Logistic Regression Example: Dichotomous Covariate

Compare the equation to cross-tabs results:

log[ p̂ / (1 − p̂) ] = −1.60687 − 0.87696 × Housing

Calculate the probability, odds and log-odds for each category:

             No housing loan   Housing loan
Probability         ?                ?
Odds                ?                ?
Log-odds            ?                ?

To get from the log-odds to the odds, exponentiate; to get from the odds to the probability, apply the sigmoid function.
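The table can be filled in directly from the fitted coefficients above. A base-R sketch (the labels no_loan/loan are just for readability):

```r
# Fitted coefficients from the housing model above
b0 <- -1.60687; b1 <- -0.87696

logodds <- c(no_loan = b0, loan = b0 + b1)
odds    <- exp(logodds)        # exponentiate the log-odds
prob    <- odds / (1 + odds)   # sigmoid: odds -> probability

rbind(prob, odds, logodds)
# odds are approximately 0.2005 (no loan) and 0.0834 (loan)
odds[["no_loan"]] / odds[["loan"]]   # odds ratio, approximately 2.40
```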
Logistic Regression Example: Dichotomous Covariate

This interpretation is important:

odds(A) = prob(A happening) / prob(A not happening)

Odds = 0.2 = 1:5 means the likelihood of subscribing to the service is 0.2 times the likelihood of not subscribing, within the no-housing-loan group. Or, you can say that not subscribing is 5 times more likely than subscribing within the no-housing group.

• The odds ratio is the ratio of odds for cases with different x values:

Odds ratio = 0.200514 / 0.083423 = 2.4036

This means that in the absence of a house loan, the odds of subscribing are 2.4 times higher, compared to the group which has a house loan.
Logistic Regression: Estimating Betas

• Just as in simple regression, the coefficients β₀ and β₁ are unknown and must be estimated using the available data. Maximum likelihood is the most commonly used method for this problem.
• When Yᵢ | xᵢ ~ Bernoulli(p(xᵢ)), the likelihood function is:

L(β₀, β₁) = ∏_{i=1}^{n} p(xᵢ)^{yᵢ} [1 − p(xᵢ)]^{1−yᵢ}

Since p(xᵢ)/(1 − p(xᵢ)) = e^{β₀+β₁xᵢ} ⇒ 1 − p(xᵢ) = 1/(1 + e^{β₀+β₁xᵢ}), and therefore

L(β₀, β₁) = ∏_{i=1}^{n} e^{(β₀+β₁xᵢ)yᵢ} · 1/(1 + e^{β₀+β₁xᵢ})

⇒ ℓ(β₀, β₁) = Σ_{i=1}^{n} yᵢ(β₀ + β₁xᵢ) − Σ_{i=1}^{n} log(1 + e^{β₀+β₁xᵢ})

Taking the gradient shows that the likelihood equations cannot be solved analytically.
Logistic Regression: Estimating Betas

• The logistic regression loglikelihood is:

ℓ(β₀, β₁) = Σ_{i=1}^{n} yᵢ(β₀ + β₁xᵢ) − Σ_{i=1}^{n} log(1 + e^{β₀+β₁xᵢ})

⇒ ∂ℓ(β₀, β₁)/∂β₀ = Σ_{i=1}^{n} [ yᵢ − e^{β₀+β₁xᵢ}/(1 + e^{β₀+β₁xᵢ}) ]

• Unlike the closed-form analytical solution (i.e., the normal equations) available for linear regression, the score function for logistic regression is transcendental. That is, there is no closed-form solution, so we must solve with numerical optimization methods such as Newton’s method (IRLS), which runs over several iterations.
• Although we have focused on simple logistic regression, this likelihood (and score) function generalizes directly to the case of p predictors.
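The numerical maximization can be sketched in a few lines by handing the loglikelihood above to a general-purpose optimizer and comparing with glm (the simulated data and true coefficients below are illustrative; glm itself uses IRLS rather than BFGS):

```r
# Maximize the logistic loglikelihood numerically and compare with glm()
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))  # simulated binary response

negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))        # minus the loglikelihood above
}

fit_num <- optim(c(0, 0), negloglik, method = "BFGS")$par
fit_glm <- coef(glm(y ~ x, family = binomial))
rbind(fit_num, fit_glm)                    # essentially identical estimates
```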
Deviance and the Likelihood Ratio Test

• The deviance measures the difference between the fit (residuals) of a given model and that of the saturated, most parameterized model.
• To compare two nested models, use the likelihood ratio test: for a large enough dataset, the difference between their deviances approximately follows a Chi-square distribution, with degrees of freedom equal to the number of predictors removed from the larger model (e.g., 4 if we removed 4 predictors). The test output gives a p-value to compare against the chosen confidence level.
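In R the likelihood ratio test for two nested glm fits is run with anova(..., test = "Chisq"). A minimal sketch on simulated data (the variables x1, x2 and the true coefficients are made up; here x2 truly has no effect):

```r
# Likelihood ratio test for two nested logistic models
set.seed(2)
n  <- 400
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.3 + 0.9 * x1))   # x2 has no real effect

full    <- glm(y ~ x1 + x2, family = binomial)
reduced <- glm(y ~ x1, family = binomial)

# Difference in deviances ~ Chi-square, df = number of dropped predictors (1 here)
anova(reduced, full, test = "Chisq")
```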
Logistic Regression: Model Selection
• In multiple linear regression there were many selection criteria
available: R2, R2-adjusted, AIC, BIC, …
• R2 is based on the underlying assumption that you are fitting a linear
model. If you aren’t fitting a linear model, you shouldn’t use it!
• Only AIC and BIC can be adapted to the GLM with the same
formula but now using the logistic likelihood:
AIC = -2log L + 2p
BIC = -2log L + log(n) p
• Rule is the same: smaller AIC or BIC means better model.
• Note: regsubsets function from the leaps package is
applicable only for linear regression with normal errors. For GLM
you must use bestglm function from the bestglm package.
• See R code for the healthcare example
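The AIC/BIC formulas above map directly onto the base-R AIC() and BIC() functions applied to glm fits. A sketch on simulated data (variables and coefficients are illustrative):

```r
# Comparing logistic models by AIC and BIC (smaller is better)
set.seed(3)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1))

m1 <- glm(y ~ x1,      family = binomial)
m2 <- glm(y ~ x1 + x2, family = binomial)

AIC(m1, m2)   # -2*logLik + 2p for each model
BIC(m1, m2)   # -2*logLik + log(n)*p
```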