
STAT 5703

Statistical Inference and Modeling for Data Science


Dobrin Marchev

[Note: this unit covers two models, logistic and Poisson regression, starting from the exponential family, since the response is not normally distributed.]
Recall: Multivariate Exponential family

Let Y = (Y₁, …, Yₙ) be a random sample with joint pdf

$$f(\mathbf{y};\boldsymbol{\theta}) = c(\boldsymbol{\theta})\, h(\mathbf{y})\, e^{\sum_{j=1}^{k} t_j(\mathbf{y})\, q_j(\boldsymbol{\theta})}$$

where θ is a k-dimensional parameter vector.

Such a distribution is said to be in a k-parameter exponential family.
GLM

A Generalized Linear Model (GLM) extends the classic linear regression model in two ways:

1. Y|x ~ exponential family (more precisely, an exponential dispersion family, EDF).

2. A transformation between the outcome and the predictors:

$$g[E(Y \mid \mathbf{x})] = \mathbf{x}'\boldsymbol{\beta}$$

The function g(·) is called the link function; it is applied to the parameter related to the mean of the distribution.
GLM

More specifically, a GLM assumes that the response variable has an exponential dispersion model (EDM) pdf:

$$f(y_i;\theta,\phi) = a(y_i,\phi)\, e^{\frac{y_i\theta - \kappa(\theta)}{\phi}}, \quad i = 1,\dots,n$$

where ϕ is called a dispersion parameter (and could be known), and κ(θ) is called the cumulant function.

This form of the distribution is also known as the natural exponential family, because θ is the natural parameter.

Notation: Y ~ EDM(μ, ϕ), where E(Yᵢ) = μ.

Note: For a fixed value of the dispersion parameter ϕ it is a one-parameter exponential family (indexed by θ).
Example: Normal distribution

The normal distribution with unknown mean μ and variance σ²:

$$f(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} = \frac{e^{-\frac{y^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\, e^{\frac{y\mu - \mu^2/2}{\sigma^2}}$$

Match with the EDM form:

$$f(y;\theta,\phi) = a(y,\phi)\, e^{\frac{y\theta - \kappa(\theta)}{\phi}}$$

• θ = μ is the natural (aka canonical) parameter.
• $\kappa(\theta) = \frac{\mu^2}{2} = \frac{\theta^2}{2}$ is the cumulant function.
• ϕ = σ² is the dispersion (scale) parameter.
• $a(y,\phi) = \frac{e^{-y^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}$ is the normalizing function.

Other examples are: Exponential, Gamma, Binomial, …
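One more worked case (a sketch, not from the original slide, though it anticipates logistic regression below): the Bernoulli distribution with success probability p can be matched to the same EDM form.

$$f(y;p) = p^y(1-p)^{1-y} = e^{\,y\log\frac{p}{1-p} + \log(1-p)} = e^{\,y\theta - \log(1+e^{\theta})}, \quad y \in \{0,1\}$$

so $\theta = \log\frac{p}{1-p}$ is the natural parameter (the logit), $\kappa(\theta) = \log(1+e^{\theta})$ is the cumulant function, ϕ = 1, and a(y, ϕ) = 1. This is why the logit is called the canonical link for binary data.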
GLM: Moment generating and cumulant functions

Theorem: If Y ~ EDM(μ, ϕ), then its moment generating function is

$$M(t) = e^{\frac{\kappa(\theta + t\phi) - \kappa(\theta)}{\phi}}$$

and its cumulant generating function is

$$K(t) = \log M(t) = \frac{\kappa(\theta + t\phi) - \kappa(\theta)}{\phi}$$

The cumulant generating function is often easier to work with than the mgf: its first and second derivatives evaluated at 0 give the mean and the variance, and higher-order derivatives give the higher cumulants.

Example: Normal distribution, with cumulant function $\kappa(\theta) = \frac{\theta^2}{2}$:

$$K(t) = \frac{(\mu + t\sigma^2)^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} = \mu t + \frac{\sigma^2 t^2}{2}$$

which is exactly the cumulant generating function of the normal distribution: the first derivative at 0 gives the mean μ and the second derivative gives the variance σ².
GLM: Mean and variance

Theorem:

$$E(Y) = \mu = \frac{d\kappa(\theta)}{d\theta}$$

$$\mathrm{Var}(Y) = \phi\,\frac{d^2\kappa(\theta)}{d\theta^2}$$

where

$$\frac{d^2\kappa(\theta)}{d\theta^2} = \frac{d}{d\theta}\!\left(\frac{d\kappa(\theta)}{d\theta}\right) = \frac{d\mu}{d\theta}$$
Since $\frac{d\mu}{d\theta}$ can also be written as a function of the mean, we can define the variance function

$$V(\mu) = \frac{d\mu}{d\theta} \;\Rightarrow\; \mathrm{Var}(Y) = \phi\,V(\mu)$$
Example: Normal distribution, with $\kappa(\theta) = \frac{\theta^2}{2}$:

$$E(Y) = \frac{d}{d\theta}\frac{\theta^2}{2} = \theta = \mu, \qquad V(\mu) = 1 \;\Rightarrow\; \mathrm{Var}(Y) = \sigma^2$$
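The same recipe works for the other families on the list. For instance (a sketch, not from the slides), for the Poisson distribution:

$$f(y;\lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \frac{1}{y!}\, e^{\,y\log\lambda - \lambda}$$

so $\theta = \log\lambda$, $\kappa(\theta) = e^{\theta}$, ϕ = 1, and

$$E(Y) = \frac{d\kappa}{d\theta} = e^{\theta} = \lambda = \mu, \qquad V(\mu) = \frac{d^2\kappa}{d\theta^2} = e^{\theta} = \mu \;\Rightarrow\; \mathrm{Var}(Y) = \mu$$

recovering the familiar fact that the Poisson mean and variance are equal.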
GLM: Unit deviance

Suppose we want to write the EDM $f(y;\theta,\phi) = a(y,\phi)\,e^{\frac{y\theta - \kappa(\theta)}{\phi}}$ as a function of the mean μ. Denote

$$t(y,\mu) = y\theta - \kappa(\theta)$$

where θ is viewed as a function of μ. Then t(y, μ) has a unique maximum w.r.t. μ at μ = y. This allows us to define a very important quantity, the unit deviance:

$$d(y,\mu) = 2\,\bigl[\,t(y,y) - t(y,\mu)\,\bigr]$$

which is always nonnegative and measures the discrepancy between the observed y and the predicted μ.

Notice that d(y, μ) = 0 only when y = μ and otherwise d(y, μ) > 0. In fact, d(y, μ) increases as μ moves away from y in either direction. This shows that d(y, μ) can be interpreted as a type of distance measure between y and μ.
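For the normal distribution this recovers squared error (a quick check, not on the slide): with $t(y,\mu) = y\mu - \mu^2/2$,

$$d(y,\mu) = 2\left[\left(y^2 - \tfrac{y^2}{2}\right) - \left(y\mu - \tfrac{\mu^2}{2}\right)\right] = y^2 - 2y\mu + \mu^2 = (y-\mu)^2$$

so the unit deviance of the normal model is exactly the squared residual.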
GLM: Summary

When to use a GLM? Previously we assumed the data were normal; a GLM lets the response distribution (and its natural parameter space) match the data:

• Continuous, real-valued data: normal.
• Binary data (values are 0 or 1): binomial.
• Counts: Poisson.
• Skewed, positive data: gamma.
Classification and Logistic Regression

The linear regression model assumes the response variable Y is quantitative (numerical) and the error terms are normally distributed. If, instead, the response variable is qualitative (or categorical), the task of predicting responses is aka classification, a broad topic that also includes methods such as SVMs. In such cases the error terms are not normally distributed.

Examples of classification problems:

• Your email service determines whether to label an incoming message as "spam" or not based on the text of the email, the subject, and your history of interaction with the sender.
• Hand-written zip codes are scanned and stored as an image file and a computer is programmed to classify each digit as a "0", "1", "2", …, "9".
Linear Regression Approach for Binary Response

• Consider Yᵢ ~ Bernoulli(πᵢ), where πᵢ = P(Yᵢ = 1 | Xᵢ = xᵢ).

• We could use a linear probability model (LPM) with the usual assumptions:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{i,p} + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2)$$

• The expected values for the linear probability model from the RHS are:

$$E(Y_i \mid X_i = x_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{i,p}$$

• But for a Bernoulli outcome, from the LHS we have:

$$E(Y_i \mid X_i = x_i) = \pi_i = P(Y_i = 1 \mid X_i = x_i)$$

which in reality is a probability, while the RHS cannot be restricted to the interval from 0 to 1.
Problems with Linear Probability Model

• Predicted values from linear regression are not constrained to the interval [0,1], even though probabilities should be. In general, the LPM is not a good idea: it violates other regression assumptions as well, as discussed next.
Problems with Linear Regression (Continued)

• Predicted values from linear regression are not constrained to the interval [0,1], even though probabilities should be.

• Residuals from the linear model are not normally distributed, because conditional on X = x they are dichotomous (discrete with two values):

$$\varepsilon_i = \begin{cases} -p(x) & \text{if } Y_i = 0 \\ 1 - p(x) & \text{if } Y_i = 1 \end{cases}$$

• Also, the variance of εᵢ is p(x)[1 − p(x)], which means it is not constant: the outcomes are not homogeneous.

• Furthermore, if the outcome has more than two categories, linear regression becomes impossible unless they are ordered and we assume the distances between all categories are identical.

• We need a method that deals with the above-mentioned deficiencies. A GLM does so through its link function: instead of restricting the predictions directly, it transforms the function of the predictors (the RHS).
Logistic Regression: Foundation

• Assume the outcome Y is coded as 0/1. Our goal is to specify a model for

$$p(X) = \Pr(Y = 1 \mid X = x)$$

• In linear regression we use a linear model in the covariates X₁, …, X_p:

$$p(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

• The model above, however, has a range equal to (−∞, ∞). To restrict our range to (0,1), we need a function f : (−∞, ∞) → (0,1). Ideally, the function will be simple to write, continuous, and monotonic. The standard logistic function is a prime candidate:

$$f(t) = \frac{\exp(t)}{1 + \exp(t)}$$

Note the numerator is always smaller than the denominator, so 0 < f(t) < 1.
Logistic/Sigmoid Function

The sigmoid function resembles an S-shaped curve. It takes real-numbered input values and converts them to values between 0 and L, shrinking from both sides: very negative inputs map near 0 and very high positive ones near L. Note that it can be decreasing as well. In the general form, a location parameter acts like an intercept and a growth parameter controls how fast the curve moves between 0 and L.
Sigmoidal Response Functions

• A sigmoid function is a function having a characteristic "S"-shaped curve or sigmoid curve. (Sigmoids also appear in neural networks, as a flexible base for more complicated models.)

• The logistic function is a prime candidate; fitting a binomial model with it gives results that are easy to interpret:

$$f(t) = \frac{\exp(t)}{1 + \exp(t)}$$

• An alternative is the probit function: Φ(t), where Φ is the cdf of the standard normal distribution.

• Theoretically, any cumulative distribution function can serve as a response function.
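A quick way to see the two response functions side by side (a minimal R sketch; base R's plogis() and pnorm() are the logistic function and the standard normal cdf):

t <- seq(-4, 4, by = 0.01)
plot(t, plogis(t), type = "l", ylab = "f(t)",
     main = "Sigmoidal response functions")   # logistic curve
lines(t, pnorm(t), lty = 2)                   # probit: standard normal cdf
legend("topleft", c("logistic", "probit"), lty = 1:2)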
Logistic Regression: Foundation

• Applying the logistic function to the linear transformation of the predictors gives three equivalent formulations. They let us interpret coefficients and switch between probability, odds, and log-odds (see the numeric check after this list).

1. Probability:

$$p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}$$

2. Odds:

$$\frac{p(X)}{1 - p(X)} = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$

This is no longer an additive model: a one-unit increase in X_j multiplies the odds by the extra factor $e^{\beta_j}$, a multiplicative effect often quoted as a percentage change in the odds. The odds of the event are easier to interpret than the probability.

3. Log(odds), or "logit" (log-transform both sides):

$$\log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

On this scale the model is additive: each one-unit increase in X_j bumps the log-odds by β_j.

Exercise: Show algebraically that 3 implies 1. (This should be easy.)
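A numeric sanity check of the three forms (a sketch in R with made-up coefficients, not the bank model below):

b0 <- -2; b1 <- 0.5
x   <- 1.3
eta  <- b0 + b1 * x                 # form 3: the logit (log-odds)
p    <- exp(eta) / (1 + exp(eta))   # form 1: the probability
odds <- p / (1 - p)                 # form 2: the odds
all.equal(log(odds), eta)           # TRUE: the three forms agree
# Increasing x by 1 multiplies the odds by exp(b1):
odds2 <- exp(b0 + b1 * (x + 1))
all.equal(odds2 / odds, exp(b1))    # TRUE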
Example: Bank Marketing

Data in 15-bank-full.csv come from here:

https://archive.ics.uci.edu/dataset/222/bank+marketing

• The data are related to direct marketing campaigns of a Portuguese banking institution: people were offered a product, and the outcome is whether they agreed to buy it. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required to assess whether the product (a bank term deposit) would be ('yes') or would not be ('no') subscribed.

• Explanatory variables thought to possibly affect the outcome were:
  • Job occupation: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'
  • Housing: has a housing loan or not
  • Balance: average annual balance
  • Marital status
  • …

The data has 45,211 rows.
Logistic Regression Example

Use the "glm" function to fit GLM models. The "family" option determines the distribution to be used; everything downstream depends on the resulting likelihood. Choices are:

• binomial(link = "logit")
• gaussian(link = "identity")
• Gamma(link = "inverse")
• inverse.gaussian(link = "1/mu^2")
• poisson(link = "log")
• quasi(link = "identity", variance = "constant")
• quasibinomial(link = "logit")
• quasipoisson(link = "log")

See R code.
Logistic Regression Example: Some Results

> logit0 = glm(y ~ balance, data = bank, family = "binomial")
> summary(logit0)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.081e+00  1.595e-02 -130.50   <2e-16 ***
balance      3.958e-05  3.840e-06   10.31   <2e-16 ***

Equation: log(odds of subscribing) = −2.081 + 0.00003958·balance

Interpretation of the slope:

• If the balance increases by 1 euro, then the log of the odds (of subscribing) will increase by 0.00003958 units.
• Equivalent (and better!): the odds will be multiplied by e^0.00003958 ≈ 1.00004 for each extra 1 euro of balance.
• Final interpretation: for each extra 1000 euros of balance, the odds of obtaining the product increase by about 4% (e^0.03958 ≈ 1.04).
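These numbers can be pulled straight from the fitted object (a sketch, assuming the bank data frame and logit0 fit from above):

b <- coef(logit0)["balance"]   # estimated slope on the log-odds scale
exp(b)                         # odds multiplier per extra euro        (~1.00004)
exp(1000 * b)                  # odds multiplier per extra 1000 euros  (~1.04)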
Logistic Regression Example: Dichotomous Covariate

R output:

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.60687    0.01892  -84.93   <2e-16 ***
housingyes  -0.87696    0.03030  -28.95   <2e-16 ***

$$\log\frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)} = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

A negative coefficient means the odds decrease as the variable increases: clients with a mortgage are less likely to subscribe. The model compares the odds for those who have a housing loan versus those who do not.
Logistic Regression Example: Dichotomous Covariate

• Let's use only the housing variable:

> logit0 = glm(y ~ housing, data = bank, family = "binomial")
> summary(logit0)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.60687    0.01892  -84.93   <2e-16 ***
housingyes  -0.87696    0.03030  -28.95   <2e-16 ***

• Report the regression equation:

$$\log\frac{\hat{p}}{1 - \hat{p}} = -1.60687 - 0.87696 \times \text{Housing}$$
Logistic Regression Example: Dichotomous Covariate

Compare the equation (which is on the log-odds scale) to cross-tabs results:

• No housing loan: probability of subscribing = 16.7%
• With housing loan: probability of subscribing = 7.7%

$$\log\frac{\hat{p}}{1 - \hat{p}} = -1.60687 - 0.87696 \times \text{Housing}$$

Calculate the probability, odds, and log-odds for each category. To go from log-odds to odds, exponentiate; to go from log-odds to probability, apply the sigmoid function. (A numeric sketch follows below.)

              No housing loan   Housing loan
Probability         ?                ?
Odds                ?                ?
Log-odds            ?                ?
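One way to fill in the table from the fitted coefficients (a sketch; plogis() is the sigmoid, so it converts log-odds to probabilities):

b0 <- -1.60687; b1 <- -0.87696
logodds <- c(no_housing = b0, housing = b0 + b1)   # log-odds per group
odds    <- exp(logodds)                            # exponentiate: ~0.2005, ~0.0834
prob    <- plogis(logodds)                         # sigmoid: ~0.167, ~0.077
rbind(prob, odds, logodds)                         # the completed table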
Logistic Regression Example: Dichotomous Covariate

• The odds can be interpreted as the relative risk of an event A happening versus not happening (this interpretation is IMPORTANT). That is,

$$odds(A) = \frac{prob(A)}{1 - prob(A)}$$

Odds = 0.2 = 1:5 means the likelihood of subscribing to the service is 0.2 times the likelihood of not subscribing, within the no-housing-loan group. Or, you can say that not subscribing is 5 times more likely than subscribing within the no-housing group.

Exercise: Interpret the housing loan = yes group odds.

• The odds ratio is the ratio of odds for cases with different x values.

Odds ratio = 0.200514 / 0.083423 = 2.4036

This means that in the absence of a house loan, the odds of subscribing are 2.4 times higher compared to the group which has a house loan. (Note this equals e^0.87696, the exponentiated magnitude of the housing coefficient.)
Logistic Regression: Estimating Betas

• Just as in simple regression, the coefficients β₀ and β₁ are unknown and must be estimated using the available data. Maximum likelihood is the most commonly used method for this problem.

• When $Y_i \mid x_i \sim \mathrm{Bernoulli}(p(x_i))$, the likelihood function is:

$$L(\beta_0,\beta_1) = \prod_{i=1}^{n} p(x_i)^{y_i}\,\bigl[1 - p(x_i)\bigr]^{1-y_i}$$

Since $\frac{p(x_i)}{1-p(x_i)} = e^{\beta_0+\beta_1 x_i} \;\Rightarrow\; 1 - p(x_i) = \frac{1}{1+e^{\beta_0+\beta_1 x_i}}$, and therefore

$$L(\beta_0,\beta_1) = \prod_{i=1}^{n} e^{(\beta_0+\beta_1 x_i)\,y_i}\,\frac{1}{1+e^{\beta_0+\beta_1 x_i}}$$

$$\Rightarrow\; \ell(\beta_0,\beta_1) = \sum_{i=1}^{n} y_i(\beta_0+\beta_1 x_i) - \sum_{i=1}^{n}\log\!\left(1+e^{\beta_0+\beta_1 x_i}\right)$$

(Setting the gradient to zero cannot be solved analytically.)
Logistic Regression: Estimating Betas

• The logistic regression loglikelihood is:

$$\ell(\beta_0,\beta_1) = \sum_{i=1}^{n} y_i(\beta_0+\beta_1 x_i) - \sum_{i=1}^{n}\log\!\left(1+e^{\beta_0+\beta_1 x_i}\right)$$

$$\Rightarrow\; \frac{\partial \ell(\beta_0,\beta_1)}{\partial \beta_0} = \sum_{i=1}^{n}\left(y_i - \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}\right)$$

• Unlike the closed-form analytical solution (i.e., the normal equations) available for linear regression, the score function for logistic regression is transcendental. That is, there is no closed-form solution, so we must solve with numerical optimization methods such as Newton's method (IRLS), which is done in several iterations. (A sketch of the iteration appears below.)

• Although we have focused on simple logistic regression, this likelihood (and score) function generalize directly to the case of p predictors.
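For concreteness, a bare-bones IRLS sketch in R on simulated data (an illustration of the iteration glm performs internally; the data and starting values here are made up):

# Simulate simple logistic data.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))
X <- cbind(1, x)                       # design matrix with intercept

beta <- c(0, 0)                        # starting values
for (iter in 1:25) {
  eta <- X %*% beta                    # linear predictor
  p   <- plogis(eta)                   # fitted probabilities
  w   <- as.vector(p * (1 - p))        # IRLS weights = Var(Y_i)
  z   <- eta + (y - p) / w             # working response
  beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # weighted LS step
  if (max(abs(beta_new - beta)) < 1e-8) break
  beta <- beta_new
}
cbind(irls = as.vector(beta),
      glm  = coef(glm(y ~ x, family = binomial)))  # the two should match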
Deviance

• The deviance compares the residual fit of different models against the most heavily parameterised benchmark.

• It measures the deviance of the fitted generalized linear model with respect to a perfect model for the sample.

• This perfect model, known as the saturated model, is the model that perfectly fits the data, in the sense that the fitted responses equal the observed responses. Our working model is, by comparison, under-parameterised.

• For a linear model, the deviance is the sum of squared errors (SSE) and D₀ is the total sum of squares (SST).
Deviance (continued)

• The deviance for a GLM m is defined as

$$D(\mathbf{y}, m) = -2\left[\ell(\hat{\boldsymbol{\beta}}) - \ell_S(\phi)\right]$$

where $\ell_S(\phi)$ is the loglikelihood of the saturated model.

• It measures how far the fitted generalized linear model (whatever model we want to fit) is from a perfect model for the sample: the better the fit, the closer the fitted likelihood gets to the saturated model's.

• This perfect model, known as the saturated model, is the model that perfectly fits the data, in the sense that the fitted responses equal the observed responses.

• In the linear case, D = SSE.

• In R, the "Residual Deviance" is two times the difference in the loglikelihood of the saturated (perfect) model and our model. It is the gap between the best possible fit and ours, so smaller deviance = better model.

• In R, Null Deviance = 2[LL(Saturated Model) − LL(Null Model)].

• Deviance differences behave like a likelihood ratio test statistic.

• Note, deviance is not equal to the KL distance,

• but $D(\mathbf{y}, m_2) - D(\mathbf{y}, m_1) = KL(\mathbf{y}, m_2) - KL(\mathbf{y}, m_1)$.

• See the Hastie (1987) article in The American Statistician for more details.
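In R these quantities are directly available from the fitted object (a sketch, reusing the bank housing model from above):

logit0 <- glm(y ~ housing, data = bank, family = "binomial")
deviance(logit0)        # residual deviance: 2 * [LL(saturated) - LL(logit0)]
logit0$null.deviance    # null deviance:     2 * [LL(saturated) - LL(null model)]
logLik(logit0)          # loglikelihood of the fitted model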
Likelihood Ratio Test

• Compares two nested models: refit the model with a reduced set of predictors (the reduced model must be nested in the larger one, on the same data). You don't have to remove an entire predictor at a time.

• The test needs the two deviances: the drop in deviance is compared to a chi-square distribution whose degrees of freedom equal the number of parameters removed (e.g., 4 if we removed 4 predictors).

• This can be run with R's anova function, or done manually from the deviances; a sketch follows below.
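A sketch of the comparison in R (hypothetical model formulas using variables from the bank data; anova(..., test = "Chisq") runs the likelihood ratio test for nested glm fits):

full    <- glm(y ~ balance + housing + marital, data = bank, family = "binomial")
reduced <- glm(y ~ balance, data = bank, family = "binomial")
anova(reduced, full, test = "Chisq")   # LRT: deviance drop vs chi-square

# Or manually: difference in deviances and its chi-square p-value.
stat <- deviance(reduced) - deviance(full)
df   <- df.residual(reduced) - df.residual(full)
pchisq(stat, df = df, lower.tail = FALSE)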
Logistic Regression: Model Selection

• In multiple linear regression there were many selection criteria available: R², R²-adjusted, AIC, BIC, …
• R² is based on the underlying assumption that you are fitting a linear model. If you aren't fitting a linear model, you shouldn't use it!
• Only AIC and BIC can be adapted to the GLM with the same formulas, but now using the logistic likelihood:

AIC = −2 log L + 2p
BIC = −2 log L + log(n)·p

• The rule is the same: smaller AIC or BIC means a better model.
• Note: the regsubsets function from the leaps package is applicable only for linear regression with normal errors. For GLM you must use the bestglm function from the bestglm package.
• See R code for the healthcare example.
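For instance (a sketch reusing hypothetical bank fits; AIC() and BIC() are base R and accept several fitted models at once):

m1 <- glm(y ~ balance, data = bank, family = "binomial")
m2 <- glm(y ~ balance + housing, data = bank, family = "binomial")
AIC(m1, m2)   # -2 logL + 2p for each model
BIC(m1, m2)   # -2 logL + log(n) p; smaller is better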

