07 GLM
Y = β0 + β1X1 + β2X2 + …
Linear regression
z <- lm(y ~ x) # x is numeric
log(μ / (1 − μ)) = η = β0 + β1X1 + β2X2 + …
μ is Greek mu; η is Greek eta.
The inverse function is μ = e^η / (1 + e^η)
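A quick numerical check of the inverse-logit formula above; R's built-in plogis() computes the same function:

```r
# Inverse logit: mu = e^eta / (1 + e^eta)
eta <- 1.2
mu <- exp(eta) / (1 + exp(eta))
mu  # identical to plogis(eta), R's built-in inverse logit
```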
Example 1: Fit a constant to 0-1 data (estimate a proportion)
This example was used previously in the Likelihood lecture. My goal here is to connect what glm() does with what we did by brute force previously.
Fatouros et al. (2005) carried out trials to determine whether the wasps can
distinguish mated female butterflies from unmated females. In each trial a single
wasp was presented with two female cabbage white butterflies, one a virgin
female, the other recently mated.
Y = 23 successes
n = 32 trials
Goal: estimate p
Use glm() to fit a constant, and so obtain the ML estimate of p
Fits a model having only a constant. Use the link function appropriate for binary
data:
log(μ / (1 − μ)) = β
μ here refers to the population proportion (p) but let’s stick with μ symbol here to
use consistent notation for generalized linear models.
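A minimal sketch of the fit, reconstructing the 0-1 data from the totals above (23 successes, 9 failures):

```r
# 23 successes out of 32 trials, coded as 0-1 data
y <- c(rep(1, 23), rep(0, 9))

# Fit a constant only, with the logit link appropriate for binary data
z <- glm(y ~ 1, family = binomial(link = "logit"))

coef(z)          # beta-hat, on the logit (log-odds) scale
plogis(coef(z))  # back-transformed ML estimate: mu-hat = 23/32
```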
μ̂ = e^β̂ / (1 + e^β̂)
Use summary() for estimation
summary(z)
2.5 % 97.5 %
0.5501812 0.8535933
0.550 ≤ p ≤ 0.853 is the same result we obtained last week for likelihood-based confidence intervals computed from the likelihood function (just with more decimal places this week).
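A sketch of how an interval like the one above can be obtained, assuming it is a profile-likelihood interval from confint() back-transformed to the proportion scale (confint() on a glm object profiles the likelihood):

```r
library(MASS)  # provides the profile confint method for glm in older R versions

y <- c(rep(1, 23), rep(0, 9))
z <- glm(y ~ 1, family = binomial)

ci <- confint(z)  # profile-likelihood interval on the logit scale
plogis(ci)        # back-transform to the proportion (p) scale
```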
Avoid using summary() for hypothesis testing
summary(z)
The z-value (Wald statistic) and P-value test the null hypothesis that β = 0. This is
the same as a test of the null hypothesis that the true (population) proportion
μ = 0.5, because
μ = e^0 / (1 + e^0) = 0.5
Agresti (2002, Categorical data analysis, 2nd ed., Wiley) says that for small to
moderate sample size, the Wald test is less reliable than the log-likelihood ratio
test.
Use anova() to test hypotheses
Last week we calculated the log-likelihood ratio test for these data “by hand”.
Here we’ll use glm() to accomplish the same task.
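One way to carry out that log-likelihood ratio test with glm() (a sketch, not necessarily the exact commands from the lecture): a model with no parameters fixes the linear predictor at 0, i.e. p = 0.5, and is compared with the model that estimates p.

```r
y <- c(rep(1, 23), rep(0, 9))

z0 <- glm(y ~ 0, family = binomial)  # null: linear predictor fixed at 0, so p = 0.5
z1 <- glm(y ~ 1, family = binomial)  # alternative: p estimated from the data

anova(z0, z1, test = "Chisq")        # log-likelihood ratio (deviance) test
```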
g(μ) = β0 + β1X
Linear predictor (right side of equation) is like an ordinary linear regression, with
intercept b0 and slope b1
g(μ) = β0 + β1X
summary(z)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.74452 0.69206 -2.521 0.01171 *
concentration 0.03643 0.01119 3.255 0.00113 **
The numbers in the Estimate column (shown in red on the slide) are the estimates of b0 and b1 (intercept and slope), which predict log(μ / (1 − μ)).
Number of Fisher Scoring iterations refers to the number of iterations used before
the algorithm used by glm() converged on the maximum likelihood solution.
The generalized linear model
Use

visreg(z, scale = "response")

to get the fitted curve and confidence bands on the original scale.
The generalized linear model
LD50 = −intercept / slope = −β̂0 / β̂1 = −(−1.74452) / 0.03643 = 47.88
library(MASS)
dose.p(z)
Dose SE
p = 0.5: 47.8805 8.168823
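A self-contained sketch with hypothetical dose-response data (not the original dataset), showing that the hand calculation and dose.p() agree:

```r
library(MASS)  # for dose.p()

# Hypothetical data: 8 individuals tested at each concentration
concentration <- c(10, 20, 30, 40, 50, 60, 70, 80)
dead  <- c(0, 1, 1, 3, 4, 5, 6, 7)
alive <- 8 - dead

z <- glm(cbind(dead, alive) ~ concentration, family = binomial)

# LD50 by hand: -intercept / slope
ld50 <- -coef(z)[[1]] / coef(z)[[2]]

dose.p(z)  # same estimate, with a standard error
```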
Advantages of generalized linear models
In the second case, analyze the summary statistic (fraction surviving) with lm(). Or fit a generalized linear mixed model using glmer() in the lme4 package.
Assumptions of generalized linear models
[Image: Song Sparrow, Wikimedia Commons]
Two solutions:
1. Transform the data: X′ = log(X + 1)
2. Fit a generalized linear model with an appropriate error distribution for counts (Poisson regression, below).
Log-linear regression (a.k.a. Poisson regression) uses the log link function
𝜂 is the response variable on the log scale (here, mean of each group on log scale).
μ̂ = e^η̂
Uses a large-sample approximation (this is why degrees of freedom, df, are shown
as infinite). These confidence limits might not be accurate for small sample sizes.
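A sketch with hypothetical count data (not the original song sparrow data), fitting year means on the log scale and back-transforming:

```r
# Hypothetical counts of offspring per female in each of five years
year <- factor(rep(1975:1979, each = 4))
offspring <- c(1, 0, 2, 1,  3, 4, 2, 5,  2, 3, 4, 3,  3, 2, 4, 3,  1, 2, 0, 1)

z <- glm(offspring ~ year, family = poisson)
summary(z)

# Back-transform: the estimated mean for 1975 is exp(intercept);
# with a single factor, this equals the observed group mean
exp(coef(z)[[1]])
```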
Use anova() to test hypotheses
The analysis of deviance table gives the log-likelihood ratio test of the null hypothesis that there are no differences among years in mean number of offspring.
anova(z, test="Chisq")
As with lm(), terms are tested using model comparison (always a "full" vs. "reduced" model). The default is to fit terms sequentially ("Type 1 sums of squares"), just as with lm().
Evaluating assumptions of the glm() fit
Do the variances of the residuals correspond to those assumed by the chosen link
function?
A Poisson glm() fit (log link, family = poisson) assumes that the Y values are Poisson distributed at each X. A key property of the Poisson distribution is that within each treatment group the variance and mean are equal (i.e., the glm() dispersion parameter = 1). But real data rarely show this.
Evaluating assumptions of the glm() fit
A central property of the Poisson distribution is that the variance and mean are
equal (i.e., the glm() dispersion parameter = 1).
In the workshop we will analyze an example where the problem is more severe than
in the case of the song sparrow data here.
Modeling excessive variance
The glm() procedure for modeling over- (or under-) dispersion uses the observed relationship between the mean and variance rather than an explicit probability distribution for the data. In the case of count data, use family = quasipoisson:
summary(z)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.24116 0.29649 0.813 0.41736
year1976 1.03977 0.34942 2.976 0.00344 **
year1977 0.96665 0.31946 3.026 0.00295 **
year1978 0.97700 0.31076 3.144 0.00203 **
year1979 -0.03572 0.32479 -0.110 0.91259
The dispersion parameter is reasonably close to 1 for these data. But typically it is much
larger than 1 for count data, so I recommend using family = quasipoisson.
Modeling excessive variance
The point estimates are identical with those obtained using family=poisson
instead, but the standard errors (and resulting confidence intervals) are wider.
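A sketch (reusing the hypothetical count data from the Poisson example) comparing the two families: point estimates match, but quasipoisson estimates a dispersion parameter from the data and inflates the standard errors accordingly:

```r
year <- factor(rep(1975:1979, each = 4))
offspring <- c(1, 0, 2, 1,  3, 4, 2, 5,  2, 3, 4, 3,  3, 2, 4, 3,  1, 2, 0, 1)

zp <- glm(offspring ~ year, family = poisson)
zq <- glm(offspring ~ year, family = quasipoisson)

all.equal(coef(zp), coef(zq))  # identical point estimates
summary(zq)$dispersion         # estimated dispersion parameter
```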
anova(z, test="Chi")
glm() can handle data having other probability distributions than the ones used in
my examples, including exponential and gamma distributions.
Discussion paper for next week: