W. Holmes Finch, Jocelyn E. Bolin, and Ken Kelley. Multilevel Modeling Using R, 3rd Edition. Chapman & Hall/CRC (2024). Statistics in the Social and Behavioral Sciences Series.
After reviewing standard linear models, the authors present the basics of
multilevel models and explain how to fit these models using R. They then
show how to employ multilevel modeling with longitudinal data and
demonstrate the valuable graphical options in R. The book also describes
models for categorical dependent variables in both single-level and multilevel
data.
The third edition of the book includes several new topics that were not
present in the second edition. Specifically, a new chapter has been included,
focusing on fitting multilevel latent variable modeling in the R environment.
With R, it is possible to fit a variety of latent variable models in the multilevel
context, including factor analysis, structural models, item response theory,
and latent class models. The third edition also includes new sections in
Chapter 11 describing two useful alternatives to standard multilevel models,
fixed effects models and generalized estimating equations. These approaches
are particularly useful with small samples and when the researcher
is interested in modeling the correlation structure within higher-level units
(e.g., schools). The third edition also includes a new section on mediation
modeling in the multilevel context in Chapter 11.
W. Holmes Finch
Jocelyn E. Bolin
Third edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish in this form has not been
obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage
or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-
750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used
only for identification and explanation without intent to infringe.
DOI: 10.1201/b23166
Typeset in Palatino
by MPS Limited, Dehradun
Contents
Preface......................................................................................................................ix
About the Authors ................................................................................................xi
1. Linear Models.................................................................................................1
Simple Linear Regression.............................................................................. 2
Estimating Regression Models with Ordinary Least Squares................ 2
Distributional Assumptions Underlying Regression ............................... 3
Coefficient of Determination ........................................................................ 4
Inference for Regression Parameters........................................................... 5
Multiple Regression ....................................................................................... 7
Example of Simple Linear Regression by Hand....................................... 9
Regression in R ............................................................................................. 11
Interaction Terms in Regression ................................................................ 14
Categorical Independent Variables ........................................................... 15
Checking Regression Assumptions with R.............................................. 18
Summary ........................................................................................................ 21
Additional Options....................................................................................... 57
Parameter Estimation Method............................................................ 57
Estimation Controls.............................................................................. 58
Comparing Model Fit .......................................................................... 58
Lme4 and Hypothesis Testing............................................................ 59
Summary ........................................................................................................ 63
References............................................................................................................ 318
Index .....................................................................................................................322
Preface
The goal of this third edition of the book is to provide you, the reader, with
a comprehensive resource for the conduct of multilevel modeling using the
R software package. Multilevel modeling, sometimes referred to as
hierarchical modeling, is a powerful tool that allows the researcher to
account for data collected at multiple levels. For example, an educational
researcher might gather test scores and measures of socioeconomic status
(SES) for students who attend a number of different schools. The students
would be considered level-1 sampling units, and the schools would be
referred to as level-2 units. Ignoring the structure inherent in this type of
data collection can, as we discuss in Chapter 2, lead to incorrect parameter
and standard error estimates. In addition to modeling the data structure
correctly, we will see in the following chapters that the use of multilevel
models can also provide us with insights into the nature of relationships in
our data that might otherwise not be detected. After reviewing standard
linear models in Chapter 1, we will turn our attention to the basics of
multilevel models in Chapter 2, before learning how to fit these models
using the R software package in Chapters 3 and 4. Chapter 5 focuses on the
use of multilevel modeling in the case of longitudinal data, and Chapter 6
demonstrates the very useful graphical options available in R, particularly
those most appropriate for multilevel data. Chapters 7 and 8 describe
models for categorical dependent variables, first for single-level data, and
then in the multilevel context. In Chapter 9, we describe an alternative to
standard maximum likelihood estimation of multilevel models in the form
of the Bayesian framework. Chapter 10 moves the focus from models for
observed variables and places it on dealing with multilevel structure in the
context of models in which the variables of interest are latent or
unobserved. In this context, we deal with multilevel models for factor
analysis, structural equation models, item response theory, and latent class
analysis. We conclude the book with two chapters dealing with advanced
topics in multilevel modeling such as fixed effects models, generalized
estimating equations, mediation for multilevel data, penalized estimators,
and nonlinear relationships, as well as robust estimators, outlier detection,
prediction of level-2 outcomes with level-1 variables, and power analysis for
multilevel models.
The third edition of the book includes several new topics that were not
present in the second edition. Specifically, we have included a new chapter (10)
focused on fitting multilevel latent variable modeling in the R environment.
With R, it is possible to fit a variety of latent variable models in the multilevel
context, including factor analysis, structural models, item response theory,
and latent class models. The third edition also includes new sections in Chapter 11 describing two useful alternatives to standard multilevel models, fixed effects models and generalized estimating equations. These approaches are particularly useful with small samples and when the researcher is interested in modeling the correlation structure within higher-level units (e.g., schools). Chapter 11 also includes a new section on mediation modeling in the multilevel context.
1
Linear Models
DOI: 10.1201/b23166-1
Readers already familiar with linear regression and using R to conduct such analyses may elect to skip this chapter with no loss of understanding in future chapters.
yi = β0 + β1xi + εi (1.1)

ŷi = b0 + b1xi (1.2)
from the population. Several methods exist in the statistical literature for obtaining estimated values of the regression model parameters (b0 and b1, respectively) given a set of x and y values. By far, the most popular and widely
used of these methods is ordinary least squares (OLS). A vast number of
other approaches are useful in special cases involving small samples or data
that do not conform to the distributional assumptions undergirding OLS.
The goal of OLS is to minimize the sum of the squared differences between
the observed values of y and the model predicted values of y, across the
sample. This difference, known as the residual, is written as
ei = yi − ŷi (1.3)

Σ(i=1 to n) ei² = Σ(i=1 to n) (yi − ŷi)² (1.4)
The actual mechanism for finding the linear equation that minimizes the sum of squared residuals involves taking the partial derivatives of the sum of squares function with respect to the model coefficients, β0 and β1. We will leave these
mathematical details to excellent references such as Fox (2016). It should be
noted that in the context of simple linear regression, the OLS criteria reduce to
the following equations, which can be used to obtain b0 and b1 as
b1 = r(sy/sx) (1.5)

and

b0 = ȳ − b1x̄ (1.6)
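As a quick numerical check, equations (1.5) and (1.6) can be verified against R's built-in lm() function. The data below are simulated purely for illustration and are not from any dataset used in this book.

```r
# Simulated data for illustration only
set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 5)

# OLS estimates computed directly from equations (1.5) and (1.6)
b1 <- cor(x, y) * sd(y) / sd(x)   # b1 = r * (sy / sx)
b0 <- mean(y) - b1 * mean(x)      # b0 = ybar - b1 * xbar

# The same estimates produced by lm()
coef(lm(y ~ x))
c(b0, b1)
```

Running both lines at the end should show identical intercept and slope values, since lm() uses the same OLS criterion.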
Several distributional assumptions must hold true based on the sample data in order for the results of a regression analysis to be trustworthy. The first assumption that must hold true for
linear models to function optimally is that the relationship between yi and xi is
linear. If the relationship is not linear, then clearly an equation for a line will
not provide adequate fit and the model is thus misspecified. The second
assumption is that the variance in the residuals is constant regardless of the
value of xi. This assumption is typically referred to as homoscedasticity and is
a generalization of the homogeneity of error variance assumption in ANOVA.
Homoscedasticity implies that the variance of yi is constant across values of xi.
The distribution of the dependent variable around the regression line is
literally the distribution of the residuals, thus making clear the connection of
homoscedasticity of errors with the distribution of yi around the regression
line. The third assumption is that the residuals are normally distributed in the
population. Fourth, it is assumed that the independent variable x is measured
without error and that it is unrelated to the model error term, ε. It should be
noted that the assumption of x measured without error is not as strenuous as
one might first assume. In fact, for most real-world problems, the model will
work well even when the independent variable is not error-free (Fox, 2016).
Fifth and finally, the residuals for any two individuals in the population are
assumed to be independent of one another. This independence assumption
implies that the unmeasured factors influencing y are not related from one
individual to another. It is this assumption that is directly addressed with the
use of multilevel models, as we will see in Chapter 2. In many research
situations, individuals are sampled in clusters such that we cannot assume
that individuals from the same such cluster will have uncorrelated residuals.
For example, if samples are obtained from multiple neighborhoods, individuals within the same neighborhoods may tend to be more like one another
than they are like individuals from other neighborhoods. A prototypical
example of this is children within schools. Due to a variety of factors, children
attending the same school will often have more in common with one another
than they do with children from other schools. These “common” things might
include neighborhood socioeconomic status, school administration policies,
and school learning environment, to name just a few. Ignoring this clustering,
or not even realizing it is a problem, can be detrimental to the results of
statistical modeling. We explore this issue in great detail later in the book, but
for now we simply want to mention that a failure to satisfy the assumption of
independent errors is (a) a major problem but (b) often something that can be
overcome with appropriate models such as multilevel models that explicitly
consider the nesting of the data.
Coefficient of Determination
When the linear regression model has been estimated, researchers generally
want to measure the relative magnitude of the relationship between the independent and dependent variables. A common measure for this purpose is the coefficient of determination, R², which compares the regression sum of squares (SSR) and error sum of squares (SSE) to the total sum of squares (SST):

R² = SSR/SST = Σ(ŷi − ȳ)²/Σ(yi − ȳ)² = 1 − SSE/SST (1.7)
The terms in equation (1.7) are as defined previously. The value of this
statistic always lies between 0 and 1, with larger numbers indicating a
stronger linear relationship between x and y, implying that the independent
variable is able to account for more variance in the dependent variable. R2 is
a very commonly used measure of the overall fit of the regression model
and, along with the parameter inference discussed below, serves as the
primary mechanism by which the relationship between the two variables is
quantified.
In order to obtain Sb1, the standard error of the slope estimate, we must first calculate the variance of the residuals,

Se² = Σei²/(N − p − 1) (1.8)

where ei is the residual value for individual i, N is the sample size, and p is the number of independent variables (1 in the case of simple regression). Then

Sb1 = √(1/(1 − R²)) · Se/√(Σ(xi − x̄)²) (1.9)

and

Sb0 = Sb1 √(Σxi²/n) (1.10)
Inference for Regression Parameters

Given that the sample intercept and slope are only estimates of the population parameters, researchers are quite often interested in testing hypotheses about whether the data represent a departure from what would be expected in what is commonly referred to as the null case; that is, whether the idea of the null value holding true in the population can be rejected. Most
frequently (though not always), the inference of interest concerns testing
that the population parameter is 0. In particular, a non-zero slope in the
population means that x is linearly related to y. Therefore, researchers
typically are interested in using the sample to make inference of whether
the population slope is 0 or not. Inference can also be made regarding the
intercept, and again the typical focus is on whether this value is 0 in the
population.
Inference about regression parameters can be made using confidence
intervals and hypothesis tests. Much as with the confidence interval of the
mean, the confidence interval of the regression coefficient yields a range of
values within which we have some level of confidence (e.g. 95%) that the
population parameter value resides. If our particular interest is in whether x
is linearly related to y, then we would simply determine whether 0 is in the
interval for β1. If so, then we would not be able to conclude that the
population value differs from 0. The absence of a statistically significant
result (i.e. an interval not containing 0) does not imply that the null
hypothesis is true, but rather it means that there is no sufficient evidence
available in the sample data to reject the null. Similarly, we can construct a
confidence interval for the intercept, and if 0 is within the interval, we
would conclude that the value of y for an individual with x = 0 could
plausibly be but is not necessarily 0. The confidence intervals for the slope
and intercept take the following forms:

b1 ± tcv Sb1 (1.11)

and

b0 ± tcv Sb0 (1.12)
Here, the parameter estimates and their standard errors are as described
previously, while tcv is the critical value of the t distribution for 1 − α/2
(e.g., the 0.975 quantile if α = 0.05) with n − p − 1 degrees of freedom. The
value of α is equal to 1 minus the desired level of confidence. Thus, for a
95% confidence interval (0.95 level of confidence), α would be 0.05.
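As a sketch of how these intervals are obtained in R, the confint() function returns them directly for a fitted model; the hand calculation below reproduces the slope interval from its estimate and standard error. The data are simulated purely for illustration.

```r
# Simulated data for illustration only
set.seed(42)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

# 95% confidence intervals for the intercept and slope
confint(fit, level = 0.95)

# The slope interval by hand: b1 +/- tcv * Sb1, where tcv is the
# 0.975 quantile of the t distribution with n - p - 1 df
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
tcv <- qt(0.975, df = fit$df.residual)
c(est - tcv * se, est + tcv * se)
```

The two results agree exactly, since confint() applies the same formula.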
In addition to confidence intervals, inference about the regression
parameters can be made using hypothesis tests. In general, the forms of
this test for the slope and intercept, respectively, are
tb1 = (b1 − β1)/Sb1 (1.13)

tb0 = (b0 − β0)/Sb0 (1.14)
The terms β1 and β0 are the parameter values under the null hypothesis.
Again, most often the null hypothesis posits that there is no linear relation-
ship between x and y (β1 = 0) and that the value of y = 0 when x = 0 (β0 = 0). For
simple regression, each of these tests is conducted with n − 2 degrees of
freedom.
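These t statistics, with 0 as the null value for each parameter, are exactly what R reports in the coefficient table of summary() for a fitted model. A brief sketch with simulated data:

```r
# Simulated data for illustration only
set.seed(5)
x <- rnorm(40)
y <- 0.5 + 1.5 * x + rnorm(40)
fit <- lm(y ~ x)

cs <- coef(summary(fit))
cs[, "t value"]                        # t statistics reported by summary()
cs[, "Estimate"] / cs[, "Std. Error"]  # the same values computed by hand
```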
Multiple Regression
The linear regression model can be easily extended to allow for multiple
independent variables at once. In the case of two regressors, the model takes
the form

yi = β0 + β1x1i + β2x2i + εi (1.15)

In many ways, this model is interpreted as that for simple linear regression.
The only major difference between simple and multiple regression interpretation is that each coefficient is interpreted holding constant the values of the other regressors in the model.
A test of the overall significance of the regression model can be conducted using the statistic

F = (SSR/p)/(SSE/(n − p − 1)) = ((n − p − 1)/p) · (R²/(1 − R²)) (1.16)
Here, terms are as defined in equation (1.7). This test statistic is distributed
as an F with p and n − p − 1 degrees of freedom. A statistically significant
result would indicate that one or more of the regression coefficients are not
equal to 0 in the population. Typically, the researcher would then refer to
the tests of individual regression parameters, which were described above,
in order to identify which were not equal to 0.
A second issue to be considered by researchers in the context of multiple
regression is the notion of adjusted R2. Stated simply, the inclusion of
additional independent variables in the regression model will always yield
higher values of R2, even when these variables are not statistically
significantly related to the dependent variable. In other words, there is a
capitalization on chance that occurs in the calculation of R2. As a conse-
quence, models including many regressors with negligible relationships
with y may produce an R2 that would suggest the model explains a great
deal of variance in y. An option for measuring the variance explained in the
dependent variable that accounts for this additional model complexity
would be quite helpful to the researcher seeking to understand the true
nature of the relationship between the set of independent variables and
dependent variables. Such a measure exists in the form of the adjusted R2
value, which is commonly calculated as
RA² = 1 − (1 − R²)(n − 1)/(n − p − 1) (1.17)
RA² only increases with the addition of an x if that x explains more variance than would be expected by chance. RA² will always be less than or equal to the standard R². It is generally recommended to use this statistic in practice when models containing many independent variables are used.
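Equation (1.17) is easy to verify against R's own output. In the simulated sketch below, x2 is pure noise by construction, so it inflates R² slightly while the adjusted value does not reward the extra term.

```r
# Simulated data for illustration only; x2 is unrelated to y
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.4 * x1 + rnorm(n)

s   <- summary(lm(y ~ x1 + x2))
r2  <- s$r.squared
r2a <- 1 - (1 - r2) * (n - 1) / (n - 2 - 1)   # equation (1.17) with p = 2

c(R2 = r2, adjusted = s$adj.r.squared, by.hand = r2a)
```

The hand-computed value matches summary()'s adjusted R-squared, since lm uses the same adjustment.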
A final important issue specific to multiple regression is that of collinearity,
which occurs when one independent variable is a linear combination of one
or more of the other independent variables. In such a case, regression
coefficients and their corresponding standard errors can be quite unstable,
resulting in poor inference. It is possible to investigate the presence of
collinearity using a statistic known as the variance inflation factor (VIF). In
order to obtain the VIF for xj, we would first regress xj on all of the other independent variables and obtain the resulting Rxj² value. We then calculate

VIFj = 1/(1 − Rxj²) (1.18)

The VIF will become large when Rxj² is near 1, indicating that xj has very little
unique variation when the other independent variables in the model are
considered. That is, if the other p − 1 regressors can explain a high proportion
of xj, then xj does not add much to the model, above and beyond the other p − 1
regression. Collinearity in turn leads to high sampling variation in bj, resulting in
large standard errors and unstable parameter estimates. Conventional rules of
thumb have been proposed for determining when an independent variable is
highly collinear with the set of other p − 1 regressors. Thus, the researcher might
consider collinearity to be a problem if VIF > 5 or 10 (Fox, 2016). The typical
response to collinearity is to either remove the offending variable(s) or use an
alternative approach to conducting the regression analysis such as ridge
regression or regression following a principal components analysis.
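The VIF calculation in equation (1.18) can be sketched in R as follows; the predictors here are simulated, with x2 deliberately built to be nearly collinear with x1. The car package (used again later in this chapter) provides the same diagnostic directly through its vif() function.

```r
# Simulated predictors for illustration only
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

# VIF for x1 by hand: regress x1 on the other predictors (equation 1.18)
r2.x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2.x1)   # well above the VIF > 5 or 10 rules of thumb

# The car package computes the same diagnostic for every predictor
library(car)      # install.packages("car") if not already available
vif(lm(y ~ x1 + x2 + x3))
```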
TABLE 1.1
Descriptive Statistics and Correlation for GPA and Test Anxiety
Variable Mean Standard Deviation Correlation
In this example, GPA is the dependent variable and test anxiety is the independent variable. The descriptive statistics for each variable and the correlation between the two appear in Table 1.1.
We can use this information to obtain estimates for both the slope and
intercept of the regression model using equations (1.5) and (1.6). First, the slope is calculated as b1 = −0.30(0.51/10.83) = −0.014, indicating that individuals
with higher test anxiety scores will generally have lower GPAs. Next, we can
use this value and information in the table to calculate the intercept estimate using equation (1.6). We can also test whether the slope differs significantly from 0 using equation (1.16). With R² = 0.09 and n = 440, we have

F = ((440 − 1 − 1)/1) · (0.09/(1 − 0.09)) = 438(0.10) = 43.80
This test has p and n − p − 1 degrees of freedom, or 1 and 438 in this situation.
The p-value of this test is less than 0.001, leading us to conclude that the slope
in the population is indeed significantly different from 0 because the p-value
is less than the Type I error rate specified. Thus, test anxiety is linearly related
to GPA. The same inference could be conducted using the t-test for the
slope. First, we must calculate the standard error of the slope estimate:
Sb1 = √(1/(1 − R²)) · Se/√(Σ(xi − x̄)²)

For these data, Se = √(104.71/(440 − 1 − 1)) = √0.24 = 0.49. The sum of squared deviations for x (anxiety) was 53743.64, and we previously calculated R² = 0.09, so

Sb1 = √(1/(1 − 0.09)) · 0.49/√53743.64 = 0.002

The resulting 95% confidence interval for the slope is then −0.014 ± 1.96(0.002).
The fact that 0 is not in the 95% confidence interval simply supports the
conclusion we reached using the p-value as described above. Also, given
this interval, we can infer that the actual population slope value lies
between −0.018 and −0.010. Thus, anxiety could plausibly have an effect as small in magnitude as −0.010 or as large as −0.018.
Regression in R
In R, the function call for fitting linear regression is lm, which is part of the
stats library that is loaded by default each time R is started on your
computer. The basic form for a linear regression model using lm is:
lm(formula, data)
where formula defines the linear regression form and data indicate the
dataset used in the analysis, examples of which appear below. Returning to
the previous example, predicting GPA from measures of physical (BStotal)
and cognitive academic anxiety (CTA.tot), the model is defined in R as:
Model1.1 <- lm(GPA ~ CTA.tot + BStotal, Cassidy)
This line of R code is referred to as a function call, and defines the regression
equation. The dependent variable, GPA, is followed by the independent variables
CTA.tot and BStotal, separated by ~. The dataset, Cassidy, is also given
here, after the regression equation has been defined. Finally, the output from this
analysis is stored in the object Model1.1. In order to view this output, we can
type the name of this object in R, and hit return to obtain the following:
Call:
lm(formula = GPA ~ CTA.tot + BStotal, data = Cassidy)
Coefficients:
(Intercept) CTA.tot BStotal
3.61892 −0.02007 0.01347
The output obtained from the basic function call will return only values for
the intercept and slope coefficients, lacking information regarding model fit
(e.g. R2) and significance of model parameters. Further information on our
model can be obtained by requesting a summary of the model.
summary(Model1.1)
Call:
lm(formula = GPA ~ CTA.tot + BStotal, data = Cassidy)
Residuals:
Min 1Q Median 3Q Max
−2.99239 −0.29138 0.01516 0.36849 0.93941
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.618924 0.079305 45.633 < 2e−16 ***
CTA.tot −0.020068 0.003065 −6.547 1.69e−10 ***
BStotal 0.013469 0.005077 2.653 0.00828 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the model summary, we can obtain information on model fit (overall
F test for significance, R2 and standard error of the estimate), parameter
significance tests, and a summary of residual statistics. As the F test for
the overall model is somewhat abbreviated in this output, we can request
the entire ANOVA result, including sums of squares and mean squares, by
using the anova(Model1.1) function call.
attributes(Model1.1)
$names
[1] "coefficients" "residuals" "effects" "rank"
"fitted.values"
[6] "assign" "qr" "df.residual" "na.action"
"xlevels"
[11] "call" "terms" "model"
$class
[1] "lm"
This is a list of attributes or information that can be pulled out of the fitted
regression model. In order to obtain this information from the fitted model,
we can call for the particular attribute. For example, if we would like to
obtain the predicted GPAs for each individual in the sample, we would
simply type the following followed by the enter key:
Model1.1$fitted.values
1 3 4 5 8 9 10 11 12
2.964641 3.125996 3.039668 3.125454 2.852730 3.152391 3.412460 3.011917 2.611103
13 14 15 16 17 19 23 25 26
3.158448 3.298923 3.312121 2.959938 3.205183 2.945928 2.904979 3.226064 3.245318
27 28 29 30 31 34 35 37 38
2.944573 3.171646 2.917635 3.198584 3.206267 3.073204 3.258787 3.118584 2.972594
39 41 42 43 44 45 46 48 50
2.870630 3.144980 3.285454 3.386064 2.871713 2.911849 3.166131 3.051511 3.251917
Thus, for example, the predicted GPA for subject 1 based on the prediction
equation would be 2.96. By the same token, we can obtain the regression
residuals with the following command:
Model1.1$residuals
1 3 4 5 8 9
−0.4646405061 −0.3259956916 −0.7896675749 −0.0254537419 0.4492704297 −0.0283914353
10 11 12 13 14 15
−0.1124596847 −0.5119169570 0.0888967457 −0.6584484215 −0.7989228998 −0.4221207716
16 17 19 23 25 26
−0.5799383942 −0.3051829226 −0.1459275978 −0.8649791080 0.0989363702 -0.2453184879
27 28 29 30 31 34
−0.4445727235 0.7783537067 −0.8176350301 0.1014160133 0.3937331779 −0.1232042042
35 37 38 39 41 42
0.3412126654 0.4814161689 0.9394056837 −0.6706295541 −0.5449795748 −0.4194540531
43 44 45 46 48 50
−0.4960639410 −0.0717134535 −0.4118490187 0.4338687432 0.7484894275 0.4480825762
From this output, we can see that the predicted GPA for the first individual
in the sample was approximately 0.465 points above the actual GPA (i.e.,
Observed GPA – Predicted GPA = 2.5 − 2.965).
summary(Model1.3)
Call:
lm(formula = GPA ~ CTA.tot + Male, data = Acad)
Residuals:
Min 1Q Median 3Q Max
−3.01149 -0.29005 0.03038 0.35374 0.96294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.740318 0.080940 46.211 < 2e−16 ***
CTA.tot −0.015184 0.002117 −7.173 3.16e−12 ***
Male −0.222594 0.047152 −4.721 3.17e−06 ***
In this example, the slope for the dummy variable Male is negative and
significant ( β = −0.223, p < .001) indicating that males have a significantly
lower mean GPA than do females.
Depending on the format in which the data are stored, the lm function is
capable of dummy coding categorical variables itself. If the variable has
been designated as categorical (this often happens if you have read your
data from an SPSS file in which the variable is designated as such), then
when the variable is used in the lm function it will automatically dummy
code the variable for you in your results. For example, if instead of using the
Male variable as described above, we used Gender as a categorical variable
coded as female and male, we would obtain the following results from the
model specification and summary commands.
summary(Model1.4)
Call:
lm(formula = GPA ~ CTA.tot + Gender, data = Acad)
Residuals:
Min 1Q Median 3Q Max
−3.01149 −0.29005 0.03038 0.35374 0.96294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.740318 0.080940 46.211 < 2e−16 ***
CTA.tot −0.015184 0.002117 −7.173 3.16e−12 ***
Gender[T.male] -0.222594 0.047152 −4.721 3.17e−06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The only difference between the two sets of results is that for Model1.4 R has reported the slope as Gender[T.male], indicating that the variable has been automatically dummy coded so that male is 1 and female is 0.
In the same manner, categorical variables consisting of more than two
categories can also be easily incorporated into the regression model, either
through the direct use of the categorical variable or dummy coded prior to
analysis. In the following example, the variable Ethnicity includes three
possible groups, African American, Other, and Caucasian. By including this
variable in the model call, we are implicitly requesting that R automatically
dummy code it for us.
summary(GPAmodel1.5)
Call:
lm(formula = GPA ~ CTA.tot + Ethnicity, data = Acad)
Residuals:
Min 1Q Median 3Q Max
−2.95019 −0.30021 0.01845 0.37825 1.00682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.670308 0.079101 46.400 < 2e−16 ***
CTA.tot −0.015002 0.002147 −6.989 1.04e−11 ***
Ethnicity[T.African American] −0.482377 0.131589 −3.666 0.000277 ***
Ethnicity[T.Other] −0.151748 0.136150 −1.115 0.265652
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Given that we have slopes for African American and Other, we know that
Caucasian serves as the reference category, which is coded as 0. Results
indicate a significant negative slope for African American (β = −0.482, p <
0.001), and a non-significant slope for Other (β = −.152, p > .05) indicating
that African Americans have a significantly lower GPA than Caucasians but
the Other ethnicity category was not significantly different from Caucasian
in terms of GPA.
Finally, let us consider some issues associated with allowing R to auto
dummy code categorical variables. First, R will always auto dummy code
the first category listed as the reference category. If a more theoretically
suitable dummy coding scheme is desired, it will be necessary to either
order the categories so that the desired reference category is first, or simply
recode into dummy variables manually. Also, it is important to keep in
mind that auto dummy coding only occurs when the variable is labeled in
the system as categorical. This will occur automatically if the categories
themselves are coded as letters. However, if a categorical variable is coded
1,2 or 1, 2, 3 but not specifically designated as categorical, the system will
view it as continuous and treat it as such. In order to ensure that a variable
is treated as categorical when that is what we desire, we simply use the
as.factor command. For the Male variable, in which males are coded as 1
and females as 0, we would type the following:
Male <- as.factor(Male)
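Relatedly, if a different reference category is desired, the relevel() function can be applied to a factor; a small sketch with made-up values:

```r
# Made-up values for illustration only
Gender <- factor(c("female", "male", "male", "female"))
levels(Gender)    # "female" comes first alphabetically, so it is the reference

# Make "male" the reference category instead
Gender2 <- relevel(Gender, ref = "male")
levels(Gender2)   # now "male" is listed first
```

When Gender2 is used in lm(), the reported dummy-coded slope would then represent females relative to males.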
Checking Regression Assumptions with R

library(car)
residualPlots(Model1.1)
FIGURE 1.1
Diagnostic residuals plots for regression model predicting GPA from CTA.tot and BStotal.
qqPlot(Model1.1)
[Figure: Q–Q plot of studentized residuals for Model1.1 against t quantiles]
Summary
This chapter introduced the reader to the basics of linear modeling using R. This
treatment was purposely limited, as there are a number of good texts available
on this subject, and it is not the main focus of this book. However, many of the
core concepts presented here for the GLM apply to multilevel modeling as well
and thus are of key importance as we move into more complex analyses. In
addition, much of the syntactical framework presented here will reappear in
subsequent chapters. In particular, readers should leave this chapter comfort-
able with interpretation of coefficients in linear models, as well as the concept of
variance explained in an outcome variable. We would encourage you to return
to this chapter frequently as needed in order to reinforce these basic concepts. In
addition, we would recommend that you also refer to the initial chapter dealing
with the basics of using R when questions regarding data management and
installation of specific R libraries become an issue. Next, in Chapter 2 we will
turn our attention to the conceptual underpinnings of multilevel modeling
before delving into their estimation in Chapters 3 and 4.
2
An Introduction to Multilevel Data Structure
DOI: 10.1201/b23166-2
treatment condition (in this case the school) would have an additional
impact on the outcome variable.
We typically refer to the data structure described above as nested,
meaning that individual data points at one level (e.g., student) appear in
only one level of a higher level variable such as school. Thus, students are
nested within school. Such designs can be contrasted with a crossed data
structure whereby individuals at the first level appear in multiple levels of
the second variable. In our example, students might be crossed with after-
school organizations if they are allowed to participate in more than one. For
example, a given student might be on the basketball team as well as in the
band. The focus of this book is almost exclusively on nested designs, which
give rise to multilevel data. Other examples of nested designs might
include a survey of job satisfaction for employees from multiple depart-
ments within a large business organization. In this case, each employee works within only a single department in the company, leading to a nested design. Furthermore, it seems reasonable to assume that employees working within the same department will have correlated responses on the satisfaction survey, as much of their view regarding the job would be based
exclusively upon experiences within their division. For a third such
example, consider the situation in which clients of several psychotherapists
working in a clinic are asked to rate the quality of each of their therapy
sessions. In this instance, there exist three levels in the data: time, in the
form of individual therapy session, client, and therapist. Thus, session is
nested in client, who in turn is nested within therapist. This data structure would be expected to lead to correlated scores on a therapy rating
instrument.
Intraclass Correlation
In cases where individuals are clustered or nested within a higher level unit
(e.g., classrooms, schools, school districts), it is possible to estimate the correlation among individuals' scores within the cluster/nested structure using the intraclass correlation (denoted ρI in the population). The ρI is a measure of the proportion of variation in the outcome variable that occurs between groups versus the total variation present, and it ranges from 0 (no variance between clusters) to 1 (all variance lies between clusters, with no within-cluster variance). ρI can also be conceptualized as the correlation on the dependent measure for two individuals randomly selected from the same cluster. It can be expressed as
\[ \rho_I = \frac{\tau^2}{\tau^2 + \sigma^2} \tag{2.1} \]
where τ² is the variation in the outcome between clusters and σ² is the variation within clusters.
Higher values of ρI indicate that a greater share of the total variation in the
outcome measure is associated with cluster membership, i.e., there is a
relatively strong relationship among the scores for two individuals from the
same cluster. Another way to frame this issue is that individuals within the
same cluster (e.g., school) are more alike on the measured variable than they
are like those in other clusters.
It is possible to estimate τ² and σ² using sample data, and thus it is also possible to estimate ρI. Those familiar with ANOVA will recognize these estimates as being related (though not identical) to the sums of squares. The sample estimate for variation within clusters is simply

\[ \hat{\sigma}^2 = \frac{\sum_{j=1}^{C}(n_j - 1)S_j^2}{N - C} \tag{2.2} \]

where
\( S_j^2 \) = variance within cluster j = \( \frac{\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2}{n_j - 1} \)
\( n_j \) = sample size for cluster j
\( N \) = total sample size
\( C \) = total number of clusters
The estimate for variation between clusters is

\[ \hat{S}_B^2 = \frac{\sum_{j=1}^{C} n_j(\bar{y}_j - \bar{y})^2}{\tilde{n}(C - 1)} \tag{2.3} \]

where

\[ \tilde{n} = \frac{1}{C - 1}\left(N - \frac{\sum_{j=1}^{C} n_j^2}{N}\right) \]

The between-cluster variance component is then estimated as

\[ \hat{\tau}^2 = \hat{S}_B^2 - \frac{\hat{\sigma}^2}{\tilde{n}}. \tag{2.4} \]
Using these variance estimates, we can in turn calculate the sample estimate of ρI:

\[ \hat{\rho}_I = \frac{\hat{\tau}^2}{\hat{\tau}^2 + \hat{\sigma}^2}. \tag{2.5} \]
Note that equation (2.5) assumes that the clusters are of equal size. Clearly, such
will not always be the case, in which case this equation will not hold. However,
the purpose of its inclusion here is to demonstrate the principle underlying the
estimation of ρΙ, which holds even as the equation might change.
In order to illustrate estimation of ρΙ, let us consider the following dataset.
Achievement test data were collected from 10,903 third-grade examinees
nested within 160 schools. School sizes range from 11 to 143, with a mean size
of 68.14. In this case, we will focus on the reading achievement test score and
will use data from only five of the schools, in order to make the calculations
by hand easy to follow. First, we will estimate σ̂². To do so, we must estimate
the variance in scores within each school. These values appear in Table 2.1.
Using these variances and sample sizes, we can calculate σ̂² as

\[ \hat{\sigma}^2 = \frac{\sum_{j=1}^{C}(n_j - 1)S_j^2}{N - C} = \frac{(58-1)5.3 + (29-1)1.5 + (64-1)2.9 + (39-1)6.1 + (88-1)3.4}{278 - 5} = \frac{302.1 + 42 + 182.7 + 231.8 + 295.8}{273} = \frac{1054.4}{273} = 3.9 \]
TABLE 2.1
School Size, Mean, and Variance of Reading Achievement Test
School N Mean Variance
The school means, which are needed in order to calculate SB², appear in Table 2.1 as well. First, we must calculate ñ:
\[ \tilde{n} = \frac{1}{C-1}\left(N - \frac{\sum_{j=1}^{C} n_j^2}{N}\right) = \frac{1}{5-1}\left(278 - \frac{58^2 + 29^2 + 64^2 + 39^2 + 88^2}{278}\right) = \frac{1}{4}(278 - 63.2) = 53.7 \]
Using this value and the school means in Table 2.1, we can then calculate SB² for the five schools in our small sample using equation (2.3), which yields SB² = 0.140. The between-cluster variance estimate is then

\[ \hat{\tau}^2 = \hat{S}_B^2 - \frac{\hat{\sigma}^2}{\tilde{n}} = 0.140 - \frac{3.9}{53.7} = 0.140 - 0.073 = 0.067 \]
We have now calculated all of the parts that we need to estimate ρI for the
population,
\[ \hat{\rho}_I = \frac{0.067}{0.067 + 3.9} = 0.017 \]
This result indicates that there is very little correlation of examinees’ test
scores within the schools. We can also interpret this value as the proportion
of variation in the test scores that is accounted for by the schools.
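The hand calculations above are easy to replicate in base R, using the school sizes and variances from Table 2.1 (SB² is taken as 0.140, the value computed from the school means via equation (2.3)):

```r
# replicate the five-school ICC calculation by hand
n  <- c(58, 29, 64, 39, 88)        # school sample sizes, n_j
s2 <- c(5.3, 1.5, 2.9, 6.1, 3.4)   # within-school variances, S_j^2
C  <- length(n)                    # number of clusters
N  <- sum(n)                       # total sample size, 278
sigma2.hat <- sum((n - 1) * s2) / (N - C)         # equation (2.2), about 3.9
n.tilde    <- (N - sum(n^2) / N) / (C - 1)        # about 53.7
SB2        <- 0.140                # between-school estimate from equation (2.3)
tau2.hat   <- SB2 - sigma2.hat / n.tilde          # about 0.067
rho.hat    <- tau2.hat / (tau2.hat + sigma2.hat)  # about 0.017
```

The small rounding differences relative to the text come from carrying more decimal places in the intermediate quantities.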
Given that ˆI is a sample estimate, we know that it is subject to sampling
variation, which can be estimated with a standard error as follows:
\[ s_{\rho_I} = (1 - \rho_I)(1 + (n - 1)\rho_I)\sqrt{\frac{2}{n(n-1)(N-1)}}. \tag{2.6} \]
The terms in (2.6) are as defined previously, and the assumption is that all
clusters are of equal size. As noted earlier in the chapter, this latter
condition is not a requirement, however, and an alternative formulation
exists for cases in which it does not hold. However, (2.6) provides sufficient
insight for our purposes into the estimation of the standard error of the ICC.
The ICC is an important tool in multilevel modeling, in large part because
it is an indicator of the degree to which the multilevel data structure might
impact the outcome variable of interest. Larger values of the ICC are
indicative of a greater impact of clustering. Thus, as the ICC increases in
value, we must be more cognizant of employing multilevel modeling
strategies in our data analysis. In the next section, we will discuss the
problems associated with ignoring this multilevel structure, before we turn
our attention to methods for dealing with it directly.
among variables at different levels. This greater model complexity in turn may
lead to greater understanding of the phenomenon under study.
Random Intercept
As we transition from the level-1 regression framework of Chapter 1 to the
MLM context, let’s first revisit the basic simple linear regression model of
equation (1.1), y = β0 + β1x + ε. Here, the dependent variable y is expressed as a function of an independent variable x multiplied by a slope coefficient β1, plus an intercept β0 and random variation from subject to subject, ε. We
defined the intercept as the conditional mean of y when the value of x is 0.
In the context of a single-level regression model such as this, there is one
intercept that is common to all individuals in the population of interest.
However, when individuals are clustered together in some fashion (e.g.,
within classrooms, schools, organizational units within a company), there
will potentially be a separate intercept for each of these clusters, that is,
there may be different means for the dependent variable for x = 0 across the
different clusters. We say potentially here because if there is in fact no cluster
effect, then the single intercept model of (1.1) will suffice. In practice,
assessing whether there are different means across the clusters is an
empirical question, which we describe below. It should be noted that in
this discussion we are considering only the case where the intercept is
cluster specific, but it is also possible for β1 to vary by group, or even other
coefficients from more complicated models.
Allowing for group-specific intercepts and slopes leads to the following
notation commonly used for the level-1 (micro level) model in multilevel
modeling:
\[ y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + \varepsilon_{ij} \tag{2.7} \]
where the subscripts ij refer to the ith individual in the jth cluster. As we
continue our discussion of multilevel modeling notation and structure, we
will begin with the most basic multilevel model: predicting the outcome
from just an intercept which we will allow to vary randomly for each group
\[ y_{ij} = \beta_{0j} + \varepsilon_{ij}. \tag{2.8} \]

The random intercept is in turn expressed as

\[ \beta_{0j} = \gamma_{00} + U_{0j} \tag{2.9} \]

and substituting (2.9) into the level-1 model yields

\[ y_{ij} = \gamma_{00} + U_{0j} + \beta_1 x_{ij} + \varepsilon_{ij} \tag{2.10} \]
Equation (2.10) is termed the full or composite model in which the multiple
levels are combined into a unified equation.
Often in MLM, we begin our analysis of a dataset with this simple
random intercept model, known as the null model, which takes the form

\[ y_{ij} = \gamma_{00} + U_{0j} + \varepsilon_{ij}. \tag{2.11} \]
While the null model does not provide information regarding the impact of
specific independent variables on the dependent variables, it does yield
important information regarding how variation in y is partitioned between
variance among the individuals σ2 and variance among the clusters τ2. The
total variance of y is simply the sum of σ2 and τ2. In addition, as we have already
seen, these values can be used to estimate ρI. The null model, as will be seen in
later sections, is also used as a baseline for model building and comparison.
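To build intuition for this variance partition, we can simulate data from a null model in base R, using arbitrary, illustrative variance components:

```r
# simulate nested data from the null model with chosen tau^2 and sigma^2
set.seed(1)
C <- 200; n <- 30                    # 200 clusters of 30 individuals each
tau2 <- 2; sigma2 <- 8               # between- and within-cluster variances
U0 <- rnorm(C, 0, sqrt(tau2))        # cluster effects U_0j
y  <- 5 + rep(U0, each = n) + rnorm(C * n, 0, sqrt(sigma2))
var(y)  # should be near tau2 + sigma2 = 10
```

With τ² = 2 and σ² = 8, roughly 20% of the total variation in y lies between clusters, i.e., ρI ≈ 0.2.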
Random Slopes
It is a simple matter to expand the random intercept model in (2.9) to
accommodate one or more independent predictor variables. As an example,
if we add a single predictor (xij) at the individual level (level 1) to the model, we obtain

\[ y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + U_{0j} + \varepsilon_{ij} \tag{2.12} \]

where the level-1 slope is held constant across clusters:

\[ \beta_{1j} = \gamma_{10} \tag{2.15} \]
This model now includes the predictor and the slope relating it to the
dependent variable γ10, which we acknowledge as being at level 1 by the
subscript 10. We interpret γ10 in the same way that we did β1 in the linear
regression model, i.e., a measure of the impact on y of a 1 unit change in x.
In addition, we can estimate ρI exactly as before, though now it reflects the
correlation between individuals from the same cluster after controlling for
the independent variable x. In this model, both γ10 and γ00 are fixed effects,
while σ2 and τ2 remain random.
One implication of the model in (2.12) is that the dependent variable is
impacted by variation among individuals (σ2), variation among clusters (τ2),
an overall mean common to all clusters (γ00), and the impact of the independent variable as measured by γ10, which is also common to all
clusters. In practice, there is no reason that the impact of x on y would need
to be common for all clusters, however. In other words, it is entirely possible
that rather than having a single γ10 common to all clusters, there is actually a unique effect for the cluster of γ10 + U1j, where γ10 is the average relationship of x with y across clusters, and U1j is the cluster-specific variation of the
relationship between the two variables. This cluster-specific effect is
assumed to have a mean of 0 and to vary randomly around γ10. The random
slopes model is
\[ y_{ij} = \gamma_{00} + \gamma_{10} x_{ij} + U_{0j} + U_{1j} x_{ij} + \varepsilon_{ij} \tag{2.16} \]

Written in this way, we have separated the model into its fixed (γ00 + γ10xij) and random (U0j + U1jxij + εij) components. Model (2.16) simply states that
there is an interaction between cluster and x such that the relationship of x
and y is not constant across clusters.
Heretofore we have discussed only one source of between-group varia-
tion, which we have expressed as τ2, and which is the variation among
clusters in the intercept. However, model (2.16) adds a second such source
of between-group variance in the form of U1j, which is cluster variation on
the slope relating the independent and dependent variables. In order to
differentiate between these two sources of between-group variance, we now
denote the variance of U0j as τ0² and the variance of U1j as τ1². Furthermore, within clusters, we expect U1j and U0j to have a covariance of τ01. However,
across different clusters, these terms should be independent of one another,
and in all cases it is assumed that ε remains independent of all other model
terms. In practice, if we find that τ1² is not 0, we must be careful in describing
the relationship between the independent and dependent variables, as it is
not the same for all clusters. We will revisit this idea in subsequent chapters.
For the moment, however, it is most important to recognize that variation in
the dependent variable, y, can be explained by several sources, some fixed
and others random. In practice, we will most likely be interested in
estimating all of these sources of variability in a single model.
As a means for further understanding the MLM, let’s consider a simple
example using the five schools described above. In this context, we are
interested in treating a reading achievement test score as the dependent
variable and a vocabulary achievement test score as the independent
variable. Remember that students are nested within schools so that a simple
regression analysis will not be appropriate. In order to understand what is
being estimated in the context of MLM, we can obtain separate intercept
and slope estimates for each school which appear in Table 2.2.
Given that the schools are of the same sample size, the estimate of γ00, the average intercept value, is 2.359, and the estimate of the average slope value γ10 is 0.375. Notice that for both parameters, the school values deviate from
these means. For example, the intercept for school 1 is 1.230. The difference
between this value and 2.359, −1.129, is U0j for that school. Similarly, the
difference between the average slope value of 0.375 and the slope for school
1, 0.552, is 0.177, which is U1j for this school. Table 2.2 includes U0j and U1j
values for each of the schools. The differences in slopes also provide
information regarding the relationship between vocabulary and reading test
scores. For all of the schools, this relationship was positive, meaning that
students who scored higher on vocabulary also scored higher on reading.
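The deviations for school 1 can be computed directly from the averages reported above (values from Table 2.2):

```r
# school-specific deviations from the average intercept and slope
gamma00 <- 2.359; gamma10 <- 0.375     # average intercept and slope
b0.school1 <- 1.230; b1.school1 <- 0.552
U0.school1 <- b0.school1 - gamma00     # -1.129
U1.school1 <- b1.school1 - gamma10     #  0.177
```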
TABLE 2.2
Regression intercepts, slopes, and error terms for schools
School Intercept U0j Slope U1j
However, the strength of this relationship was weaker for school 2 than for
school 1, as an example.
Given the values in Table 2.2, it is also possible to estimate the variances associated with U1j and U0j, τ1² and τ0², respectively. Again, because the schools in this example had the same number of students, the calculation of these variances is a straightforward matter, using

\[ \hat{\tau}_1^2 = \frac{\sum_j (U_{1j} - \bar{U}_1)^2}{J - 1} \tag{2.17} \]
for the slopes, and an analogous equation for the intercept random variance.
Doing so, we obtain τ0² = 0.439 and τ1² = 0.016. In other words, much more of
the variance in the dependent variable is accounted for by variation in the
intercepts at the school level than is accounted for by variation in the slopes.
Another way to think of this result is that the schools exhibited greater
differences among one another in the mean level of achievement as
compared to differences in the impact of x on y.
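Equation (2.17) is simply the sample variance of the U1j values, so R's built-in var() function produces the same result; a sketch with hypothetical slope deviations:

```r
# hypothetical slope deviations U_1j for J = 5 schools
U1 <- c(0.18, -0.05, 0.10, -0.15, -0.08)
tau1.sq <- sum((U1 - mean(U1))^2) / (length(U1) - 1)  # equation (2.17)
var(U1)  # var() computes the same quantity
```

The analogous calculation on the U0j column of Table 2.2 yields the intercept variance τ0².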
The actual practice of obtaining these variance estimates using the R
environment for statistical computing and graphics and interpreting their
meaning is the subject for the coming chapters. Before discussing the
practical nuts and bolts of conducting this analysis, we will first examine the
basics of how parameters are estimated in the MLM framework using
maximum likelihood and restricted maximum likelihood algorithms. While
similar in spirit to the simple calculations demonstrated above, they are
different in practice and will yield somewhat different results than those we
would obtain using least squares, as above. Prior to this discussion,
however, there is one more issue that warrants our attention as we consider
the practice of MLM, namely variable centering.
Centering
Centering simply refers to the practice of subtracting the mean of a variable from each individual value. In group mean centering, the mean of each cluster (e.g., school) is subtracted from the scores of the individuals in that cluster. Group mean centering of the values of vocabulary in the analysis would mean that we are investigating the relationship between one's relative vocabulary score in his or her school and
his or her reading score. In contrast, the use of grand mean centering would
examine the relationship between one’s relative standing in the sample as a
whole on vocabulary and the reading score. This latter interpretation would
be equivalent conceptually (though not mathematically) to using the raw
score, while the group mean centering would not. Throughout the rest of this
book, we will use grand mean centering by default, per recommendations by
Hox (2002), among others. At times, however, we will also demonstrate the
use of group mean centering in order to illustrate how it provides different
results, and for applications in which interpretation of the impact of an
individual’s relative standing in their cluster might be more useful than their
relative standing in the sample as a whole.
observed data (i.e., produce predicted values that are as close as possible to
the observed), and as such can be computationally intensive, particularly
for complex models and large samples.
unrelated to errors at the individual level. Third, the level-1 residuals are
normally distributed and have a constant variance. This assumption is very
similar to the one we make about residuals in the standard linear regression
model. Fourth, the level-2 intercept and slope(s) have a multivariate normal
distribution with a constant covariance matrix. Each of these assumptions
can be directly assessed for a sample, as we shall see in forthcoming
chapters. Indeed, the methods for checking the MLM assumptions are not
very different from those for checking the regression model that we used in
Chapter 1.
and
The additional piece in (2.19) is γh1zj, the product of a slope (γh1) and the value of the average vocabulary score for the school (zj). In other words,
the mean school performance is related directly to the coefficient linking the
individual vocabulary score to the individual reading score. For our specific
example, we can combine (2.18) and (2.19) in order to obtain a single equation for the level-2 MLM.
Each of these model terms has been defined previously in the chapter: γ00 is
the intercept or the grand mean for the model, γ10 is the fixed effect of
variable x (vocabulary) on the outcome, U0j represents the random variation
for the intercept across groups, and U1j represents the random variation for
the slope across groups. The additional pieces in (2.20) are γ01 and γ11. γ01 represents the fixed effect of level-2 variable z (average vocabulary) on the outcome, and γ11 is the coefficient for the interaction of z with the level-1 slope. The new term in model (2.20) is the cross-level interaction, γ11xijzj. As the name implies, the cross-level interaction is
simply the interaction between the level-1 and level-2 predictors. In this
context, it is the interaction between an individual’s vocabulary score and
the mean vocabulary score for their school. The coefficient for this interaction term, γ11, assesses the extent to which the relationship between an examinee's vocabulary score and reading score is moderated by the mean for the school
that they attend. A large significant value for this coefficient would indicate
that the relationship between a person’s vocabulary test score and their
overall reading achievement is dependent on the level of vocabulary
achievement at their school.
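Constructing the pieces of a cross-level interaction is straightforward in base R: the level-2 predictor zj is the cluster mean of the level-1 variable, and the interaction term is their product. A small illustration with hypothetical data (ave() returns each student's school mean):

```r
# hypothetical mini-example: 2 schools with 3 students each
school <- rep(1:2, each = 3)
vocab  <- c(4, 5, 6, 2, 3, 4)       # individual vocabulary scores, x_ij
z  <- ave(vocab, school)            # school mean vocabulary, z_j
xz <- vocab * z                     # cross-level interaction term, x_ij * z_j
```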
Here, the subscript k represents the level-3 cluster to which the individual
belongs. Prior to formulating the rest of the model, we must evaluate if the
slopes and intercepts are random at both levels 2 and 3, or only at level 1, for
example. This decision should always be based on the theory surrounding the
research questions, what is expected in the population, and what is revealed
in the empirical data. We will proceed with the remainder of this discussion
under the assumption that the level-1 intercepts and slopes are random for
We can then use simple substitution to obtain the expression for the level-1
intercept and slope in terms of both level-2 and level-3 parameters.
and
In turn, these terms can be substituted into equation (2.15) to provide the
full level-3 MLMs.
\[ y_{ijk} = \gamma_{000} + V_{00k} + U_{0jk} + (\gamma_{100} + V_{10k} + U_{1jk}) x_{ijk} + \varepsilon_{ijk}. \tag{2.24} \]
collection of data from the same individuals at multiple points in time. For
example, we may have reading achievement scores for examinees in the
fall and spring of the school year. With such a design, we would be able to
investigate issues around growth scores and change in achievement over
time. Such models can be placed in the context of an MLM where the
examinee is the level-2 (cluster) variable, and the individual test adminis-
tration is at level-1. We would then simply apply the level-2 model
described above, including whichever examinee level variables that are
appropriate for explaining reading achievement. Similarly, if examinees
are nested within schools, we would have a level-3 model, with school at
the third level, and could apply model (2.24), once again with whichever
examinee or school-level variables were pertinent to the research question.
One unique aspect of fitting longitudinal data in the MLM context is that
the error terms can potentially take specific forms that are not common in
other applications of multilevel analysis. These error terms reflect the way
in which measurements made over time are related to one another, and
are typically more complex than the basic error structure that we have
described thus far. In Chapter 5, we will look at examples of fitting such
longitudinal models with R, and focus much of our attention on these
error structures, when each is appropriate, and how they are interpreted.
In addition, such MLMs need not take a linear form, but can be adapted to
fit quadratic, cubic, or other non-linear trends over time. These issues will
be further discussed in Chapter 5.
Summary
The goal of this chapter was to introduce the basic theoretical under-
pinnings of multilevel modeling, but not to provide an exhaustive
technical discussion of these issues, as there are a number of useful
sources available in this regard, which you will find among the references
at the end of the text. However, what is given here should stand you in
good stead as we move forward with multilevel modeling using R
software. We recommend that while reading subsequent chapters you
make liberal use of the information provided here, in order to gain a more
complete understanding of the output that we will be examining from R.
In particular, when interpreting output from R, it may be very helpful for
you to come back to this chapter for reminders on precisely what each
model parameter means. In the next two chapters, we will take the
theoretical information from Chapter 2 and apply it to real datasets using
two different R libraries, nlme and lme4, both of which have been
For simple linear multilevel models, the only necessary R subcommands are
the formula (consisting of the fixed and random effects) and the data. The
rest of the subcommands can be used to customize models and provide
additional output. This chapter will first focus on the definition of simple
multilevel models and then demonstrate a few options for model customi-
zation and assumption checking.
DOI: 10.1201/b23166-3
Model3.0 <- lmer(geread ~ 1 + (1 | school), data = Achieve)
summary(Model3.0)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ 1 + (1 | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
−2.3229 −0.6378 −0.2138 0.2850 3.8812
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.3915 0.6257
Residual 5.0450 2.2461
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.30675 0.05498 78.34
In the first part of the function call, we define the formula for the model fixed
effects, which is very similar to the model definition of linear regression using
lm(). The statement geread~gevocab essentially says that the reading
score is predicted with the vocabulary score fixed effect. The call in
parentheses defines the random effects and the nesting structure. If only a
random intercept is desired, the syntax for the intercept is “1”. In this
example, (1|school) indicates that only a random intercept model will be used and that the random intercept varies by school. This corresponds to
the data structure of students nested within schools. Fitting this model, which
is saved in the output object Model3.1, we obtain the following output.
Model3.1 <- lmer(geread ~ gevocab + (1 | school), data = Achieve)
summary(Model3.1)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + (1 | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
−3.0823 −0.5735 −0.2103 0.3207 4.4334
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09978 0.3159
Residual 3.76647 1.9407
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.023356 0.049309 41.03
gevocab 0.512898 0.008373 61.26
we have the correlation estimate between the fixed effect slope and the fixed
effect intercept, as well as a brief summary of the model residuals, including
the minimum, maximum, and first, second (median), and third quartiles.
The correlation of the fixed effects represents the estimated correlation, if
we had repeated samples, of the two fixed effects (i.e., the intercept and
slope for gevocab). Oftentimes, this correlation is not particularly
interesting. From this output we can see that gevocab is a statistically
significant predictor of geread (t = 61.26), and that as vocabulary score
increases by 1 point, reading ability increases by 0.513 points. Note that
we will revisit the issue of statistical significance when we examine
confidence intervals for the model parameter, below. Indeed, we would
recommend that researchers rely on the confidence intervals when making
such determinations.
In addition to getting estimates of the fixed effects in model 3.1, we
can ascertain how much variation in geread is present across schools.
Specifically, the output shows that after accounting for the impact of
gevocab, the estimate of variation in intercepts across schools is
0.09978, while the within-school variation is estimated as 3.76647.
We can tie these numbers directly back to our discussion in Chapter 2,
where τ0² = 0.09978 and σ² = 3.76647. In addition, the overall fixed intercept, denoted as γ00 in Chapter 2, is 2.023, and can be interpreted as the mean of geread when the gevocab score is 0.
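As described in Chapter 2, these two variance components also yield the ICC, here conditional on gevocab; a quick calculation:

```r
# conditional ICC from the Model3.1 variance components
tau2   <- 0.09978   # between-school intercept variance
sigma2 <- 3.76647   # within-school residual variance
rho <- tau2 / (tau2 + sigma2)   # about 0.026
```

That is, after controlling for vocabulary, roughly 2.6% of the remaining variation in reading scores lies between schools.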
Finally, it is possible to estimate the proportion of variance in the
outcome variable that is accounted for at each level of the model. In
Chapter 1, we saw that with single-level OLS regression models, the
proportion of response variable variance accounted for by the model
is expressed as R². In the context of multilevel modeling, this process is made more complicated by the need to partition the total explained variance into within- and between-cluster components. There are multiple approaches
for calculating these R2 values. One technique provides estimates for
each level of the model (Snijders & Bosker, 1999). For level 1, we can
calculate
\[ R_1^2 = 1 - \frac{\hat{\sigma}_{M1}^2 + \hat{\tau}_{M1}^2}{\hat{\sigma}_{M0}^2 + \hat{\tau}_{M0}^2} = 1 - \frac{3.76647 + 0.09978}{5.045 + 0.3915} = 1 - \frac{3.86625}{5.4365} = 1 - 0.7112 = 0.2889. \]
This result tells us that level 1 of model 3.1 explains approximately 29% of
the variance in the reading score above and beyond that accounted for
in the null model. We can also calculate a level-2 R²:

\[ R_2^2 = 1 - \frac{\hat{\sigma}_{M1}^2/B + \hat{\tau}_{M1}^2}{\hat{\sigma}_{M0}^2/B + \hat{\tau}_{M0}^2} \]

where B is the average size of the level-2 units, or schools in this case. R
provides us with the number of individuals in the sample, 10,320, and the number of schools, 160, giving B = 10320/160 = 64.5.
\[ R_2^2 = 1 - \frac{3.76647/64.5 + 0.09978}{5.045/64.5 + 0.3915} = 1 - 0.7493 = 0.2507 \]
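The level-1 calculation can be verified in base R using the variance components from the two model summaries:

```r
# Snijders & Bosker level-1 R^2 from the null and predictor models
sigma2.M0 <- 5.0450;  tau2.M0 <- 0.3915     # null model components
sigma2.M1 <- 3.76647; tau2.M1 <- 0.09978    # components after adding gevocab
R2.level1 <- 1 - (sigma2.M1 + tau2.M1) / (sigma2.M0 + tau2.M0)  # about 0.289
```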
library(mitml)
multilevelR2(Model3.1)
RB1 RB2 SB MVP
0.2534263 0.7451478 0.2888380 0.2761428
The RB1 and RB2 values correspond to level-1 and level-2 variance explained
according to the formula presented in Raudenbush and Bryk (2002). SB
corresponds to the Snijders and Bosker (1999) level-1 estimate (as computed
above) and the MVP is the total variance explained through multilevel
variance partitioning introduced by LaHuis, Hartman, Hakoyama, and Clark
(2014).
An alternative framework for calculating R2 in the context of MLMs was
described by Rights and Sterba (2019). These authors note that prior work in
the area of variance accounted for with MLMs had several shortcomings,
chief among these being a lack of agreement on how to interpret this
construct and gaps in terms of the sources of variance that can be
represented in the measures. Therefore, Rights and Sterba aimed to develop
a more comprehensive and well-defined framework for MLM R2 and a
software environment in which these statistics can be obtained by
researchers. We will not delve into the full technical details of this approach
here, but the interested reader is encouraged to read the paper by Rights
and Sterba, which describes the approach in full detail.
Rights and Sterba (2019) defined the variance explained by level-1 fixed
effects (f1), level-2 fixed effects (f2), random coefficient variance (v), and
random intercept variance (m) as shown below.
\[ R^{2(f1)}_{total} = \frac{\sigma_{f1}^2}{\sigma_{f1}^2 + \sigma_{f2}^2 + \sigma_{v}^2 + \sigma_{m}^2 + \sigma_{r}^2} \]

\[ R^{2(f2)}_{total} = \frac{\sigma_{f2}^2}{\sigma_{f1}^2 + \sigma_{f2}^2 + \sigma_{v}^2 + \sigma_{m}^2 + \sigma_{r}^2} \]

\[ R^{2(v)}_{total} = \frac{\sigma_{v}^2}{\sigma_{f1}^2 + \sigma_{f2}^2 + \sigma_{v}^2 + \sigma_{m}^2 + \sigma_{r}^2} \]

\[ R^{2(m)}_{total} = \frac{\sigma_{m}^2}{\sigma_{f1}^2 + \sigma_{f2}^2 + \sigma_{v}^2 + \sigma_{m}^2 + \sigma_{r}^2} \]
The residual variance ( r2 ) includes all of the variability in the outcome that
is not accounted for by the fixed effects, as well as random intercept and
slope variance. These R² values can be obtained using the r2mlm library in R.
r2mlm(Model3.1)
$Decompositions
total
fixed 0.27317389
slope variation 0.00000000
mean variation 0.01987598
sigma2 0.70695013
$R2s
total
f 0.27317389
v 0.00000000
m 0.01987598
fv 0.27317389
fvm 0.29304987
The r2mlm function provides us with two sets of results. First, we have the
variance decomposition associated with each term, including the fixed effects
(0.27), the random slope (0), the random intercept (0.02), and the residual
(0.71). Next, the R2 values associated with each term and their various
combinations are provided. Thus, the independent variable gevocab ac-
counted for approximately 27% of the variance (f) in the dependent variable
geread. An additional 2% of the variance (m) in reading scores was accounted
for by variation in score means across schools. The fv term refers to the
combined variance accounted for by the independent variable fixed effect and
variation in the slope across schools. Since the slope did not have a random
component in this model, the variance associated with it and the R2 were 0. The
total variance in reading scores accounted for by the fixed and random effects
corresponds to the fvm term, and is approximately 29% for this model.
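As a check on how these pieces fit together, the reported decomposition proportions sum to 1, and fvm is simply the sum of f, v, and m:

```r
# verify the r2mlm decomposition reported above
f <- 0.27317389; v <- 0; m <- 0.01987598; resid.prop <- 0.70695013
fv  <- f + v        # fixed effects plus random slope variation
fvm <- f + v + m    # total explained by fixed and random effects, about 0.293
f + v + m + resid.prop   # the four proportions sum to 1
```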
The model in the previous example was quite simple, only incorporating
one level-1 predictor. In many applications, researchers will have predictor
variables at both level 1 (student) and level 2 (school). Incorporation of
predictors at higher levels of analysis is very straightforward in R, and is
done exactly in the same manner as incorporation of level-1 predictors. For
example, let’s assume that in addition to a student’s vocabulary test
performance, the researcher wanted to determine whether the size of the
school (senroll) has a statistically significant impact on the overall
reading score. In that instance, adding the level-2 predictor, school
enrollment, would result in the following R syntax:

Model3.2 <- lmer(geread ~ gevocab + senroll + (1|school), data=Achieve)
summary(Model3.2)

Scaled residuals:
Min 1Q Median 3Q Max
−3.0834 −0.5729 −0.2103 0.3212 4.4336
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1003 0.3168
Residual 3.7665 1.9408
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.0748819 0.1140074 18.20
gevocab 0.5128708 0.0083734 61.25
senroll −0.0001026 0.0002051 -0.50
Note that in this particular function call, senroll is included only in the
fixed part of the model and not the random part. This variable thus has only
a fixed (average) effect, which is the same across all schools. We will see
shortly how to incorporate a random coefficient in this model.
From these results, we can see that, in fact, enrollment did not have a
statistically significant relationship with reading achievement (t = −0.50). In
addition, notice that there were some minor changes in the estimates of the
other model parameters, but a fairly large change in the correlation between
the fixed-effect slope for gevocab and the fixed-effect intercept, from
−0.758 to −0.327. The slope for senroll and the intercept were very
strongly negatively correlated, while the two fixed-effect slopes exhibited
virtually no correlation with one another (−0.002). As noted before, these correlations are
typically not very informative in terms of understanding the dependent
variable, and will therefore rarely be discussed in any detail in reporting
analysis results. The R2 values for levels 1 and 2 are shown below.
R^2_1 = 1 − (σ^2_M1 + τ^2_M1)/(σ^2_M0 + τ^2_M0) = 1 − (3.7665 + .1003)/(5.045 + .3915) = 1 − 3.8668/5.4365 = 1 − .7112 = .2887

R^2_2 = 1 − (σ^2_M1/B + τ^2_M1)/(σ^2_M0/B + τ^2_M0) = 1 − (3.7665/64.5 + .1003)/(5.045/64.5 + .3915) = 1 − .7494 = .2506

where B = 64.5 is the average number of students per school.
> multilevelR2(Model3.2)
RB1 RB2 SB MVP
0.2534108 0.7437147 0.2887204 0.2760608
These values are nearly identical to those obtained for the model without
senroll. This is because there was very little change in the variances when
senroll was included in the model; thus, the amount of variance explained at
each level was largely unchanged.
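The level-1 calculation above can be verified with a few lines of R arithmetic, using the variance components printed for the null model and for the model containing the predictors:

```r
# Snijders & Bosker level-1 R^2 by hand, from the variance components
# printed above (sigma2 = residual variance, tau2 = intercept variance).
sigma2_M0 <- 5.045;  tau2_M0 <- 0.3915   # null model (Model 3.0)
sigma2_M1 <- 3.7665; tau2_M1 <- 0.1003   # model with predictors
R2_1 <- 1 - (sigma2_M1 + tau2_M1) / (sigma2_M0 + tau2_M0)
round(R2_1, 4)  # approximately 0.2887
```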
for two level-1 variables (Model 3.3) and a cross-level interaction involving
level-1 and level-2 variables (Model 3.4).
Model 3.3 defines a multilevel model in which two level-1 (student-level)
predictors interact with one another. Model 3.4 defines a multilevel model
with a cross-level interaction, in which a level-1 (student-level) and a level-2
(school-level) predictor interact. As can be seen, there is no difference in
the treatment of variables at different levels when computing interactions:

Model3.3 <- lmer(geread ~ gevocab + age + gevocab*age + (1|school), data=Achieve)
Model3.4 <- lmer(geread ~ gevocab + senroll + gevocab*senroll + (1|school), data=Achieve)
summary(Model3.3)
Linear mixed model fit by REML [‘lmerMod’]
Formula: geread ~ gevocab + age + gevocab * age + (1 | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
−3.0635 −0.5706 -0.2108 0.3191 4.4467
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09875 0.3143
Residual 3.76247 1.9397
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 5.187208 0.866786 5.984
gevocab −0.028077 0.188145 −0.149
age −0.029368 0.008035 −3.655
gevocab:age 0.005027 0.001750 2.873
Looking at the output from Model 3.3, both age (t = −3.65) and the
interaction (gevocab:age) between age and vocabulary (t = 2.87) are
statistically significant predictors of reading. Focusing on the interaction,
the sign on the coefficient is positive indicating an enhancing effect: As
age increases, the relationship between reading and vocabulary becomes
stronger. Interestingly, when both age and the interaction are included in
the model, the relationship between vocabulary score and reading perform-
ance is no longer statistically significant.
Next, let’s examine a model that includes a cross-level interaction.
Scaled residuals:
Min 1Q Median 3Q Max
−3.1228 −0.5697 −0.2090 0.3188 4.4359
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1002 0.3165
Residual 3.7646 1.9403
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.748e+00 1.727e−01 10.118
gevocab 5.851e−01 2.986e−02 19.592
senroll 5.121e−04 3.186e−04 1.607
gevocab:senroll −1.356e−04 5.379e−05 −2.520
The output from Model 3.4 has a similar interpretation to that of Model 3.3.
In this example, when school enrollment is used instead of age as the moderating
variable, the cross-level interaction between vocabulary and school enrollment is
statistically significant (t = −2.52), and its negative sign indicates that as
school size increases, the relationship between vocabulary and reading becomes weaker.
Notice that we are no longer explicitly stating in the specification that there is
a random intercept. Once a random slope is defined, the random intercept
becomes implicit so we no longer need to specify it (i.e., it is included by
default). If we do not want the random intercept while modeling the random
coefficient, we would include a −1 immediately prior to gevocab. The
random slope and intercept syntax, shown below, will generate the following
model summary:

Model3.5 <- lmer(geread ~ gevocab + (gevocab|school), data=Achieve)
summary(Model3.5)
Scaled residuals:
Min 1Q Median 3Q Max
−3.7102 −0.5674 −0.2074 0.3176 4.6775
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 0.2827 0.5317
gevocab 0.0193 0.1389 −0.86
Residual 3.6659 1.9147
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.00570 0.06109 32.83
gevocab 0.52036 0.01442 36.10
r2mlm(Model3.5)
$Decompositions
total
fixed 0.27899145
slope variation 0.01960018
mean variation 0.02049923
sigma2 0.68090914
$R2s
total
f 0.27899145
v 0.01960018
m 0.02049923
fv 0.29859163
fvm 0.31909086
The fixed effect accounted for approximately 28% of the variance in reading,
the random slope for 2%, the random intercept 2%, the fixed effect plus
random slope for 30%, and the fixed effects, random slope, and random
intercept together accounted for 32% of the variance in geread.
A model with two random slopes can be defined in much the same way
as for a single slope. As an example, suppose that our researcher is
interested in determining whether the age of the student also impacted
reading performance, and wants to allow this effect to vary from one
school to another. In lme4, random effects are allowed to be either
correlated or uncorrelated, providing excellent modeling flexibility. As an
example, refer to Models 3.6 and 3.7, each of which predict geread from
gevocab and age allowing both gevocab and age to be random across
schools. They differ however in that with Model 3.6 the random slopes are
treated as correlated with one another, whereas in Model 3.7 they are
specified as being uncorrelated. This lack of correlation in Model 3.7 is
expressed by having separate random effect terms (gevocab|school)
and (age|school), while in contrast, Model 3.6 includes both random
effects in the same term (gevocab + age|school).
Scaled residuals:
Min 1Q Median 3Q Max
−3.0655 −0.5711 −0.2080 0.3185 4.4645
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 4.180e−01 0.646532
gender 1.981e−02 0.140739 0.84
age 3.336e−05 0.005776 −0.90 −0.99
Residual 3.760e+00 1.939196
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.998839 0.420586 7.130
gevocab 0.512518 0.008377 61.179
age −0.009412 0.003875 −2.429
gender 0.025591 0.040303 0.635
Scaled residuals:
Min 1Q Median 3Q Max
−3.6937 −0.5681 −0.2081 0.3182 4.6743
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 2.042e−01 0.4519018
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.973609 0.414540 7.173
gevocab 0.519189 0.014397 36.063
age −0.008955 0.003798 -2.358
Notice the difference in how random effects are expressed in lmer between
Models 3.6 and 3.7. With correlated random effects (Model 3.6), R reports
estimates for the variability of the random intercept, the variability of each
random slope, and the correlations between the random intercept and the random
slopes. The output for Model 3.7, however, reports two separate sets of
uncorrelated random effects, with an intercept estimated for each random effect
term.
effects are desired without the separate intercept estimated for each, the
following code can be used to eliminate the extra intercepts.
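One common lme4 idiom for requesting uncorrelated random slopes with a single random intercept uses 0 + terms. The book's exact call is not shown in this excerpt, so the formula below is an illustrative assumption rather than the authors' code:

```r
# Illustrative formula: uncorrelated random slopes for gevocab and age,
# each specified with 0 + so that no extra intercept is estimated.
f <- geread ~ gevocab + age + (1 | school) +
  (0 + gevocab | school) + (0 + age | school)
# With lme4 loaded and the Achieve data available, this would be fit as:
# Model <- lmer(f, data = Achieve)
length(all.vars(f))  # 4 distinct variables: geread, gevocab, age, school
```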
Scaled residuals:
Min 1Q Median 3Q Max
−3.6084 −0.5664 −0.2058 0.3087 4.5437
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09547 0.3090
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.351259 0.032014 135.917
gevocab 0.521053 0.014133 36.869
age −0.008652 0.003807 −2.272
Centering Predictors
As per the discussion in Chapter 2, it may be advantageous to center
predictors, especially when interactions are incorporated. Centering
predictors can both provide easier interpretation of interaction terms
as well as help alleviate issues of multicollinearity arising from the
inclusion of both main effects and interactions in the same model. Recall
that the centering of a variable entails the subtraction of a mean value from
each score on the variable. Centering of predictors can be accomplished
through R by the creation of new variables. For example, returning to Model
3.3, grand mean-centered gevocab and age variables can be created with
the following syntax:

Achieve$Cgevocab <- Achieve$gevocab - mean(Achieve$gevocab)
Achieve$Cage <- Achieve$age - mean(Achieve$age)
Once mean-centered versions of the predictors have been created, they can
be incorporated into the model in the same manner as before.
Scaled residuals:
Min 1Q Median 3Q Max
−3.0635 −0.5706 −0.2108 0.3191 4.4467
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09875 0.3143
Residual 3.76247 1.9397
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.332327 0.032062 135.124
Cgevocab 0.512480 0.008380 61.159
Cage −0.006777 0.003917 −1.730
Cgevocab:Cage 0.005027 0.001750 2.873
Focusing on the fixed effects of the model, there are some changes in their
values. These differences are likely due to the presence of multicollinearity
issues in the original uncentered model. The interaction is still significant
(t = 2.87); however, there is now a significant effect of vocabulary (t = 61.15),
and age is no longer a significant predictor (t = −1.73). Focusing on the
interaction, recall that when predictors are centered, the interaction can
be interpreted as the effect of one variable while holding the second variable
constant. Thus, since the sign on the interaction is positive, if we hold age
constant, vocabulary has a positive impact on reading ability.
Additional Options
Parameter Estimation Method
By default, lme4 uses restricted maximum likelihood (REML) estimation.
However, it also allows for the use of maximum likelihood (ML) estimation
instead. Model 3.8 demonstrates the syntax for fitting a multilevel model using
ML. In order to change the estimation method, the argument REML = FALSE is
added to the lmer call.
Estimation Controls
Sometimes a correctly specified model will not reach a solution (converge) using
the default settings for model convergence. Many times this problem can be
fixed by changing the default estimation controls using the lmerControl()
option. lmerControl() allows the user to change settings related to the model
estimation. Quite often, convergence issues can be fixed by changing the model
iteration limit (maxfun), or by changing the model optimizer (optimizer). In
order to specify which controls will be changed, R must be given a list of controls
and their new values. For example, control=lmerControl(optCtrl=list(maxfun=5000))
would change the maximum number of iterations to 5000, and
control=lmerControl(optimizer="nlminbwrap") would change the
optimizer from the default (nloptwrap for lmer) to the nlminbwrap optimizer.
These control options are placed in the R code in the same manner as the
choice of the estimation method (separated from the rest of the syntax
with a comma). A comprehensive list of estimation controls can be found
on the R help page ?lmerControl. See below the syntax for Model 3.5, changing
the optimizer to nlminbwrap to fix convergence issues:

Model3.5 <- lmer(geread ~ gevocab + (gevocab|school), data=Achieve, control=lmerControl(optimizer="nlminbwrap"))
anova(Model3.0,Model3.1)
refitting model(s) with ML (instead of REML)
Data: Achieve
Models:
Model3.0: geread ~ 1 + (1 | school)
Model3.1: geread ~ gevocab + (1 | school)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
Model3.0 3 46270 46292 −23132 46264
Model3.1 4 43132 43161 -21562 43124 3139.9 1 < 2.2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Referring to the AIC and BIC statistics, recall that smaller values reflect
better model fit. For Model 3.1, the AIC and BIC are 43132 and 43161,
respectively, whereas for Model 3.0, the AIC and BIC were 46270 and
46292. Given that the values for both statistics are smaller for Model 3.1,
we would conclude that it provides a better fit to the data. Since these
models are nested, we may also look at the Chi-square test for deviance.
This test yielded a statistically significant Chi-square (χ2 = 3139.9, p < .001)
indicating that Model 3.1 provided a significantly better fit to the data
than does Model 3.0. Substantively, this means that we should include
the predictor variable gevocab, a conclusion that the results of the
hypothesis test also support.
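The Chi-square statistic reported by anova() can be reproduced from the printed deviance values:

```r
# Likelihood-ratio (deviance) test for Model 3.0 vs. Model 3.1 by hand,
# using the deviance values from the anova() output above.
dev_null  <- 46264  # Model 3.0 (3 estimated parameters)
dev_model <- 43124  # Model 3.1 (4 estimated parameters)
chisq <- dev_null - dev_model  # 3140 (printed as 3139.9 before rounding)
df    <- 4 - 3                 # difference in number of parameters
p     <- pchisq(chisq, df, lower.tail = FALSE)
p < 0.001  # TRUE: Model 3.1 fits significantly better
```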
Once the lmerTest package is loaded, running your multilevel model using lmer() will provide
updated summary output.
library(lmerTest)
Model3.1 <- lmer(geread~gevocab + (1|school), data=Achieve)
summary(Model3.1)
Scaled residuals:
Min 1Q Median 3Q Max
−3.0823 −0.5735 −0.2103 0.3207 4.4334
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09978 0.3159
Residual 3.76647 1.9407
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.023e+00 4.931e−02 7.582e+02 41.03 <2e−16 ***
gevocab 5.129e−01 8.373e−03 9.801e+03 61.26 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
distribution. For the standard error bootstrap, the standard deviation of the
B samples is calculated and serves as the standard error estimate. The
confidence interval for a given parameter is then estimated as:
θ̂ ± z(SE_B) (3.1)

where θ̂ is the parameter estimate, z is the critical value from the standard
normal distribution (e.g., 1.96 for a 95% interval), and SE_B is the bootstrap
standard error.
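As a small numerical sketch of Equation (3.1), with entirely hypothetical values (the point estimate loosely mirrors the gevocab slope, and the five bootstrap draws are invented for illustration):

```r
# Standard error bootstrap interval: estimate +/- z * SE_B, where SE_B
# is the standard deviation of the bootstrap estimates.
# All numbers here are hypothetical.
gamma_hat <- 0.513
boot_est  <- c(0.50, 0.52, 0.51, 0.53, 0.49)  # pretend B = 5 bootstrap draws
se_boot   <- sd(boot_est)
ci <- gamma_hat + c(-1, 1) * qnorm(0.975) * se_boot
round(ci, 3)
```

In practice B would be in the hundreds or thousands; five draws are used here only to keep the arithmetic visible.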
The normal bootstrap is very similar to the standard error approach, with
the difference being that the B samples are generated from a normal
distribution with marginal statistics (i.e., means, variances) equal to
those of the raw data, rather than through resampling. In other respects,
it is equivalent to the standard error method. Two other approaches for
calculating confidence intervals are available using confint. The first of
these applies only to the fixed effects, and is based on the assumption of
normally distributed errors. The confidence interval is then calculated as the
parameter estimate plus or minus the product of the standard normal critical
value and the estimated standard error (i.e., a Wald interval).
The final confidence interval method that we can use is called the profile
confidence interval, and is not based on any assumptions about the distribution
of the errors. Instead, we select as the bounds of the confidence interval
the two points on either side of the maximum likelihood estimate at which the
log-likelihood has dropped by 0.5·χ²_cv, where χ²_cv is the critical value of
the Chi-square distribution with 1 degree of freedom. That is, the bound of
the interval is the smallest value θ0 for which lnL(θ0) − lnL(θ̂) ≥ −0.5·χ²_cv.
If we set α = 0.05, then χ²_cv = 3.84, so that 0.5·χ²_cv = 1.92.
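These constants can be verified directly in R:

```r
# Critical value of the Chi-square distribution with 1 df at alpha = .05,
# and the corresponding profile-likelihood cutoff of half that value.
cv <- qchisq(0.95, df = 1)
round(cv, 2)      # 3.84
round(cv / 2, 2)  # 1.92
```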
These confidence intervals can be obtained using the confint function, as
below. The first call uses the Wald method, which produces intervals only for
the fixed effects, so the rows for the random effect parameters are NA.

confint(Model3.5, method=c("Wald"))
2.5% 97.5%
.sig01 NA NA
.sig02 NA NA
.sig03 NA NA
.sigma NA NA
(Intercept) 4.2799887 4.4082236
Cgevocab 0.4921036 0.5486107
confint(Model3.5, method=c("profile"))
Computing profile confidence intervals …
2.5% 97.5%
.sig01 0.2623302 0.3817036
.sig02 0.2926177 0.7139976
.sig03 0.1141790 0.1656869
.sigma 1.8884541 1.9414934
(Intercept) 4.2790205 4.4083316
Cgevocab 0.4920097 0.5490748
The last section of the output contains the confidence intervals for the fixed
and random effects of Model 3.5. The top three rows correspond to the
random intercept (.sig01), the correlation between the random intercept
and random slope (.sig02), and the random slope (.sig03). According to
these confidence intervals, since zero does not appear in the interval, there
is statistically significant variation in the intercept across schools
CI95[.262,.382], the slope for gevocab across schools CI95[.114,.166], and
there is a significant relationship between the random slope and the random
intercept CI95[.293, .713].
The bottom two rows correspond to the fixed effects. According to the
confidence intervals, the intercept (γ00) is significant CI95[4.279, 4.408] and
the slope for gevocab is significant CI95[.492, .549] each corresponding
with the t values presented in the output previously displayed in the
chapter.
Summary
In this chapter, we put the concepts that we learned in Chapter 2 to work
using R. We learned the basics of fitting level-2 models when the dependent
variable is continuous, using the lme4 package. Within this multilevel
framework, we learned how to fit the null model, as well as the random
intercept and random slope models. We also included independent vari-
ables at both levels of the data, and finally, learned how to compare the fit of
models with one another. This last point will prove particularly useful as we
engage in the process of selecting the most parsimonious model (i.e., the
simplest model that adequately explains the dependent variable).
Notes
1. These models were run using lmerControl(optCtrl = list(maxfun = 5000),
optimizer = "nlminbwrap") to fix convergence issues. The converged
models are singular due to the overfitting of unnecessary random effects.
2. This model produces a warning that the predictor variables are on very different
scales. This warning can be addressed by mean centering the predictors.
3. This model was run using lmerControl(optimizer = "nlminbwrap") to fix
convergence issues.
4
Level-3 and Higher Models
In this chapter, we will continue working with the Achieve data that was
described in Chapter 3. In our Chapter 3 examples, we included two levels of
data structure, students within schools, along with associated predictors of
reading achievement at each level. We will now add a third level of structure,
classroom, which is nested within schools. In this context, then, students are
nested within classrooms, which are in turn nested within schools.
DOI: 10.1201/b23166-4
As can be seen, the syntax for fitting a random intercepts model with three
levels is very similar to that for the random intercepts model with two
levels. In order to define a model with more than two levels, we need to
include the variables denoting the higher levels of the nesting structures,
here school (school level influence) and class (classroom level influence),
and designate the nesting structure of the levels (students within classrooms
within schools). Nested structure in lmer is defined as A/B, where A is the
higher-level data unit (e.g., school) and B is the lower-level data unit (e.g.,
classroom). The intercept (1) is denoted as a random effect by its inclusion in
the parentheses:

Model4.0 <- lmer(geread ~ 1 + (1|school/class), data=Achieve)
summary(Model4.0)
Scaled residuals:
Min 1Q Median 3Q Max
−2.3052 −0.6290 −0.2094 0.3049 3.8673
Random effects:
Groups Name Variance Std.Dev.
class:school (Intercept) 0.2727 0.5222
school (Intercept) 0.3118 0.5584
Residual 4.8470 2.2016
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.30806 0.05499 78.34
confint(Model4.0, method=c("profile"))1
2.5 % 97.5 %
.sig01 0.4516596 0.5958275
.sig02 0.4658891 0.6561674
.sigma 2.1710519 2.2328459
(Intercept) 4.1997929 4.4160128
Model4.1 <- lmer(geread ~ gevocab + clenroll + cenroll + (1|school/class), data=Achieve)
summary(Model4.1)

Scaled residuals:
Min 1Q Median 3Q Max
−3.2212 −0.5673 −0.2079 0.3184 4.4736
Random effects:
Groups Name Variance Std.Dev.
class:school (Intercept) 0.09047 0.3008
school (Intercept) 0.07652 0.2766
Residual 3.69790 1.9230
Number of obs: 10320, groups: class:school, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.675e+00 2.081e−01 8.050
gevocab 5.076e−01 8.427e−03 60.233
clenroll 1.899e−02 9.559e−03 1.986
cenroll −3.721e-06 3.642e−06 −1.022
We can see from the output for Model 4.1 that the student’s vocabulary
score (t = 60.23) and size of the classroom (t = 1.99) are statistically
significantly positive predictors of student reading achievement score, but
the size of the school (t = −1.02) does not significantly predict reading
achievement. As a side note, the significant positive relationship between
the size of the classroom and reading achievement might seem to be a bit
confusing, suggesting that students in larger classrooms had higher
reading achievement test scores. However, in this particular case, larger
confint(Model4.1, method=c("profile"))1
2.5% 97.5%
.sig01 2.312257e−01 3.660844e−01
.sig02 2.055494e−01 3.398754e−01
.sigma 1.896242e+00 1.950182e+00
(Intercept) 1.268352e+00 2.082537e+00
gevocab 4.910305e−01 5.244195e−01
clenroll 2.564126e−04 3.768655e−02
cenroll −1.084902e−05 3.394918e−06
In terms of the fixed effects, the 95% confidence intervals demonstrate that
vocabulary score and class size are statistically significant predictors of reading
score, but school size is not. In addition, we see that although the variation in
random intercepts for schools and classrooms nested in schools declined with the
inclusion of the fixed effects, we would still conclude that the random intercept
terms are different from 0 in the population, indicating that mean reading scores
differ across schools, and across classrooms nested within schools.
The R2 value for Model 4.1 can be calculated as follows:
R^2_1 = 1 − (σ^2_M1 + τ^2_M1)/(σ^2_M0 + τ^2_M0) = 1 − (3.6979 + .09047)/(4.847 + .2727) = 1 − 3.78837/5.1197 = 1 − .73996 = .26
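This calculation can be checked directly in R:

```r
# Level-1 R^2 for Model 4.1 by hand, using the variance components
# printed above for the null model (Model 4.0) and for Model 4.1.
sigma2_M0 <- 4.847;  tau2_M0 <- 0.2727    # Model 4.0 (class:school intercept)
sigma2_M1 <- 3.6979; tau2_M1 <- 0.09047   # Model 4.1 (class:school intercept)
R2 <- 1 - (sigma2_M1 + tau2_M1) / (sigma2_M0 + tau2_M0)
round(R2, 2)  # 0.26
```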
From this value, we see that the inclusion of the classroom and school
enrollment variables, along with the student’s vocabulary score, results in a
model that explains approximately 26% of the variance in the reading score,
above and beyond the null model. This can be run in R using the
r.squaredLR() function from the MuMIn package.
r.squaredLR(Model4.1, null=Model4.0)
[1] 0.2562031
attr(,"adj.r.squared")
[1] 0.2591666
anova(Model4.0, Model4.1)
Data: Achieve
Models:
Model4.0: geread ~ 1 + (1 | school/class)
Model4.1: geread ~ gevocab + clenroll + cenroll + (1 | school/class)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
Model4.0 4 46150 46179 −23071 46142
Model4.1 7 43101 43152 −21544 43087 3054.6 3 < 2.2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the original null model, AIC and BIC were 46150 and 46179,
respectively, which are both larger than the AIC and BIC for Model 4.1.
Therefore, we would conclude that this latter model including a single
predictor variable at each level provides a better fit to the data, and thus
is preferable to the null model with no predictors. We could also look at
the Chi-square deviance test since these are nested models. The
Chi-square test is statistically significant, indicating Model 4.1 is a better
fit for the data.
Using lmer, it is very easy to include both single-level and cross-level
interactions in the model once the higher-level structure is understood. For
example, we may have a hypothesis stating that the impact of vocabulary
scores on reading achievement varies depending upon the size of the school
that a student attends. In order to test this hypothesis, we will need to
include the interaction between vocabulary score and the size of the school,
as is done in Model 4.2 below.
summary(Model4.2)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + clenroll + cenroll + gevocab *
cenroll + (1 | school/class)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
−3.1902 −0.5683 −0.2061 0.3183 4.4724
Random effects:
Groups Name Variance Std.Dev.
class:school (Intercept) 0.08856 0.2976
school (Intercept) 0.07513 0.2741
Residual 3.69816 1.9231
Number of obs: 10320, groups: class:school, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.752e+00 2.100e−01 8.341
gevocab 4.900e−01 1.168e−02 41.940
clenroll 1.880e−02 9.512e−03 1.977
cenroll −1.316e−05 5.628e−06 −2.337
gevocab:cenroll 2.340e−06 1.069e−06 2.190
In this example, we can see that, other than the inclusion of the higher
level nesting structure in the random effects line, defining a cross-level
interaction in a model with more than two levels is no different than was
the case for the level-2 models that we worked with in Chapter 3. In terms
of hypothesis testing results, we find that student vocabulary (t = 41.94)
and classroom size (t = 1.98) remain statistically significant positive
predictors of reading ability. In addition, both the cross-level interaction
between vocabulary and school size (t = 2.19) and the impact of school
size alone (t = −2.34) are statistically significant predictors of the reading
score. The statistically significant interaction term indicates that the
impact of student vocabulary score on reading achievement is dependent
to some degree on the size of the school, so that the main effects for school
and vocabulary cannot be interpreted in isolation, but must be considered
in light of one another. The interested reader is referred to Aiken and
West (1991) for more detail regarding the interpretation of interactions in
regression.
r.squaredLR(Model4.2, null=Model4.0)
[1] 0.2565497
attr(,"adj.r.squared")
[1] 0.2595171
anova(Model4.1, Model4.2)
refitting model(s) with ML (instead of REML)
Data: Achieve
Models:
Model4.1: geread ~ gevocab + clenroll + cenroll + (1 | school/class)
Model4.2: geread ~ gevocab + clenroll + cenroll + gevocab * cenroll + (1 | school/class)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
Model4.1 7 43101 43152 −21544 43087
Model4.2 8 43099 43157 −21541 43083 4.8105 1 0.02829 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
levels. The lmer function in R can be used to fit such higher-level models in
much the same way as we have seen up to this point. As a simple example
of such higher-order models, we will again fit a null model predicting
reading achievement, this time incorporating four levels of data: students
nested within classrooms nested within schools nested within school
corporations (sometimes termed districts). In order to represent the three
higher levels of influence, the random effects will now be represented as
(1|corp/school/class) in Model 4.3, below. In addition to fitting the
model and obtaining a summary of results, we will also request boot-
strapped 95% confidence intervals for the model parameters using the
confint() function.
In order to ensure that the dataset is being read by R as one thinks it should
be, we can first examine the summary of the sample sizes for the different data
levels, which appears toward the bottom of the printout. There were 10,320
students nested within 568 classrooms, which were nested within 160 schools,
which were in turn nested within 59 school corporations. This matches what we
know about the data;
therefore, we can proceed with the interpretation of the results. Given
that this is a null model with no fixed predictors, our primary focus is on the
intercept estimates for the random effects and their associated confidence
intervals. We can see from the confidence interval results below that each
level of the data yielded intercepts that were statistically significantly
different from 0 (given that 0 does not appear in any of the confidence
intervals), indicating that mean reading achievement scores differed among
the classrooms, the schools, and the school corporations.
Model4.3 <- lmer(geread ~ 1 + (1|corp/school/class), data=Achieve)
summary(Model4.3)

Scaled residuals:
Min 1Q Median 3Q Max
−2.2995 −0.6305 −0.2131 0.3029 3.9448
Random effects:
Groups Name Variance Std.Dev.
class:(school:corp) (Intercept) 0.27539 0.5248
school:corp (Intercept) 0.08748 0.2958
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.32583 0.07198 60.1
confint(Model4.3, method=c("profile"))
2.5 % 97.5 %
.sig01 0.4544043 0.5983855
.sig02 0.1675851 0.4083221
.sig03 0.3147483 0.5430904
.sigma 2.1710534 2.2328418
(Intercept) 4.1838496 4.4679558
Scaled residuals:
Min 1Q Median 3Q Max
−3.2043 −0.5680 −0.2069 0.3171 4.4455
Random effects:
Groups Name Variance Std.Dev. Corr
class:school (Intercept) 0.148959 0.38595
             gender      0.019818 0.14078 −0.62
school       (Intercept) 0.033157 0.18209
             gender      0.006803 0.08248  0.58
Residual 3.692434 1.92157
Number of obs: 10320, groups: class:school, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.015564 0.075629 26.651
gevocab 0.509097 0.008408 60.550
gender 0.017237 0.039243 0.439
Scaled residuals:
Min 1Q Median 3Q Max
−3.2117 −0.5677 −0.2072 0.3183 4.4324
Random effects:
Groups Name Variance Std.Dev. Corr
school_class (Intercept) 0.13132 0.3624
gender 0.02727 0.1651 −0.56
school (Intercept) 0.07492 0.2737
Residual 3.69221 1.9215
Number of obs: 10320, groups: school_class, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.012884 0.077265 26.052
gevocab 0.509047 0.008415 60.496
gender 0.019057 0.038806 0.491
from 1 to 8 for school 2. When the level 1 and level 2 identifiers were tied
together in our original syntax, having nonunique codes for each classroom
was not an issue as school 1 classroom 6 would be different from school 2
classroom 6. However, when we treat random effects more flexibly and split
them into different statements for each level, the level 1 identifier loses its
context. In order to obtain correct results, a unique identifier must be used
for both levels. In R, a quick way to do this is to use the paste() function.
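For example, pasting the school and classroom codes together yields identifiers that are unique across schools (the vectors here are hypothetical stand-ins for the data frame columns):

```r
# Building a unique classroom identifier with paste(): classroom 6 in
# school 1 becomes "1_6", distinct from classroom 6 in school 2 ("2_6").
school <- c(1, 1, 2, 2)
class  <- c(6, 7, 6, 7)
school_class <- paste(school, class, sep = "_")
school_class  # "1_6" "1_7" "2_6" "2_7"
```

In a data frame, the same idea would be applied column-wise, e.g. Achieve$school_class <- paste(Achieve$school, Achieve$class, sep="_").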
Summary
Chapter 4 is very much an extension of Chapter 3, demonstrating the use of
R in fitting level-2 models to include data structure at three or more levels.
In practice, such complex multilevel data are relatively rare. However, as
we saw in this chapter, faced with this type of data, we can use lmer to
model it appropriately. Indeed, the basic framework that we employed in
the level-2 case works equally well for the more complex data featured in
this chapter. If you have read the first four chapters, you should now feel
fairly comfortable analyzing the most common multilevel models with a
continuous outcome variable. We will next turn our attention to the
application of multilevel models to longitudinal data. Of key importance
as we change directions a bit is that the core ideas that we have already
learned, including fitting of the null, random intercept, and random
coefficients models, as well as the inclusion of predictors at different levels
of the data, do not change when we have longitudinal data. As we will see,
the application of multilevel models in this context is no different from that
Notes
1. This is the final section of code using the confint() function to bootstrap
confidence intervals. For the full R code and output, see Chapter 3.
2. This model did not converge without changing the optimizer to nlminbwrap
using the estimation controls.
5
Longitudinal Data Analysis Using
Multilevel Models
DOI: 10.1201/b23166-5
π1i = β10 + r1i
π2i = β20 + r2i
where Yit is the outcome variable for individual i at time t; πit are the level-1
regression coefficients; βit are the level-2 regression coefficients; εit is
the level-1 error, rit are the level-2 random effects; Tit is a dedicated time
predictor variable; Xit is a time-varying predictor variable; and Zi is a time-
invariant predictor. Thus, as can be seen in (5.1), although there is a new
notation to define specific longitudinal elements, the basic framework
for the multilevel model is essentially the same as we saw for the level-2
model in Chapter 3. The primary difference is that now we have three
different types of predictors: a time predictor, time-varying predictors,
and time-invariant predictors. Given their unique role in longitudinal
modeling, it is worth spending just a bit of time defining each of these
predictor types.
Of the three types of predictors that are possible in longitudinal models,
a dedicated time variable is the only one that is necessary in order to make
the multilevel model longitudinal. This time predictor, which is literally
an index of the time point at which a particular measurement was made,
can be very flexible with the time measured in fixed intervals or in waves.
If time is measured in waves, they can be waves of varying length from
person to person, or they can be measured on a continuum. It is important
to note that when working with time as a variable, it is often worthwhile
to rescale the time variable so that the first measurement occasion is the
zero point, thereby giving the intercept the interpretation of baseline or
initial status on the dependent variable.
The other two types of predictors (time varying and time invariant) differ
in terms of how they are measured. A predictor is time varying when it is
measured at multiple points in time, just as is the outcome variable. In the
context of education, a time-varying predictor might be the number of
hours in the previous 30 days a student has spent studying. This value
could be recorded concurrently with the student taking the achievement test
serving as the outcome variable. On the other hand, a predictor is time
invariant when it is measured at only one point in time, and its value does
not change across measurement occasions. An example of this type of
predictor would be gender, which might be recorded at the baseline
measurement occasion, and which is unlikely to change over the course
of the data collection period. In the context of applying multilevel models to
longitudinal data problems, time-varying predictors will appear at level 1
because they are associated with specific measurements, whereas time-
invariant predictors will appear at level 2 or higher, because they are
associated with the individual (or a higher data level) across all measurement
conditions.
library(reshape2)
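The melt() call that this description refers to would look something like the following; the assumption here is that the person-level data frame is named Lang, with ID and Grammar as the columns that should not be stacked:

```r
# Stack all columns not listed in id.vars into a single "value" column,
# with a "variable" column indicating the measurement occasion; the
# column names ID and Grammar are assumptions
LangPP <- melt(Lang, id.vars = c("ID", "Grammar"))
```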
This code will stack all variables not identified by the id.vars vector. The
id.vars vector can be used to identify subject ID as well as any time-
invariant covariates that should not be stacked.
LangPP <- data.frame(ID=Lang$ID, Grammar=Lang$Grammar,
stack(Lang, select=LangScore1:LangScore6))
(The assignment to LangPP and the ID column here are inferred from the
model output that appears later in the chapter.)
Each of these R options takes the time-invariant variables directly from the
raw person-level data, while also consolidating the repeated measurements
into a single variable and creating a dedicated time indicator. Although the
melt() function is much simpler, the stack() function is necessary
in more complicated situations (for example if multiple variables need to be
stacked with multiple different indicators of time).
At this point, we may wish to do some recoding and renaming of variables.
Renaming of variables can be accomplished via the names function, and
recoding can be done via recode(var, recodes, as.factor.result,
as.numeric=TRUE, levels).
For instance, we could rename the values variable to Language. The
values variable is the sixth column, so in order to rename it, we would
use the following R code:
names(LangPP)[c(6)] <- c("Language")
We may also wish to recode the dedicated time variable, ind. Currently,
this variable is not
recorded numerically, but takes on the values “LangScore1”, “LangScore2”,
“LangScore3”, “LangScore4”, “LangScore5”, and “LangScore6”. Thus, we may
wish to recode the values to make a continuous numeric time predictor, as
follows:
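A sketch of this recoding, using the recode() function described above, is given below; the use of the car package and the 0–5 scaling (so that the intercept reflects baseline status, as discussed earlier) are assumptions:

```r
library(car)

# Map the factor levels "LangScore1"..."LangScore6" to numeric time
# values 0-5; as.numeric=TRUE returns a numeric time predictor
LangPP$Time <- recode(LangPP$ind,
    "'LangScore1'=0; 'LangScore2'=1; 'LangScore3'=2;
     'LangScore4'=3; 'LangScore5'=4; 'LangScore6'=5",
    as.numeric = TRUE)
</imports>
```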
Using the data restructured in the previous section, we would use the
following syntax for
a longitudinal random intercept model predicting Language over time
using the lme4 package:
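The fitting call would look something like the following (loading of the lme4 package is assumed):

```r
library(lme4)

# Random intercept model: Language predicted by Time, with random
# intercepts for individuals (ID) at level 2
Model5.0 <- lmer(Language ~ Time + (1 | ID), data = LangPP)
```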
summary(Model5.0)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + (1 | ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
−6.4040 −0.4986 0.0388 0.5669 4.8636
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 232.14 15.236
Residual 56.65 7.526
Number of obs: 18228, groups: ID, 3038
Fixed effects:
Estimate Std. Error t value
(Intercept) 197.21573 0.29356 671.80
Time 3.24619 0.03264 99.45
increased over time. Also, the confidence intervals for the fixed and random
effects all exclude 0, indicating that they are different from 0 in the
population, i.e., they are statistically significant.
Adding predictors to the model also remains the same as in earlier
examples, regardless of whether they are time varying or time invariant. For
example, in order to add Grammar, which is time varying, as a predictor of
total language scores over time we would use the following:
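Based on the formula shown in the output that follows, the call would be along these lines:

```r
# Add the time-varying Grammar predictor at level 1
Model5.1 <- lmer(Language ~ Time + Grammar + (1 | ID), data = LangPP)
```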
summary(Model5.1)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + Grammar + (1 | ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
−6.6286 −0.5260 0.0374 0.5788 4.6761
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 36.13 6.011
Residual 56.64 7.526
Number of obs: 18216, groups: ID, 3036
Fixed effects:
Estimate Std. Error t value
(Intercept) 73.703551 1.091368 67.53
Time 3.245483 0.032651 99.40
Grammar 0.630888 0.005523 114.23
From these results, we see that again, Time is positively related to scores on
the language assessment, indicating that they increased over time. In
addition, Grammar is also statistically significantly related to language
test scores (t = 114.23), meaning that measurement occasions with higher
grammar test scores also had higher language scores.
If we wanted to allow the growth rate to vary randomly across
individuals, we would use the following R command.
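Based on the formula shown in the output that follows, the call would be along these lines:

```r
# Allow both the intercept and the Time slope to vary randomly
# across individuals
Model5.2 <- lmer(Language ~ Time + Grammar + (Time | ID), data = LangPP)
```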
summary(Model5.2)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + Grammar + (Time | ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
−5.3163 −0.5240 −0.0012 0.5363 5.0321
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 21.577 4.645
Time 3.215 1.793 0.03
Residual 45.395 6.738
Number of obs: 18216, groups: ID, 3036
Fixed effects:
Estimate Std. Error t value
(Intercept) 54.470619 1.002595 54.33
Time 3.245483 0.043740 74.20
Grammar 0.729119 0.005083 143.46
confint(Model5.2, method=c("profile")) 1
2.5% 97.5%
.sig01 4.36967084 4.9175027
.sig02 -0.06472334 0.1250178
.sig03 1.70949317 1.8765765
.sigma 6.65369657 6.8231795
(Intercept) 51.93115382 57.0105666
Time 3.15974007 3.3312255
Grammar 0.71620566 0.7420327
In this model, the random effect for Time is assessing the extent to which
growth over time differs from one person to the next. Results show that the
random effect for Time is statistically significant, given that the 95%
confidence interval does not include 0. Thus, we can conclude that growth
rates in language scores over the six time points do differ across individuals
in the sample.
We could add a third level of data structure to this model by including
information regarding schools, within which examinees are nested. To fit
this model, we use the following code in R:
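Based on the formula shown in the output that follows, the call would be along these lines:

```r
# Individuals (ID) nested within schools: (1 | school/ID) requests
# random intercepts at both the school and the person level
Model5.3 <- lmer(Language ~ Time + (1 | school/ID), data = LangPP)
```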
summary(Model5.3)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + (1 | school/ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
−6.4590 −0.5026 0.0400 0.5658 4.8580
Random effects:
Groups Name Variance Std.Dev.
ID:school (Intercept) 187.18 13.681
school (Intercept) 69.11 8.313
Residual 56.65 7.526
Number of obs: 18228, groups: ID:school, 3038; school, 35
Fixed effects:
Estimate Std. Error t value
(Intercept) 197.33787 1.48044 133.30
Time 3.24619 0.03264 99.45
Using the anova() command, we can compare fit of level-3 and level-2
versions of this model.
anova(Model5.0, Model5.3)
refitting model(s) with ML (instead of REML)
Data: LangPP
Models:
Model5.0: Language ~ Time + (1 | id)
Model5.3: Language ~ Time + (1 | school/id)
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Given that the AIC for Model 5.3 is lower than that for Model5.0 where
school is not included as a variable, and the Chi-square test for deviance is
significant (χ2 = 521.99, p < 0.001), we can conclude that inclusion of the
school level of the data leads to a better model fit.
data levels, thereby allowing for more complex data structures. In the
context of longitudinal data, this means that it is possible to incorporate
measurement occasion (level 1), individual (level 2), and cluster (level 3)
characteristics. We saw an example of this type of analysis in Model 5.3.
On the other hand, in the context of repeated measures ANOVA or
MANOVA, incorporating these various levels of data would be much
more difficult. Thus, the use of multilevel modeling in this context not only
has the benefits listed above pertaining specifically to longitudinal analysis
but it also brings the added capability of simultaneous analysis of multiple
levels of influence.
Summary
In this chapter, we saw that the multilevel modeling tools we studied
together in Chapters 2–4 can be applied in the context of longitudinal data.
The key to this analysis is the treatment of each measurement in time as a
level-1 data point, and the individuals on whom the measurements are
made as level 2. Once this shift in thinking is made, the methodology
remains very much the same as what we employed in the standard
multilevel models in Chapters 3 and 4. By modeling longitudinal data in
this way, we are able to incorporate a wide range of data structures,
including individuals (level 2) nested within a higher level of data (level 3).
Notes
1. This is the final section of code using the confint() function to bootstrap confidence
intervals. For full R code and output, see Chapter 3.
2. If the user is reading SPSS data into R Studio using the import data interface, the
stack() function will not work as the data attributes are incompatible. Instead,
read data into R manually using the read.spss() function from the foreign
package.
6
Graphing Data in Multilevel Contexts
Graphing data is an important step in the analysis process. Far too often
researchers skip the graphing of their data and move directly into analysis
without the insights that can come from first giving the data a careful
visual examination. It is certainly tempting for researchers to bypass data
exploration through graphical analysis and move directly into formal
statistical modeling, given that the models are generally the tools used to
answer research questions. However, if proper attention is not paid to the
graphing of data, the formal statistical analyses may be poorly informed
regarding the distribution of variables, and relationships among them. For
example, a model allowing only a linear relationship between a predictor
and a criterion variable would be inappropriate if there is actually a
nonlinear relationship between the two variables. Using graphical tools
first, it would be possible for the researcher in such a case to see the
nonlinearities, and appropriately account for them in the model. Perhaps
one of the most eye-opening examples of the dangers in not plotting data
can be found in Anscombe (1973). In this classic paper, Anscombe shows
the results of four regression models that are essentially equivalent in
terms of the means and standard deviations of the predictor and criterion
variable, with the same correlation between the regressor and outcome
variables in each dataset. However, plots of the data reveal drastically
different relationships among the variables. Figure 6.1 shows these four
datasets and the regression equation and the squared multiple correlation
for each. First, note that the regression coefficients are identical across the
models, as are the squared multiple correlation coefficients. However, the
actual relationships between the independent and dependent variables
are drastically different! Clearly, these data do not come from the same
data-generating process. Thus, modeling the four situations in the same
fashion would lead to mistaken conclusions regarding the nature of
the relationships in the population. The moral of the story here is clear:
plot your data!
The plotting capabilities in R are truly outstanding. It is capable of
producing high-quality graphics with a great deal of flexibility. As a simple
example, consider the Anscombe data from Figure 6.1. These data are
actually included with R and can be loaded into a session with the following
command:
data(anscombe)
DOI: 10.1201/b23166-6
[Figure 6.1: four scatterplots of the Anscombe datasets; each panel shows
the fitted line Ŷ = 3 + 0.5X and R² = 0.67.]
FIGURE 6.1
Plot of the Anscombe (1973) data illustrates that the same set of summary statistics does not
necessarily reveal the same type of information.
anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
The way in which one can plot the data for the first dataset (i.e., x1 and y1
above) is as follows:
plot(anscombe$y1 ~ anscombe$x1)
Notice here that $y1 extracts the column labeled y1 from the data frame, as
$x1 extracts the variable x1. The ~ symbol in the function call treats the
variable on its left as the dependent variable, plotted on the ordinate
(y-axis), whereas the variable on its right is treated as the independent
variable, plotted on the abscissa (x-axis). Alternatively, one can rearrange
the terms so that the independent variable comes first, with a comma
separating it from the dependent variable:
plot(anscombe$x1, anscombe$y1)
Both the former and the latter approaches lead to the same plot.
There are many options that can be called upon within the plotting
framework. The plot function has six base parameters, with the option of
calling other graphical parameters, including those of the par function.
The par function has more than 70 graphical parameters that can be used to
modify a basic plot. Discussing all of the available plotting parameters is
beyond the scope of this chapter, however. Rather, we will discuss some of
the most important parameters to consider when plotting in R.
The parameters ylim and xlim modify the starting and ending points for
the y-and x-axes, respectively. For example, ylim=c(2, 12) will produce a
plot with the y-axis scaled from 2 to 12. R typically automates this
process, but it can be useful for the researcher to tweak this setting such as by
setting the same axis across multiple plots. The ylab and xlab parameters
are used to create labels for the y- and x-axes, respectively. For example,
ylab="Dependent Variable" will produce a plot with the y-axis labeled
“Dependent Variable”. The main parameter is used for the main title of a plot.
Thus, main="Plot of Observed Values" would produce a plot with the
title “Plot of Observed Values” above the graph. There is also a sub
parameter that provides a subtitle at the bottom of a plot, centered below
the xlab label. For example, sub="Data from Study 1" would produce such a
subtitle. In some situations, it is useful to include text, equations, or a
combination in a plot. Text is quite easy to include using the text
function. For example, text(5, 8, "Include This Text")
would place the text “Include This Text”, in the plot centered at x=5 and y=8.
Equations can also be included in graphs. Doing so requires the use of
expression within the text function call. The function expression allows
for the inclusion of an unevaluated expression (i.e., it displays what is written
out). The particular syntax for various mathematical expressions is available via
calling help for the plotmath function (i.e., ?plotmath). R provides a
demonstration of the plotmath functionality via demo(plotmath), which
The values of 5.9 (on the x-axis) and 9.35 (on the y-axis) are simply where
we thought the text looked best and can be easily adjusted to suit the user’s
preferences.
Combining actual text and mathematical expressions requires using the
paste function in conjunction with the expression function. For
example, if we wanted to add “The Value of R2 =0.67”, we would replace
the previous text syntax with the following:
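A sketch of the replacement call, consistent with the description that follows (the coordinates here are illustrative assumptions), is:

```r
# Combine literal text with a mathematical expression; sep="" avoids
# extra spaces between the pasted pieces
text(5.9, 9.35,
     expression(paste("The value of ", italic(R)^2, " is .67",
                      sep = "")))
```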
Here, paste is used to bind together the text, which is contained within the
quotes, and the mathematical expression. Note that sep="" is used so that
there are no spaces added between the parts that are pasted together.
Although we did not include it in our figure, the model-implied regression
line can easily be included in a scatterplot. One way to do this is through
the abline function, specifying the intercept (denoted a) and the slope
(denoted b). Thus, abline(a=3, b=0.5) would add a regression line to
the plot that has an intercept at 3 and a slope of 0.5. Alternatively, to
automate things a bit, using abline(lm.object) will extract the intercept
and slope and include the regression line in our scatterplot.
Finally, notice in Figure 6.2 that we have broken the y-axis to make very
clear that the plot does not start at the origin. Whether or not this is needed
may be debatable; note that base R does not include this option, but we
prefer to include them in many situations. Such a broken axis can be
identified with the axis.break function from the plotrix package. The
zigzag break is requested via the style="zigzag" option in axis.break,
along with the particular axis (1 for the x-axis and 2 for the y-axis). By
default, R will scale the axes to a range that is generally appropriate.
However, when the origin is not shown, there is no break in the axis by
default, even though some have argued such a break is important.
Now, let us combine the various pieces of information that we have
discussed in order to yield Figure 6.2, which was generated with the
following syntax.
data(anscombe)
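A sketch of syntax that would produce such a figure follows; the exact axis limits and text coordinates are illustrative assumptions:

```r
library(plotrix)  # provides axis.break()

# Scatterplot of the first Anscombe dataset with labeled axes
plot(anscombe$y1 ~ anscombe$x1,
     ylab = "Y", xlab = "X", ylim = c(4, 11))

# Regression line of best fit for these data
abline(a = 3, b = 0.5)

# Zigzag break on the y-axis to show the plot does not start at the origin
axis.break(axis = 2, style = "zigzag")

# Annotate the plot with the regression equation and R-squared
text(5.9, 10.15,
     expression(italic(hat(Y)) == 3 + italic(X) * .5))
text(5.9, 9.35,
     expression(paste("The value of ", italic(R)^2, " is .67",
                      sep = "")))
```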
FIGURE 6.2
Plot of Anscombe’s Data Set 1, with the regression line of best fit included, axis breaks, and text
denoting the regression line and value of the squared multiple correlation coefficient.
text(5.9, 10.15,
expression(italic(hat(Y))==3+italic(X)*.5))
FIGURE 6.3
Pairs plot of the Cassady GPA data. The plot shows the bivariate scatterplot for each of the
three variables.
pairs(
cbind(GPA=Cassady$GPA, CTA_Total=Cassady$CTA.tot,
BS_Total=Cassady$BStotal))
Given that our data contains p variables, there will be p*(p−1)/2 unique
scatterplots (here 3). The plots below the “principal diagonal” are the same
as those above the principal diagonal, with the only difference being the x
and y axes are reversed. Such a pairs plot allows multiple bivariate
relationships to be visualized simultaneously. Of course, one can quantify
the degree of linear relation with a correlation. Code to do this can be given
as follows, using listwise deletion:
cor(na.omit(
cbind(GPA=Cassady$GPA, CTA_Total=Cassady$CTA.tot,
BS_Total=Cassady$BStotal)))
There are other options available for dealing with missing data (see ?cor),
but we have used the na.omit function wrapped around the cbind
function so as to obtain a listwise deletion dataset in which the following
correlation matrix is computed.
hist(resid.1,
freq=FALSE, main="Density of Residuals for Model 1",
xlab="Residuals")
lines(density(resid.1))
FIGURE 6.4
Histogram with overlaid density curve of the residuals from the GPA model from Chapter 1.
This code first requests that a histogram be produced for the residuals. Note
that freq=FALSE is used, which instructs R to make the y-axis scaled in
terms of probability, not the default, which is frequency. The solid line
represents the density estimate of the residuals, corresponding closely to the
bars in the histogram.
Rather than a histogram and overlaid density, a more straightforward
way of comparing the distribution of the errors to the normal distribution is
a quantile-quantile plot (QQ plot). This plot includes a straight line
reflecting the expected distribution of the data if it is normal. In addition,
the individual data points are represented as dots in the figure. When the
data follow a normal distribution, the dots will fall along the straight line, or
very close to it. The following code will produce the QQ plot that appears in
Figure 6.5, based on the residuals from the GPA model.
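Following the pattern used later in the chapter, the code would be along these lines (resid.1 holds the residuals from the GPA model, as above):

```r
# Standardize the residuals, then compare their quantiles to those of
# the standard normal distribution
qqnorm(scale(resid.1))
qqline(scale(resid.1))
```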
Notice that above we use the scale function, which standardizes the
residuals to have a mean of zero (already done due to the regression model)
and a standard deviation of 1.
We can see in Figure 6.5 that points in the QQ plot diverge from the line
for the higher end of the distribution. This is consistent with the histogram
in Figure 6.4 that shows a shorter tail on the high end as compared to the
FIGURE 6.5
Plot comparing the observed quantiles of the residuals to the theoretical quantiles from the
standard normal distribution. When points fall along the line (which has slope 1 and goes
through the origin), then the sample quantiles follow a normal distribution.
low end of the distribution. This plot, along with the histogram, reveals that
the model residuals do diverge from a perfectly normal distribution. Of
course, the degree of non-normality can be quantified (e.g., with skewness
and kurtosis measures), but for this chapter we are most interested in
visualizing the data, and especially in looking for gross violations of
assumptions. We do need to note here that there is a grey-area between
an assumption being satisfied, and there being a “gross violation”. At this
point in the discussion, we leave the interpretation to the reader and
provide information on how such visualizations can be made.
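To isolate a single school, we can subset the data along the following lines; the Achieve data frame name follows its later usage in this chapter, and the exact form of the command is a sketch:

```r
# Keep only the rows where corp equals 940 and school equals 767
Achieve.940.767 <- Achieve[Achieve$corp == 940 &
                           Achieve$school == 767, ]
```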
This R command literally identifies the rows in which the corp is equal to
940 (equality checks require two equal signs) and in which the school is
equal to 767. Once we have this individual school isolated, we can visualize
the geread variable using the dotplot function.
dotplot(
class ~ geread,
data=Achieve.940.767, jitter.y = TRUE, ylab="Classroom",
main="Dotplot of \'geread\' for Classrooms in School 767,
Which is Within Corporation 940")
The class ~ geread part of the code instructs the function to plot geread for
each of the classrooms. The jitter.y parameter is used, which will jitter,
or slightly shift around, overlapping data points in the graph. For example,
if multiple students in the same classroom have the same score for geread,
using jitter will shift those points on the y-axis to make clear that there are
multiple values at the same x value. Finally, as before, labels can be
specified. Note that the use of \'geread\' in the main title is to put
geread in single quotes in the title. Calling the dotplot function using
these R commands yields Figure 6.6.
This dotplot shows the dispersion of geread for the classrooms in
school 767. From this plot, we can see that students in each of the four
classrooms had generally similar reading achievement scores. However, it
is also clear that classrooms 2 and 4 each have multiple students with
outlying scores that are higher than those of other individuals within the
school. We hope that it is clear how a researcher could make use of this
type of plot when examining the distributions of scores for individuals at
a lower level of data (e.g., students) nested within a higher level (e.g.,
classrooms or schools).
Because the classrooms within a school are arbitrarily numbered, we can
alter the order in which they appear in the graph in order to make the
display more meaningful. Note that if the order of the classrooms had not
been arbitrary in nature (e.g., Honors classes were numbered 1 and 2),
then we would need to be very careful about changing the ordering.
However, in this case, no such concerns are necessary. In particular,
FIGURE 6.6
Dotplot of classrooms in school 767 (within corporation 940) for geread.
the use of the reorder function on the left-hand side of the ~ symbol will
reorder the classrooms in ascending order of the variable of interest (here,
geread) in terms of the mean. Thus, we can modify Figure 6.6 in order to
have the classes placed in ascending order by the mean of geread.
dotplot(
reorder(class, geread) ~ geread,
data=Achieve.940.767, jitter.y = TRUE, ylab="Classroom",
main="Dotplot of \'geread\' for Classrooms in School 767,
Which is Within Corporation 940")
FIGURE 6.7
Dotplot of classrooms in school 767 (within corporation 940) for geread, with the classrooms
ordered by the mean (lowest to highest).
FIGURE 6.8
Dotplot of students in corporations, with the corporations ordered by the mean (lowest to
highest).
mark them as unique across the entire dataset. Therefore, if we would like to
focus on achievement at the classroom level, we must first create a unique
classroom number. We can use the following R code in order to create such a
unique identifier, which augments the Achieve data with a new column of
unique classroom identifiers, which we call Classroom_Unique.
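A sketch of this step is given below; the use of paste() to build the identifier is an assumption, but any construction that combines the three nesting levels would serve:

```r
# Combine corporation, school, and classroom numbers into a single
# identifier that is unique across the entire dataset
Achieve$Classroom_Unique <- paste(Achieve$corp, Achieve$school,
                                  Achieve$class, sep = "_")
```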
After forming this unique identifier for the classrooms, we will then
aggregate the data within the classrooms in order to obtain the mean of
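The aggregation step just described might be sketched as follows; the use of aggregate() on the geread variable is an assumption:

```r
# Mean geread for each unique classroom, retaining the corporation
# identifier for later plotting
Classroom.Means <- aggregate(geread ~ Classroom_Unique + corp,
                             data = Achieve, FUN = mean)
```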
Of course, we still know the nesting structure of the classrooms within the
schools and the schools within the corporations. We are aggregating here
for purposes of plotting, but not modeling the data. We want to remind
readers of the potential dangers of aggregation bias discussed earlier in this
chapter. With this caveat in mind, consider Figure 6.9, which shows that
within the corporations, classrooms do vary in terms of their mean level of
achievement (i.e., the within line/corporation spread) as well as between
corporations (i.e., the change in the lines).
We can also use dotplots to gain insights into reading performance within
specific school corporations. But, again, this would yield a unique plot, such
as the one above, for each corporation. Such a graph may be useful when
interest concerns a specific corporation, or when one wants to assess
variability in specific corporations.
We hope to have demonstrated that dotplots can be useful tools for
gaining an understanding of the variability that exists, or that does not exist,
in the variable(s) of interest. Of course, looking only at a single variable can
be limiting. Another particularly useful function for multilevel data that can
be found in the lattice package is xyplot. This function creates a graph
very similar to a scatterplot matrix for a pair of variables, but accounting for
the nesting structure in the data. For example, the following code produces
an xyplot for geread (y-axis) by gevocab (x-axis), accounting for school
corporation.
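Based on the description of the | and ~ symbols given below, the call would be along these lines:

```r
library(lattice)

# Scatterplot of geread against gevocab, with a separate panel for
# each school corporation; the | symbol defines the grouping
xyplot(geread ~ gevocab | corp, data = Achieve)
```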
FIGURE 6.9
Dotplot of geread for corporations, with the corporations ordered by the mean (lowest to
highest), of the classroom aggregated data. The dots in the plot are the mean of the classroom
scores within each of the corporations.
Notice that here the | symbol defines the grouping/nesting structure, with
the ~ symbol implying that geread is predicted/modeled by gevocab. By
default, the specific names of the grouping structure (here the corporation
numbers) are not plotted on the “strip”. To produce Figure 6.10, which
appears below, we added the strip argument with the following options to
the above code.
strip=strip.custom(strip.names=FALSE, strip.levels=c(FALSE,
TRUE))
FIGURE 6.10
Xyplot of geread (y-axis) as a function of gevocab (x-axis) by corporation.
Our use of the optional strip argument adds the corporation number to
the “strip” above each of the bivariate plots, while the strip.names=FALSE
subcommand removes the “corp” variable name itself.
Of course, any sort of specific conditioning of interest can be done for a
particular graph. For example, we might want to plot the schools in, say,
corporation 940, which can be done by extracting from the Achieve data
only corporation 940, as is done below to produce Figure 6.11.
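A sketch of this extraction and plot follows; the exact form is an assumption consistent with the earlier xyplot discussion:

```r
# Restrict the data to corporation 940, then panel by school within it
Achieve.940 <- Achieve[Achieve$corp == 940, ]
xyplot(geread ~ gevocab | school, data = Achieve.940,
       strip = strip.custom(strip.names = FALSE,
                            strip.levels = c(FALSE, TRUE)))
```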
FIGURE 6.11
Xyplot of geread (y-axis) as a function of gevocab (x-axis) by school within corporation 940.
We have now discussed two lattice functions that can be quite useful for
visualizing grouped/nested data. An additional plotting strategy involves
the assessment of the residuals from a fitted model, as doing so can help
discern when there are violations of assumptions, much as we saw earlier in
this chapter when discussing single-level regression models. Because
residuals are assumed to be uncorrelated with any of the grouping
structures in the model, they can be plotted using the R functions that we
discussed earlier for single-level data. For example, to create a histogram
with a density line plot of the residuals, we first standardize the residuals
and use code much as we did before. Figure 6.12 is produced with the following
syntax:
FIGURE 6.12
Histogram and density plot for standardized residuals from Model 3.1.
hist(scale(resid(Model3.1)),
freq=FALSE, ylim=c(0, .7), xlim=c(-4, 5),
main="Histogram of Standardized Residuals from Model 3.1",
xlab="Standardized Residuals")
lines(density(scale(resid(Model3.1))))
box()
The only differences from the way that we plotted residuals with hist earlier
in the chapter are purely cosmetic in nature. In particular, here we use the
box function to draw a box around the plot and specify the limits of the
y- and x-axes.
Alternatively, a QQ plot can be used to evaluate the assumption of
normality, as described earlier in the chapter. The following code produces
the plot that appears in Figure 6.13.
qqnorm(scale(resid(Model3.1)))
qqline(scale(resid(Model3.1)))
Clearly, the QQ plot (and the associated histogram) illustrates that there are
issues on the high end of the distribution of residuals. The issue, as it turns
out, is not so uncommon in educational research: ceiling effects. In
particular, an examination of the previous plots we have created reveals
that there are a non-trivial number of students who achieved the maximum
FIGURE 6.13
QQ plot of the standardized residuals from Model 3.1.
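The output below comes from a model in which geread is predicted by the verbal reasoning score npaverb, with students nested in schools. The fitting call would be something like the following (the data frame name is an assumption):

```r
# Random intercept model: geread predicted by npaverb, with random
# intercepts for schools
Model6.1 <- lmer(geread ~ npaverb + (1 | school), data = Achieve)
```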
Scaled residuals:
Min 1Q Median 3Q Max
−2.1620 −0.6063 −0.1989 0.3045 4.8067
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1063 0.326
Residual 3.8986 1.974
Number of obs: 10765, groups: school, 163
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.9520863 0.0514590 37.94
npaverb 0.0439044 0.0007372 59.56
From these results, we can see that the fixed effect npaverb has a
statistically significant positive relationship with geread. Using the effects
package, we can visualize this relationship in the form of the line of best fit,
with an accompanying confidence interval. In order to obtain this graph, we
will use the command sequence below.
library(effects)
plot(predictorEffects(Model6.1))
The resulting line plot reflects the relationship described numerically by the
npaverb slope of 0.0439. In order to plot the relationship, we simply
employ the predictorEffects command in R.
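The plotted line is simply the fixed portion of Model 6.1, so its endpoints can be verified by hand from the estimates above:

```r
# Fixed-effect estimates from the Model 6.1 output above
b0 <- 1.9520863
b1 <- 0.0439044
b0 + b1 * c(0, 100)  # predicted geread at the extremes of npaverb
```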
Multiple independent variables can be plotted at once using this basic
command structure. In Model 6.2, we include school socioeconomic status
(ses) in our model.
Scaled residuals:
Min 1Q Median 3Q Max
−2.1458 −0.6085 −0.1988 0.3057 4.7966
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09759 0.3124
Residual 3.89872 1.9745
Number of obs: 10765, groups: school, 163
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.6624055 0.1076872 15.437
npaverb 0.0435635 0.0007467 58.345
ses 0.0042901 0.0014131 3.036
From these results, we see that higher levels of school SES are associated
with higher reading test scores, as are higher scores on the verbal reasoning
test. We can plot these relationships and the associated confidence intervals
simultaneously as below.
plot(predictorEffects(Model6.2))
It is also possible to plot only one of these relationships per graph. We also
added a more descriptive title for each of the following plots, using the
main subcommand.
One very useful aspect of these plots is the inclusion of the 95% confidence
region around the line. From this, we can see that our confidence regarding
the actual nature of the relationships in the population is much greater for
verbal reasoning than for school SES. Of course, given the much larger
level-1 sample size, this makes perfect sense.
It is also possible to include categorical independent variables in these
plots, as with student gender in Model 6.3.
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.106 0.3256
Residual 3.903 1.9756
Number of obs: 10720, groups: school, 163
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.9913008 0.0778283 25.586
npaverb 0.0439564 0.0007387 59.506
gender −0.0276462 0.0383500 −0.721
library(effects)
plot(predictorEffects(Model6.3, ~gender))
Model 6.4 includes npaverb, npanverb, and their interaction, yielding the following output.
Scaled residuals:
Min 1Q Median 3Q Max
−2.2001 −0.6070 −0.1904 0.3154 4.8473
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09077 0.3013
Residual 3.74551 1.9353
Number of obs: 10762, groups: school, 163
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.192e+00 8.592e−02 25.516
npaverb 1.923e-02 1.761e−03 10.920
npanverb 2.896e−03 1.639e−03 1.766
npaverb:npanverb 2.678e−04 2.692e−05 9.950
plot(predictorEffects(Model6.4, ~npaverb*npanverb))
The interaction effect is presented in two plots, one for each variable. The set of
line plots on the right focuses on the relationship between npaverb and
geread, at different levels of npanverb. Conversely, the set of figures on the
left switches the focus to the relationship of npanverb and reading score at
different levels of npaverb. Let’s focus on this latter set of plots. We see that R
has arbitrarily plotted the relationship between npanverb and geread at five
levels of npaverb: 1, 30, 50, 70, and 100. The relationship between npanverb
and geread is very weak when npaverb=1, and strengthens (i.e., the
line becomes steeper) as the value of npaverb increases. Therefore,
we would conclude that nonverbal reasoning is more strongly associated with
reading test scores for individuals who exhibit a higher level of verbal
reasoning. For a discussion as to the details of how these calculations are
made, the reader is referred to the effects package documentation (https://
cran.r-project.org/web/packages/effects/effects.pdf).
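The steepening of these lines can be checked directly from the Model 6.4 estimates above: the simple slope of npanverb at a given value v of npaverb is the npanverb coefficient plus the interaction coefficient times v.

```r
# Coefficients from the Model 6.4 output above
b_npanverb <- 2.896e-03
b_inter    <- 2.678e-04
simple_slope <- function(v) b_npanverb + b_inter * v
simple_slope(c(1, 30, 50, 70, 100))  # slope of npanverb at each plotted npaverb
```

The slopes increase steadily with npaverb, matching the pattern visible in the plots.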
Model 6.5 extends this to a three-way interaction by adding the memory test score (npamem), yielding the following output.
Scaled residuals:
Min 1Q Median 3Q Max
−2.3007 −0.6041 −0.1930 0.3169 4.8402
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.08988 0.2998
Residual 3.73981 1.9339
Number of obs: 10758, groups: school, 163
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.028e+00 1.594e−01 12.722
npaverb 2.162e−02 3.665e−03 5.899
npanverb 1.004e−02 3.577e−03 2.808
npamem 2.796e−03 2.959e−03 0.945
npaverb:npanverb 1.235e−04 6.032e−05 2.048
npaverb:npamem −3.065e−05 6.288e−05 −0.487
npanverb:npamem −1.157e−04 5.986e−05 −1.933
npaverb:npanverb:npamem 2.169e−06 9.746e−07 2.225
plot(predictorEffects(Model6.5, ~npaverb*npanverb*npamem))
For the two-way interaction that was featured in Model 6.4, we saw that
the relationship between npanverb and geread was stronger for higher
levels of npaverb. If we focus on the relationship between npanverb and
geread in the three-way interaction, we see that this pattern continues to
hold, and that it is amplified somewhat for larger values of npamem. In
other words, the relationship between nonverbal reasoning and reading test
scores is stronger for larger values of verbal reasoning, and stronger still for
the combination of higher verbal reasoning and higher memory scores.
Summary
The focus of this chapter was on graphing multilevel data. Exploration of
data using graphs is always recommended for any data analysis problem
and can be particularly useful in the context of multilevel modeling, as we
have seen here. We saw how a scatterplot matrix can provide insights into
relationships among variables that may not be readily apparent from a
simple review of model coefficients. In addition, we learned of the power of
dotplots to reveal interesting patterns at multiple levels of the data
structure. In particular, with dotplots, we were able to visualize mean
differences among classrooms in a school, as well as among individuals
within a classroom. Finally, graphical tools can also be used to assess the
important assumptions underlying linear models in general, and multilevel
models in particular, including normality and homogeneity of residual
variance. In short, analysts should always be mindful of the power of
pictures as they seek to understand relationships in their data.
7
Brief Introduction to Generalized Linear Models
The logistic regression model expresses the log odds (logit) of the target outcome category as a linear function of the predictor:

\ln\left[\frac{p(Y = 1)}{1 - p(Y = 1)}\right] = \beta_0 + \beta_1 x. (7.1)
To illustrate, consider a data set in which each individual's time on a treadmill until fatigue (time) is used to predict whether or not that person has coronary artery disease (group). The model can be fit as follows.
coronary.logistic<-glm(group~time, family=binomial)
In interpreting the results, we need to know which category was defined as the target by the software so
that we can properly interpret the model parameter estimates. By default,
the glm command will treat the higher value as the target. In this case,
0=Healthy and 1=Disease. Therefore, the numerator of the logit will be 1, or
Disease. It is possible to change this so that the lower number is the target,
and the interested reader can refer to help(glm) for more information in
this regard. This is a very important consideration, as the results would be
completely misinterpreted if R used a different specification than the user
thinks was used. The results of the summary command appear below.
Call:
glm(formula = group ~ time, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
−2.1387 −0.3077 0.1043 0.5708 1.5286
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.488949 5.876693 2.295 0.0217 *
coronary$time −0.016534 0.007358 −2.247 0.0246 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The odds of having heart disease are multiplied by exp(−0.016534) = 0.984 for each
additional second of walking. Thus, for an additional minute of walking, the
odds are multiplied by exp(−0.016534 × 60) = 0.371.
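These odds calculations can be reproduced by exponentiating the time coefficient (assuming, as the ×60 conversion in the text implies, that time is recorded in seconds):

```r
b_time <- -0.016534  # slope for time from the summary output above
exp(b_time)          # odds multiplier per additional second, about 0.984
exp(b_time * 60)     # odds multiplier per additional minute, about 0.371
```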
Adjacent to the coefficient column is the standard error, which measures
the sampling variation in the parameter estimate. The estimate divided by the
standard error yields the test statistic, which appears under the z column. This
is the test statistic for the null hypothesis that the coefficient is equal to 0. Next
to z is the p-value for the test. Using standard practice, we would conclude
that a p-value less than 0.05 indicates statistical significance. In addition, R
provides a simple heuristic for interpreting these results based on the *. For
this example, the p-value of 0.0246 for time means that there is a statistically
significant relationship between time on the treadmill to fatigue and the odds
of an individual having coronary artery disease. The negative sign for the
estimate further tells us that more time spent on the treadmill was associated
with a lower likelihood of having heart disease.
One common approach to assessing the quality of the model fit to the data
is by examining the deviance values. For example, the Residual Deviance
compares the fit between a model that is fully saturated, meaning that it
perfectly fits the data, and our proposed model. Residual Deviance is
measured using a χ2 statistic that compares the predicted outcome value
with the actual value, for each individual in the sample. If the predictions
are very far from the actual responses, this χ2 will tend to be a large value,
indicating that the model is not very accurate. In the case of the Residual
Deviance, we know that the saturated model will always provide optimal fit
to the data at hand, though in practice it may not be particularly useful for
understanding the relationship between x and y in the population because it
will have a separate model parameter for every cell in the contingency table
relating the two variables, and thus may not generalize well. The proposed
model will always be more parsimonious than the saturated model (i.e., it will
have fewer parameters), and will therefore generally be more interpretable and
generalizable to other samples from the same population, assuming that it
does in fact provide adequate fit to the data. With appropriately sized
samples, the residual deviance can be interpreted as a true χ2 test, and the
p-value can be obtained in order to determine whether the fit of the
proposed model is significantly worse than that of the saturated model.
The null hypothesis for this test is that model fit is adequate, i.e., the fit of
the proposed model is close to that of the saturated model. With a very
small sample such as this one, this approximation to the χ2 distribution does
not hold (Agresti, 2002), and we must therefore be very careful in how we
interpret the statistic. For pedagogical purposes, let’s obtain the p-value for
χ2 of 12.966 with 18 degrees of freedom. This value is 0.7936, which is larger
than the α cut-off of 0.05, indicating that we cannot reject that the proposed
model fits the data as well as the saturated model. Thus, we would then
retain the proposed model as being sufficient for explaining the relation-
ships between the independent and dependent variables.
The other deviance statistic for assessing fit that is provided by R is the
Null Deviance, which tests the null hypothesis that the proposed model
does not fit the data better than a model in which the average odds of
having coronary artery disease is used as the predicted outcome for every
time value (i.e., that x is not linearly predictive of the probability of having
coronary heart disease). A significant result here would suggest that the
proposed model is better than no model at all. Again, however, we must
interpret this test with caution when our sample size is very small, as is the
case here. For this example, the p-value of the Null Deviance test (χ2 = 27.726
with 19 degrees of freedom) was 0.0888. As with the Residual Deviance test,
the result is not statistically significant at α = 0.05, suggesting that the
proposed model does not provide a better fit than the null model with no
relationships. Of course, given the small sample size, we must interpret
both hypothesis tests with some caution.
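Both deviance p-values quoted above can be reproduced directly with pchisq:

```r
# Residual Deviance test: chi-square = 12.966 on 18 degrees of freedom
1 - pchisq(12.966, df = 18)  # ~0.79, fit not significantly worse than saturated
# Null Deviance test: chi-square = 27.726 on 19 degrees of freedom
1 - pchisq(27.726, df = 19)  # ~0.089, not significant at alpha = 0.05
```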
Finally, R also provides the AIC value for the model. As we have seen in
previous chapters, AIC is a useful statistic for comparing the fit of different,
and not necessarily nested, models with smaller values indicating better
relative fit. If we wanted to assess whether including additional indepen-
dent variables or interactions improved model fit, we could compare AIC
values among the various models to ascertain which was optimal. For the
current example, there are no other independent variables of interest.
However, it is possible to obtain the AIC for the intercept only model using
the following command. The purpose behind doing so would be to
determine whether including the time walking on the treadmill actually
improved model fit, after the penalty for model complexity was applied.
coronary.logistic.null<-glm(group~1, family=binomial)
The AIC for this intercept only model was 29.726, which is larger than the
16.966 for the model including Time. Based on AIC, along with the
hypothesis test results discussed above, we would therefore conclude that
the full model including Time provided a better fit to the outcome of
coronary artery disease.
In the next section, we will work with models for outcome variables whose
categories are ordered in nature.
As a way to motivate our discussion of ordinal logistic regression models,
let’s consider the following example. A dietitian has developed a behavior
management system for individuals suffering from obesity that is designed
to encourage a healthier lifestyle. One such healthy behavior is the
preparation of their own food at home using fresh ingredients rather than
dining out or eating prepackaged foods. Study participants consisted of 100
individuals who were under a physician’s care for a health issue directly
related to obesity. Members of the sample were randomly assigned to either
a control condition in which they received no special instruction in how to
plan and prepare healthy meals from scratch, or a treatment condition in
which they did receive such instruction. The outcome of interest was a
rating provided 2 months after the study began in which each subject
indicated the extent to which they prepared their own meals. The response
scale ranged from 0 (Prepared all of my own meals from scratch) to 4 (Never
prepared any of my own meals from scratch), so that lower values were
indicative of a stronger predilection to prepare meals at home from scratch.
The dietitian is interested in whether there are differences in this response
between the control and treatment groups.
One commonly used model for ordinal data such as these is the
cumulative logits model, which is expressed as follows:
\text{logit}[P(Y \leq j)] = \ln\left[\frac{P(Y \leq j)}{1 - P(Y \leq j)}\right]. (7.2)
In this model, there are J-1 logits where J is the number of categories in the
dependent variable, and Y is the actual outcome value. Essentially, this
model compares the likelihood of the outcome variable taking a value of j or
lower, versus outcomes larger than j. For the current example, there would
be four separate logits:
\ln\left[\frac{P(Y = 0)}{P(Y = 1) + P(Y = 2) + P(Y = 3) + P(Y = 4)}\right] = \beta_{01} + \beta_1 x

\ln\left[\frac{P(Y = 0) + P(Y = 1)}{P(Y = 2) + P(Y = 3) + P(Y = 4)}\right] = \beta_{02} + \beta_1 x

\ln\left[\frac{P(Y = 0) + P(Y = 1) + P(Y = 2)}{P(Y = 3) + P(Y = 4)}\right] = \beta_{03} + \beta_1 x

\ln\left[\frac{P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)}{P(Y = 4)}\right] = \beta_{04} + \beta_1 x (7.3)
Note that the slope, \beta_1, is the same for each logit; this restriction is known as
the proportional odds assumption, which states that this slope is identical
across logits. In order to fit the cumulative logits model to our data in R, we
use the polr function from the MASS package, as in this example.
cooking.cum.logit<-polr(cook~treatment, method=c("logistic"))
Call:
polr(formula = cook ~ treatment, method = c("logistic"))
Coefficients:
Value Std. Error t value
treatment −0.7963096 0.3677003 −2.165649
Intercepts:
Value Std. Error t value
0|1 −2.9259 0.4381 −6.6783
1|2 −1.7214 0.3276 −5.2541
2|3 −0.2426 0.2752 −0.8816
3|4 1.3728 0.3228 4.2525
Residual Deviance: 293.1349
AIC: 303.1349
After the function call, we see the results for the independent variable,
treatment. The coefficient value is −0.796, indicating that a higher value
on the treatment variable (i.e., treatment=1) was associated with a greater
likelihood of providing a lower response on the cooking item. Remember
that lower responses to the cooking item reflected a greater propensity to
eat scratch-made food at home. Thus, in this example, those in the treatment
condition had a greater likelihood of eating scratch-made food at home.
Adjacent to the coefficient value is the standard error for the slope, which is
divided into the coefficient in order to obtain the t-statistic residing in the
final column. We note that there is not a p-value associated with this
t statistic, because in the GLiM context, this value only follows the
t distribution asymptotically (i.e., for large samples). In other cases, it is
simply an indication of the relative magnitude of the relationship between
the treatment and outcome variable. In this context, we might consider the
relationship to be “significant” if the t value exceeds 2, which is approxi-
mately the t critical value for a two-tailed hypothesis test with α = 0.05 and
infinite degrees of freedom. Using this criterion, we would conclude that
there is indeed a statistically significant negative relationship between the
treatment condition and responses to the cooking item.
We can obtain a p-value for the overall fit of this model by comparing its deviance to a χ2 distribution with the model's residual degrees of freedom.
1-pchisq(deviance(cooking.cum.logit),
df.residual(cooking.cum.logit))
[1] 0
The p-value is extremely small (rounded to 0), indicating that the model as a
whole does not provide a very good fit to the data. This could mean that to
obtain a better fit, we need to include more independent variables with a
strong relationship to the dependent variable. However, if our primary interest is in
determining whether there are treatment differences in cooking behavior, then
this overall test of model fit may not be crucial, since we are able to answer the
question regarding the relationship of treatment to the cooking behavior item.
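The fitted cumulative logits model can also be turned into predicted category probabilities. In polr's parameterization, logit P(Y ≤ j) = ζ_j − βx, so using the intercepts and treatment slope from the output above:

```r
zeta <- c(-2.9259, -1.7214, -0.2426, 1.3728)  # the 0|1 ... 3|4 intercepts
b    <- -0.7963096                            # treatment coefficient
cum_prob <- function(x) plogis(zeta - b * x)  # P(Y <= j) for j = 0, ..., 3
cum_prob(0)  # control group
cum_prob(1)  # treatment group: every cumulative probability is larger,
             # i.e., more mass on the low (cook-at-home) responses
```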
When the outcome categories are unordered, the multinomial logits model can be used instead:

\ln\left[\frac{p(Y = i)}{p(Y = j)}\right] = \beta_{i0} + \beta_{i1} x (7.4)
In this model, category j will always serve as the reference group against
which the other categories, i, are compared. There will be a different logit
for each non-reference category, and each of the logits will have a unique
intercept (\beta_{i0}) and slope (\beta_{i1}). Thus, unlike with the cumulative logits model
in which a single slope represented the relationship between the indepen-
dent variable and the outcome, in the multinomial logits model we have
multiple slopes for each independent variable, one for each logit. Therefore,
we do not need to make the proportional odds assumption, which makes
this model a useful alternative to the cumulative logits model when that
assumption is not tenable. The disadvantage of using the multinomial logits
model with an ordinal outcome variable is that the ordinal nature of the
data is ignored. Any of the categories can serve as the reference, with the
decision being based on the research question of most interest (i.e., against
which group would comparisons be most interesting), or on pragmatic
concerns such as which group is the largest, should the research questions
not serve as the primary deciding factor. Finally, it is possible to compare
the results for two non-reference categories using the equation
\ln\left[\frac{P(Y = i)}{P(Y = m)}\right] = \ln\left[\frac{P(Y = i)}{P(Y = j)}\right] - \ln\left[\frac{P(Y = m)}{P(Y = j)}\right]. (7.5)
For the present example, we will set the conservative group to be the reference,
and fit a model in which age is the independent variable and political
viewpoint is the dependent variable. In order to do so, we will use the
function multinom within the nnet package, which will need to be installed
prior to running the analysis. We would then use the command library(nnet)
to make the functions in this library available. The data were read into
the R data frame politics, containing the variables age and viewpoint,
which were coded as C (conservative), M (moderate), or L (liberal) for each
individual in the sample. Age was expressed as the number of years of
age. The R command to fit the multinomial logistic regression model is
politics.multinom<-multinom(viewpoint~age, data=politics)
producing the following output.
# weights: 9 (4 variable)
initial value 1647.918433
final value 1617.105227
converged
This message simply indicates the initial and final values of the maximum
likelihood fitting function, along with the information that the model converged.
In order to obtain the parameter estimates and standard errors, we use the
summary command.
summary(politics.multinom)
Call:
multinom(formula = viewpoint ~ age, data = politics)
Coefficients:
(Intercept) age
L 0.4399943 −0.016611846
M 0.3295633 −0.004915465
Std. Errors:
(Intercept) age
L 0.1914777 0.003974495
M 0.1724674 0.003415578
Residual Deviance: 3234.210
AIC: 3242.210
Based on these results, we see that the slope relating age to the logit
comparing self-identification as liberal (L) versus conservative is −0.0166, indicating that older
individuals had a lower likelihood of being liberal versus conservative. In
order to determine whether this relationship is statistically significant, we
can calculate a 95% confidence interval using the coefficient and the
standard error for this term. This interval is constructed as
−0.0166 ± 2(0.0040)
−0.0166 ± 0.0080
(−0.0246, −0.0086)
Because 0 is not in this interval, it is not a likely value of the coefficient in the
population, leading us to conclude that the coefficient is statistically
significant. In other words, we can conclude that in the population, older
individuals are less likely to self-identify as liberal than as conservative. We
can also construct a confidence interval for the coefficient relating age to the
logit for moderate to conservative:
−0.0049 ± 2(0.0034)
−0.0049 ± 0.0068
(−0.0117, 0.0019)
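Both intervals can be computed directly from the multinom estimates and standard errors reported above:

```r
# Approximate 95% Wald interval: estimate +/- 2 * standard error
wald_ci <- function(est, se) est + c(-2, 2) * se
wald_ci(-0.016611846, 0.003974495)  # liberal vs. conservative: excludes 0
wald_ci(-0.004915465, 0.003415578)  # moderate vs. conservative: includes 0
```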
Thus, because 0 does lie within this interval, we cannot conclude that there
is a significant relationship between age and the logit. In other words, age is
not related to the political viewpoint of an individual when it comes to
comparing moderate versus conservative. Finally, we can calculate esti-
mates for comparing L and M by applying equation 7.5.
\ln\left[\frac{P(Y = L)}{P(Y = M)}\right] = \ln\left[\frac{P(Y = L)}{P(Y = C)}\right] - \ln\left[\frac{P(Y = M)}{P(Y = C)}\right]

= (0.4400 - 0.0166(\text{age})) - (0.3300 - 0.0049(\text{age}))

= 0.4400 - 0.3300 - 0.0166(\text{age}) + 0.0049(\text{age})

= 0.1100 - 0.0117(\text{age})
Taken together, we would conclude that older individuals are less likely to
be liberal than conservative, and less likely to be liberal than moderate.
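The subtraction in equation 7.5 amounts to simply differencing the two sets of fitted coefficients:

```r
coef_L <- c(0.4400, -0.0166)  # intercept and age slope for L vs. C
coef_M <- c(0.3300, -0.0049)  # intercept and age slope for M vs. C
coef_L - coef_M               # logit for L vs. M: 0.1100 - 0.0117 * age
```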
For outcomes that are counts, the Poisson regression model relates the natural log of the expected count to the predictor:

\ln(Y) = \beta_0 + \beta_1 x. (7.6)
In all other respects, the Poisson model is similar to other regression models
in that the relationship between the independent and dependent variables is
expressed through the slope, \beta_1. And again, the assumption underlying the
Poisson model is that the mean is equal to the variance. This assumption is
typically expressed by stating that the overdispersion parameter, ϕ = 1. The ϕ
parameter appears in the Poisson distribution density and thus is a key
component in the fitting function used to determine the optimal model
parameter estimates in maximum likelihood. A thorough review of this
fitting function is beyond the scope of this book. Interested readers are
referred to Agresti (2002) for a complete presentation of this issue.
Estimating the Poisson regression model in R can be done with the glm
function that we used previously for dichotomous logistic regression.
Consider an example in which a demographer is interested in determining
whether there exists a relationship between the socioeconomic status (sei)
of a family and the number of children under the age of 6 months (babies)
that are living in the home. We first read the data and name it ses_babies.
We then attach it using attach(ses_babies). In order to see the
distribution of the number of babies, we can use the command
hist(babies), with the resulting histogram appearing below.
We can see that 0 was the most common response by individuals in the
sample, with the maximum number being 3.
In order to fit the model with the glm function, we would use the
following function call.
babies.poisson<-glm(babies~sei, family=c("poisson"), data=ses_babies)
summary(babies.poisson)
Call:
glm(formula = babies ~ sei, family = c("poisson"), data = ses_babies)
Deviance Residuals:
Min 1Q Median 3Q Max
−0.7312 −0.6914 −0.6676 −0.6217 3.1345
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) −1.268353 0.132641 −9.562 <2e−16 ***
sei −0.005086 0.002900 −1.754 0.0794 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1237.8 on 1496 degrees of freedom
Residual deviance: 1234.7 on 1495 degrees of freedom
(3 observations deleted due to missingness)
AIC: 1803
Number of Fisher Scoring iterations: 6
These results show that sei did not have a statistically significant
relationship with the number of children under 6 months old living in the
home (p=0.0794). We can use the following command to obtain the p-value
for the test of the null hypothesis that the model fits the data.
1-pchisq(deviance(babies.poisson),
df.residual(babies.poisson))
[1] 0.9999998
This large p-value indicates that we cannot reject the null hypothesis that the model fits the data adequately. When count data are overdispersed, one alternative to the Poisson model is the quasipoisson model, which can be fit with the same glm function.
babies.quasipoisson<-glm(babies~sei, family=c("quasipoisson"), data=ses_babies)
summary(babies.quasipoisson)
Call:
glm(formula = babies ~ sei, family = c("quasipoisson"), data =
ses_babies)
Deviance Residuals:
Min 1Q Median 3Q Max
−0.7312 −0.6914 −0.6676 −0.6217 3.1345
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) −1.268353 0.150108 −8.45 <2e−16 ***
sei −0.005086 0.003282 −1.55 0.121
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 1.280709)
Null deviance: 1237.8 on 1496 degrees of freedom
Residual deviance: 1234.7 on 1495 degrees of freedom
(3 observations deleted due to missingness)
AIC: NA
Number of Fisher Scoring iterations: 6
As noted above, the coefficients themselves are the same in the quasipoisson
and Poisson regression models. However, the standard errors in the former
are somewhat larger than those in the latter. In addition, the estimate of ϕ is
provided for the quasipoisson model, and is 1.28 in this case. While this is
not exactly equal to 1, it is also not markedly larger, suggesting that the data
are not terribly overdispersed. We can test for model fit as we did with the
Poisson regression, using the command
1-pchisq(deviance(babies.quasipoisson),
df.residual(babies.quasipoisson))
[1] 0.9999998
And, as with the Poisson, the quasipoisson model also fit the data
adequately.
A second alternative to the Poisson model when data are overdispersed is
a regression model based on the negative binomial distribution. The mean
of the negative binomial distribution is identical to that of the Poisson, while
the variance is
\text{var}(Y) = \mu + \frac{\mu^2}{\theta}. (7.7)
From equation 7.7, it is clear that as θ increases in size, the variance approaches
the mean, and the distribution becomes more like the Poisson. It is possible for
the researcher to provide a value for θ if it is known that the data come from a
particular distribution with a known θ. For example, when θ = 1, the data are
modeled from the Gamma distribution. However, for most applications, the
distribution is not known, in which case θ will be estimated from the data.
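Equation 7.7's behavior can be illustrated numerically (the mean of 2 used here is purely hypothetical; θ = 0.605 is the estimate reported later in this section):

```r
# Negative binomial variance as a function of the mean and theta
nb_var <- function(mu, theta) mu + mu^2 / theta
nb_var(2, 0.605)  # small theta: variance well above the mean
nb_var(2, 1e6)    # very large theta: variance essentially equals the mean,
                  # as in the Poisson distribution
```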
The negative binomial distribution can be fit to the data in R using the
glm.nb function that is part of the MASS library. For the current example,
the R commands to fit the negative binomial model and obtain the output
would be
library(MASS)
babies.nb<-glm.nb(babies~sei, data=ses_babies)
summary(babies.nb)
Call:
glm.nb(formula = babies ~ sei, data = ses_babies, init.theta =
0.60483559440229,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
−0.6670 −0.6352 -0.6158 -0.5778 2.1973
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) −1.260872 0.156371 −8.063 7.42e−16 ***
sei −0.005262 0.003386 −1.554 0.120
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 x log-likelihood: −1749.395
Just as we saw with the quasipoisson regression, the parameter estimates for
the negative binomial regression are nearly identical to those for the Poisson.
This similarity reflects the common mean structure that the distributions share.
However, the standard errors for the estimates differ across the three models,
although those for the negative binomial are very similar to those from the
quasipoisson. Indeed, the resulting hypothesis test results provide the same
answer for all three models, that there is not a statistically significant
relationship between the sei and the number of babies living in the
home. In addition to the parameter estimates and standard errors, we obtain
an estimate of θ of 0.605. In terms of determining which model is optimal, we
can compare the AIC from the Negative binomial (1755.4) to that of the
Poisson (1803), to conclude that the former provides a somewhat better fit to
the data than the latter. In short, it would appear that the data are somewhat
overdispersed as the model designed to account for this (negative binomial)
provides a better fit than the Poisson, which assumes no overdispersion.
From a more practical perspective, the results of the two models are very
similar, and a researcher using α = 0.05 would reach the same conclusion
regarding the lack of relationship between sei and the number of babies
living in the home, regardless of which model they selected.
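The AIC of 1755.4 for the negative binomial model follows directly from the −2 × log-likelihood in the output, with three estimated parameters (the intercept, the sei slope, and θ):

```r
neg2_loglik <- 1749.395  # "2 x log-likelihood: -1749.395" from the output
k <- 3                   # intercept, sei slope, and theta
neg2_loglik + 2 * k      # AIC = 1755.395, vs. 1803 for the Poisson
```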
Summary
This chapter marks a major change in direction in terms of the type of data
upon which we will focus. Through the first six chapters, we have been
concerned with models in which the dependent variable is continuous, and
generally assumed to be normally distributed. In this chapter, we learned
about a variety of models designed for categorical dependent variables. In
perhaps the simplest instance, such variables can be dichotomous, so that
logistic regression is most appropriate for data analysis. When the outcome
variable has more than two ordered categories, we saw that logistic
regression can be easily extended in the form of the cumulative logits
model. For dependent variables with unordered categories, the multinomial
logits model is the typical choice and can be easily employed with R.
Finally, we examined dependent variables that are counts, in which case we
may choose Poisson regression, the quasipoisson model, or the negative
binomial model, depending upon how frequently the outcome being
counted occurs. As with Chapter 1, the goal of this chapter was primarily
to provide an introduction to the single-level versions of the multilevel
models to come. In the next chapter, we will see that the model types
described here can be extended into the multilevel context using our old
friends lme and lmer.
8
Multilevel Generalized Linear Models (MGLMs)
As an example, consider a state mathematics achievement test, for which all examinees are categorized
as either passing (1) or failing (0). Given that the outcome variable is
dichotomous, we could use the binary logistic regression method intro-
duced in Chapter 7. However, students in this sample are clustered by
school, as was the case with the data that were examined in Chapters 3
and 4. Therefore, we will need to appropriately account for this multilevel
data structure in our regression analysis.
library(lme4)
attach(mathfinal)
summary(model8.1<-glmer(score2~numsense+(1|school), family=binomial,
na.action=na.omit, data=mathfinal))
Scaled residuals:
Min 1Q Median 3Q Max
−5.2722 −0.7084 0.2870 0.6448 3.4279
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.2888 0.5374
Number of obs: 9316, groups: school, 40
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) −11.659653 0.305358 −38.18 <2e−16 ***
numsense 0.059177 0.001446 40.94 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The function call is similar to what we saw with linear models in Chapter 3.
In terms of interpretation of the results, we first examine the variability in
intercepts from school to school. This variation is presented as both the
variance and standard deviation of the U0j terms from Chapter 2 (τ₀²), which
are 0.2888 and 0.5374, respectively, for this example. The modal value of the
intercept across schools is −11.659653. With regard to the fixed effect, the
slope of numsense, we see that higher scores are associated with a greater
likelihood of passing the state math assessment, with the slope being
0.059177 (p < .05). (Remember that R models the larger value of the outcome
in the numerator of the logit, and in this case passing was coded as 1
whereas failing was coded as 0.) The standard error, test statistic, and
p-value appear in the next three columns. The results are statistically
significant (p<0.001), leading to the conclusion that overall, number sense
scores are positively related to the likelihood of a student achieving a
passing score on the assessment. Finally, we see that the correlation
between the slope and intercept is strongly negative (−0.955). Given that
this is an estimate of the relationship between two fixed effects, we are not
particularly interested in it. Information about the residuals appears at the
very end of the output.
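Because the model is on the logit scale, the reported estimates can be translated into more interpretable quantities: an odds ratio for the slope, a predicted probability at a chosen predictor value, and a latent-scale intraclass correlation from the intercept variance. The following base-R sketch uses the estimates reported above; the numsense value of 200 is an arbitrary illustrative score, not one taken from the data.

```r
# Estimates reported in the Model 8.1 output
b0   <- -11.659653   # fixed intercept
b1   <- 0.059177     # slope for numsense
tau2 <- 0.2888       # random intercept variance

# Odds ratio: each additional numsense point multiplies the odds of passing
or <- exp(b1)                   # about 1.061, i.e., a 6.1% increase in odds

# Predicted probability of passing for a student at a hypothetical
# numsense score of 200, in an average school (random effect set to 0)
p200 <- plogis(b0 + b1 * 200)   # about 0.54

# Latent-scale intraclass correlation for a logistic multilevel model:
# tau0^2 / (tau0^2 + pi^2/3)
icc <- tau2 / (tau2 + pi^2 / 3) # about 0.08
```

By this sketch, roughly 8% of the variability in the latent propensity to pass lies between schools.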
As we discussed in Chapter 3, it is useful for us to obtain confidence
intervals for parameter estimates of both the fixed and random effects in
the model. With lme4, we have several options in this regard, using the
confint function, as we saw in Chapter 3. Indeed, precisely the same
options are available for multilevel generalized models fit using glmer, as
was the case for models fit using the lmer function. We would refer the
reader to Chapter 3 for a review of how each of these methods works. In the
following text, we demonstrate the use of each of these confidence interval
types for Model 8.1, and then discuss the implications of these results for
interpretation of the problem at hand.
#Percentile Bootstrap
confint(model8.1, method=c("boot"), boot.type=c("perc"))
#Basic Bootstrap
confint(model8.1, method=c("boot"), boot.type=c("basic"))
#Normal Bootstrap
confint(model8.1, method=c("boot"), boot.type=c("norm"))
#Wald
confint(model8.1, method=c("Wald"))
2.5 % 97.5 %
.sig01 NA NA
(Intercept) −12.25814446 −11.06116177
numsense 0.05634366 0.06201022
#Profile
confint(model8.1, method=c("profile"))
For the random effect, all of the methods for calculating confidence intervals
yield similar results, in that the lower bound is approximately 0.41–0.42 and
the upper bound is between 0.67 and 0.70. Regardless of the method, we see
that 0 is not in the interval, and thus we would conclude that the random
intercept variance is statistically significant, i.e., there are differences in the
intercept across schools. Similarly, the confidence intervals for the fixed
effects (intercept and the coefficient for numsense) also did not include 0,
indicating that they were statistically significant as well. In particular, this
would lead us to the conclusion that there is a positive relationship between
numsense and the likelihood of receiving a passing test score, which we
noted above.
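The Wald interval shown above can be reproduced by hand from a fixed-effect estimate and its standard error: estimate ± z(0.975) × SE. A minimal check using the reported values for numsense:

```r
est <- 0.059177    # reported slope for numsense
se  <- 0.001446    # reported standard error

# 95% Wald confidence interval: estimate +/- 1.96 * SE
ci <- est + c(-1, 1) * qnorm(0.975) * se
round(ci, 5)       # approximately (0.05634, 0.06201), matching confint()
```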
In addition to the model parameter estimates, the results from glmer
include information about model fit, in particular values for the AIC and
BIC. As we have discussed in previous chapters, these statistics can be used
to compare the relative fit of various models in an attempt to pick the
optimal one, with smaller values indicating a better model fit. As well as
comparing relative model fit, the fit of two nested models created by glmer
can also be compared with one another in the form of a likelihood ratio test
with the anova function. The null hypothesis of this test is that the fit of two
nested models is equivalent, so that a statistically significant result (i.e.,
p≤0.05) would indicate that the models provide a different fit to the data,
with the more complicated (fuller) model typically providing an improve-
ment in fit beyond what would be expected with the additional parameters
added. We will demonstrate the use of this test in the next section.
summary(model8.2<-glmer(score2~numsense+
(numsense|school), family=binomial, data=mathfinal))
Scaled residuals:
Min 1Q Median 3Q Max
−4.7472 −0.6942 0.2839 0.6352 3.6943
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 2.170e+01 4.65843
numsense 4.105e−04 0.02026 −1.00
Number of obs: 9316, groups: school, 40
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) −12.901304 0.836336 −15.43 <2e−16 ***
numsense 0.064903 0.003735 17.38 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We will focus on aspects of the output for the random coefficients model
output that differ from that of the random intercepts. In particular, note that
we have an estimate of τ₁², the variance of the U1j estimates for specific schools. This
value, 0.00041, is relatively small when compared to the variation of intercepts
across schools (21.70), meaning that the relationship of number sense with the
likelihood of an individual receiving a passing score on the math achievement
test is relatively similar across the schools. The modal slope across schools is
0.064903, again indicating that individuals with higher number sense scores
also have a higher likelihood of passing the math assessment. Finally, it is
important to note that the correlation between the random components of the
slope and intercept, the standardized version of τ01, is very strongly negative.
As with the random intercept model, we can obtain confidence intervals
for the random and fixed effects of the random coefficients model. The same
options are available as was the case for the random intercept model. For
this example, we will use the Wald, profile, and percentile bootstrap
confidence intervals.
#Wald
confint(model8.2, method=c("Wald"))
2.5% 97.5%
.sig01 NA NA
.sig02 NA NA
.sig03 NA NA
(Intercept) −14.54049182 −11.26211656
numsense 0.05758284 0.07222221
#Profile
confint(model8.2, method=c("profile"))
#Percentile Bootstrap
confint(model8.2, method=c("boot"), boot.type=c("perc"))
2.5% 97.5%
.sig01 3.18387655 6.01887096
.sig02 −0.99869150 −0.99035490
.sig03 0.01367105 0.02621692
(Intercept) −14.88458123 −11.29933632
numsense 0.05781441 0.07362581
None of the confidence intervals for the random effects included 0, leading
us to conclude that each of them is likely to be different from 0 in the
population.
The inclusion of AIC and BIC in the glmer output allows for a direct
comparison of model fit, thus aiding in the selection of the optimal model
for the data. As a brief reminder, AIC and BIC are both measures of
unexplained variation in the data with a penalty for model complexity.
Therefore, models with lower values provide a relatively better fit.
Comparison of either AIC or BIC between Models 8.1 (AIC=9835.9,
BIC=9857.4) and 8.2 (AIC=9768.7, BIC=9804.4) reveals that the latter
provides a better fit to the data. We do need to remember that AIC and
BIC are not significance tests, but rather are measures of relative model fit.
In addition to the relative fit indices, we can compare the fit of the two
models using the anova command, as we demonstrated in Chapter 3.
anova(model8.1, model8.2)
Data: NULL
Models:
model8.1: score2 ~ numsense + (1 | school)
summary(model8.3<-glmer(score2~numsense+female+L_Free+(1|school),
family=binomial, data=mathfinal, na.action=na.omit))
Scaled residuals:
Min 1Q Median 3Q Max
−4.8756 −0.6792 0.2631 0.6264 3.7674
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.283 0.532
Number of obs: 7066, groups: school, 34
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) −12.084501 0.418795 −28.86 <2e−16 ***
numsense 0.063637 0.001736 36.65 <2e−16 ***
female −0.026003 0.057845 −0.45 0.6530
L_Free −0.008653 0.003606 −2.40 0.0164 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results indicate that being female is not significantly related to one’s
likelihood of passing the math achievement test, i.e., there are no gender
differences in the likelihood of passing. Scores on the numsense subscale were
positively associated with the likelihood of passing the test, and attending a
school with a higher proportion of students on free lunch was associated with a
lower such likelihood. The 95% bootstrap percentile confidence intervals for the
fixed and random effects appear below. The interval for the random intercept
effect does not include 0, meaning that this term is not likely to be 0 in the
population. In other words, there are school-to-school differences in the
likelihood of individuals receiving a passing test grade. In addition, intervals
for the intercept, numsense score, and proportion of students on free lunch all
excluded 0, reinforcing the hypothesis test results in the initial glmer output.
As was the case with Model 8.2, we can estimate a random coefficient
model. In this case, we will estimate the random coefficient for female.
summary(model8.4<-glmer(score2~numsense+female+L_Free+
(female|school),family=binomial, data=mathfinal, na.action=
na.omit))
Scaled residuals:
Min 1Q Median 3Q Max
−4.6517 −0.6782 0.2622 0.6264 3.6392
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 0.2288 0.4783
female 0.0105 0.1025 1.00
Number of obs: 7066, groups: school, 34
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) −12.117243 0.416446 −29.097 <2e−16 ***
numsense 0.063693 0.001736 36.694 <2e−16 ***
female −0.021563 0.060778 −0.355 0.7228
L_Free −0.008332 0.003508 −2.375 0.0175 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variance estimate for the random coefficient effect for gender is 0.0105.
The coefficients for the other variables in Model 8.4 are quite similar to those
in Model 8.3. In order to ascertain whether the random coefficient for
These results show that the correlation between the random intercept and the
random coefficient for female (.sig02) is not different from 0, nor is the fixed
coefficient linking female to the outcome variable, because in both cases the
confidence interval includes 0.
We can use the relative fit indices to make some judgments regarding
which of these two models might be optimal for better understanding the
population. Both AIC and BIC are very slightly smaller for Model 8.3
(7299.9 and 7334.2), as compared to Model 8.4 (7301.1 and 7349.1),
indicating that the simpler model (without the female random coefficient
term) might be preferable. In addition, the results of the likelihood ratio test,
which appear below, reveal that the fit of the two models is not statistically
significantly different. Given this statistically equivalent fit, we would
prefer the simpler model, without the random coefficient effect for female.
anova(model8.3, model8.4)
Data: mathfinal
Models:
model8.3: score2 ~ numsense + female + L_Free + (1 | school)
model8.4: score2 ~ numsense + female + L_Free + (female | school)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
model8.3 5 7299.9 7334.2 −3644.9 7289.9
model8.4 7 7301.1 7349.1 −3643.5 7287.1 2.7763 2 0.2495
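The chi-square statistic in this table is simply the difference in the two deviances, and its p-value comes from a chi-square distribution with degrees of freedom equal to the difference in the number of parameters (here 7 − 5 = 2, for the added random slope variance and slope-intercept correlation). The reported values can be verified directly:

```r
# Values from the anova() table above
chisq <- 2.7763    # difference in deviances (7289.9 - 7287.1, before rounding)
df    <- 7 - 5     # extra parameters in Model 8.4

p <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p, 4)        # 0.2495, matching the Pr(>Chisq) column
```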
above. To provide context, let’s again consider the math achievement results
for students. In this case, the outcome variable takes one of three possible
values for each member of the sample: 1=Failure, 2=Pass, 3=Pass with
distinction. In this case, the question of most interest to the researcher is
whether a computation aptitude score is a good predictor of status on the
math achievement test.
library(ordinal)
summary(model8.5<-clmm(as.factor(score)~computation+(1|school),
data=mathfinal, na.action=na.omit))
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.3844 0.62
Number of groups: school 40
Coefficients:
Estimate Std. Error z value Pr(>|z|)
computation 0.06977 0.00143 48.78 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
1|2 13.6531 0.3049 44.77
2|3 17.2826 0.3307 52.26
One initial point to note is that the syntax for clmm is very similar in form to that
for lmer. As with most R model syntax, the outcome variable (score) is
separated from the fixed effect (computation) by ~, and the random effect,
school, is included in parentheses along with 1, to denote that we are fitting a
random intercepts model. We should also note that the dependent variable needs
to be a factor, leading to our use of as.factor(score) in the command
sequence.
An examination of the results presented above reveals that the variance and
standard deviation of intercepts across schools are 0.3844 and 0.62, respec-
tively. Given that the variation is not near 0, we would conclude that there
appear to be differences in intercepts from one school to the next. In addition,
we see that there is a significant positive relationship between performance on
the computation aptitude subtest and performance on the math achievement
test, indicating that examinees who have higher computation skills are more
likely to attain higher ordinal scores on the achievement test, e.g., pass versus
fail or pass with distinction versus pass. We also obtain estimates of the model
intercepts, which are termed thresholds by clmm. As was the case for the
single-level cumulative logits model, the intercept represents the log odds of
the likelihood of one response versus the other (e.g., 1 versus 2) when the value
of the predictor variable is 0. A computation score of 0 would indicate that the
examinee did not correctly answer any of the items on the test. Applying this
fact to the first intercept presented above, along with the exponentiation of the
intercept that was demonstrated in the previous chapter, we can conclude that
the odds of a person with a computation score of 0 failing the math
achievement exam are e^13.6531 = 850,092.12 to 1, or quite high! Finally, we
also have available the AIC value (13365.74), which we can use to compare the
relative fit of this to other models.
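Because clmm models cumulative logits, the category probabilities for any predictor value can be recovered from the thresholds as P(Y ≤ j) = plogis(θⱼ − βx). A sketch using the reported estimates; the computation score of 200 is an arbitrary illustrative value, not one taken from the data.

```r
# Estimates reported in the Model 8.5 output
theta <- c(13.6531, 17.2826)   # thresholds 1|2 and 2|3
beta  <- 0.06977               # slope for computation
x     <- 200                   # hypothetical computation score

cum <- plogis(theta - beta * x)        # P(Y <= 1) and P(Y <= 2)
probs <- c(fail        = cum[1],
           pass        = cum[2] - cum[1],
           distinction = 1 - cum[2])
round(probs, 3)   # roughly 0.425, 0.540, 0.035
```

The three probabilities necessarily sum to 1; at this hypothetical score, a pass is the most likely outcome, and a pass with distinction remains rare.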
We can fit a multilevel model including a random coefficient effect for our
independent variable, computation. The basic concepts underlying the
random coefficient models that we have discussed previously in the book
(and in this chapter) apply here as well. The R code and resulting output for
this analysis using clmm appears below.
summary(model8.5b<-clmm(as.factor(score)~computation+(computation|school),
data=mathfinal, na.action=na.omit))
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 3.776e+01 6.14530
computation 7.261e−04 0.02695 −0.994
Number of groups: school 40
Coefficients:
Estimate Std. Error z value Pr(>|z|)
computation 0.080161 0.004652 17.23 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
1|2 15.880 1.046 15.19
2|3 19.581 1.055 18.56
(3453 observations deleted due to missingness)
summary(model8.6<-clmm(as.factor(score)~computation+L_Free+(1|school),
data=mathfinal, na.action=na.omit))
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.4028 0.6347
Number of groups: school 34
Coefficients:
Estimate Std. Error z value Pr(>|z|)
computation 0.074606 0.001698 43.940 <2e−16 ***
L_Free −0.007612 0.004099 −1.857 0.0633 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
1|2 14.2381 0.4288 33.21
2|3 17.8775 0.4545 39.33
Given that we have already discussed the results of the previous model in
some detail, we will not reiterate those basic ideas again. However, it is
important to note those aspects that are different here. Specifically, the
variability in the intercepts declined somewhat with the inclusion of the
school-level variable, L_Free. In addition, we see that there is not a
statistically significant relationship between the proportion of individuals
receiving free lunch in the school and the likelihood that an individual
student will obtain a higher achievement test score. Finally, a comparison of
the AIC values for the computation only model (13365.74) and the
computation and free lunch model (10080.25) shows that Model 8.6
provides a somewhat better fit to the data than does Model 8.5, given its
smaller AIC value. In other words, in terms of the model fit to the data, we
are better off including both free lunch and computation score when
modeling the three-level achievement outcome variable, even though
L_Free is not statistically significantly related to the outcome variable. Note
that the anova command is not available for models fit with clmm.
As of the writing of this book (September 2023), lme4 does not provide for
the fitting of multilevel ordinal logistic regression models. Therefore, the
clmm function within the ordinal package represents perhaps the most
straightforward mechanism for fitting such models, albeit with its own
limitations. As can be seen above, the basic fitting of these models is not
complex, and indeed the syntax is similar to that of lme4. In addition, the
ordinal package allows for the fitting of ordered outcome variables in the
non-multilevel context (see the clm function), and for multinomial outcome
variables (see the clmm2 function, discussed next). As such, it represents
another method available for fitting such models in a unified framework.
context, both conceptually and using R with the appropriate packages. In the
following sections, we will demonstrate analysis of multilevel count data
outcomes in the context of Poisson regression, quasipoisson regression, and
negative binomial regression in R. The example to be used involves the
number of cardiac warning incidents (e.g., chest pain, shortness of breath,
dizzy spells) for 1000 patients associated with 110 cardiac rehabilitation
facilities in a large state over a 6-month period. Patients who had recently
suffered from a heart attack and who were entering rehabilitation agreed to
be randomly assigned to either a new exercise treatment program or to the
standard treatment protocol. Of particular interest to the researcher heading
up this study is the relationship between treatment condition and the number
of cardiac warning incidents. The new approach to rehabilitation is expected
to result in fewer such incidents as compared to the traditional method. In
addition, the researcher has collected data on the sex of the patients, and the
number of hours that each rehabilitation facility is open during the week. This
latter variable is of interest as it reflects the overall availability of the
rehabilitation programs. The new method of conducting cardiac rehabilita-
tion is coded in the data as 1, while the standard approach is coded as 0. Males
are also coded as 1, while females are assigned a value of 0.
summary(model8.7<-glmer(heart~trt+sex+(1|rehab),family=poisson,
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
−6.190 −1.669 −0.879 0.747 43.449
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 1.221 1.105
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.03594 0.11044 9.380 <2e−16 ***
trt −0.45548 0.03389 -13.441 <2e−16 ***
sex 0.14121 0.01623 8.702 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In terms of the function call, the syntax for Model 8.7 is virtually identical to
that used for the dichotomous logistic regression model. The dependent and
independent variables are linked in the usual way that we have seen in
R: heart~trt+sex. Here, the outcome variable is heart, which reflects
the frequency of the warning signs for heart problems that we described
above. The independent variables are treatment (trt) and sex of the
individual, while the specific rehabilitation facility is contained in the
variable rehab. In this model, we are fitting a random intercept only, with
no random slope and no rehabilitation center-level variables.
The results of the analysis indicate that there is variation among the
intercepts from rehabilitation facility to rehabilitation facility, with a
variance of 1.221. As a reminder, the intercept reflects the mean frequency
of events, on the log scale, when (in this case) both of the independent
variables are 0, i.e., females in the control condition. The average intercept
across the 110 rehabilitation centers is 1.036, and the intercept variance of
1.221 indicates that this value differs from center to center. Put another way,
we can conclude that the mean number of cardiac warning signs varies across
rehabilitation centers, and that the average female in the control condition
will have approximately e^1.036 ≈ 2.82 such incidents over the course of 6 months. In
addition, these results reveal a statistically significant negative relationship
between heart and trt, and a statistically significant positive relationship
between heart and sex. Remember that the new treatment is coded as 1
and the control as 0, so that a negative relationship indicates that there are
fewer warning signs over 6 months for those in the treatment than those in
the control group. Also, given that males were coded as 1 and females as 0,
the positive slope for sex means that males have more warning signs on
average than do females.
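Because the Poisson model uses a log link, the coefficients are additive on the log scale, and fitted counts come from exponentiating the linear predictor. A sketch of the expected 6-month incident counts for the four treatment-by-sex cells, using the reported estimates and setting the random effect to 0 (an average facility):

```r
# Estimates reported in the Model 8.7 output
b0 <- 1.03594; b_trt <- -0.45548; b_sex <- 0.14121

# Expected counts: exp(intercept + trt*b_trt + sex*b_sex)
cells <- expand.grid(trt = 0:1, sex = 0:1)
cells$expected <- exp(b0 + cells$trt * b_trt + cells$sex * b_sex)
cells

# Rate ratios: the new treatment multiplies the expected count by about
# 0.63 (a 37% reduction), while being male multiplies it by about 1.15
exp(b_trt); exp(b_sex)
```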
We can calculate the 95% confidence intervals for the random and fixed
effects using the familiar confint command in R.
2.5% 97.5%
.sig01 0.9244246 1.3201511
(Intercept) 0.8779450 1.1649125
trt −0.5061985 −0.4167314
sex 0.1250211 0.1622336
Of particular interest is the interval for the intercept standard deviation
(.sig01), which is (0.92, 1.32). Because the interval excludes 0, we can
conclude that the intercepts vary across the rehab centers.
summary(model8.8<-glmer(heart~trt+sex+(trt|rehab),family=poisson,
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
−6.524 −1.568 −0.729 0.609 34.523
Random effects:
Groups Name Variance Std.Dev. Corr
rehab (Intercept) 1.857 1.363
trt 1.862 1.365 −0.61
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.70822 0.13950 5.077 3.84e−07 ***
trt −0.12705 0.14809 −0.858 0.391
sex 0.10792 0.01741 6.198 5.70e−10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The syntax for the inclusion of random slopes in the model is identical to
that used with logistic regression and thus will not be commented on
further here. The random effect for slopes across rehabilitation centers was
estimated to be 1.862, indicating that there is some differential center effect
to the impact of treatment on the number of cardiac warning signs
experienced by patients. Indeed, the variance for the random slopes is
approximately the same magnitude as the variance for the random
intercepts, indicating that these two random effects are quite comparable
in magnitude. The correlation of the random slope and intercept model
components is fairly large and negative (−0.61), meaning that the greater the
number of cardiac events in a rehab center, the lower the impact of the
treatment on the number of such events. The average slope for treatment
across centers was no longer statistically significant, indicating that when
we account for the random coefficient effect for treatment, the treatment
effect itself goes away.
The confidence intervals for this model appear below.
All of the random effects were statistically significant (0 lies outside each
interval), meaning that the intercepts and treatment coefficients differ from
0 in the population, and the correlation between the two is also different
from 0.
As with the logistic regression, we can compare the fit of the two models
using both information indices, and a likelihood ratio test.
anova(model8.7,model8.8)
Data: rehab_data
Models:
model8.7: heart ~ trt + sex + (1 | rehab)
model8.8: heart ~ trt + sex + (trt | rehab)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
model8.7 4 11535 11554 −5763.4 11527
summary(model8.9<-glmer(heart~trt+sex+hours+(1|rehab),family=poisson,
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
−6.190 −1.669 −0.880 0.748 43.438
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 1.154 1.074
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.01105 0.10847 9.321 <2e−16 ***
trt −0.45545 0.03389 −13.440 <2e−16 ***
sex 0.14118 0.01623 8.701 <2e−16 ***
hours 0.24256 0.11171 2.171 0.0299 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results show that the more hours a center is open, the more
warning signs patients who attend will experience over a 6-month
period. In other respects, the parameter estimates for Model 8.9 do
not differ substantially from those of the earlier models, generally
revealing similar relationships among the independent and dependent
variables.
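On the log scale, the hours coefficient implies a multiplicative effect on the expected count. A rough sketch, with the caveat that the metric of hours in these data (raw weekly hours versus a rescaled version) is assumed here, so the size of the per-unit effect is illustrative:

```r
b_hours <- 0.24256; se_hours <- 0.11171   # reported estimate and SE

# Each one-unit increase in hours multiplies the expected number of
# cardiac warning signs by about 1.27, i.e., roughly a 27% increase
exp(b_hours)

# 95% Wald interval on the rate-ratio scale
exp(b_hours + c(-1, 1) * qnorm(0.975) * se_hours)   # about (1.02, 1.59)
```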
We can also make comparisons among the various models in order to
determine which yields the best fit to our data. Given that the AIC and BIC
values for Model 8.8 are lower than those of either of the other two models,
we would conclude that Model 8.8 yields the best fit to the data.
In addition, following are the results for the likelihood ratio tests comparing
these models with one another.
anova(model8.8,model8.9)
Data: rehab_data
Models:
model8.9: heart ~ trt + sex + hours + (1 | rehab)
model8.8: heart ~ trt + sex + (trt | rehab)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
model8.9 5 11532 11557 −5761.1 11522
model8.8 6 10612 10641 −5300.0 10600 922.33 1 < 2.2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model8.7,model8.9)
Data: rehab_data
Models:
model8.7: heart ~ trt + sex + (1 | rehab)
model8.9: heart ~ trt + sex + hours + (1 | rehab)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
model8.7 4 11535 11554 −5763.4 11527
model8.9 5 11532 11557 −5761.1 11522 4.565 1 0.03263 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results show that Model 8.8 fits the data significantly better than
Model 8.9, which in turn fits the data significantly better than Model 8.7.
Earlier, we found that Model 8.8 also fit the data better than Model 8.7,
based on the likelihood ratio test results and AIC/BIC values. Thus, given
all of these results, we would conclude that Model 8.8 provides the best fit
to the data, from among the three that we tried here.
Recall that the defining characteristic of the Poisson distribution is the equality
of the mean and variance. In some instances, however, the variance of a
variable may be larger than the mean, leading to the problem of
overdispersion, which we described in Chapter 7. In the previous chapter,
we described alternative statistical models for such situations, including
one based on the quasipoisson distribution, which took the same form as
the Poisson, except that it relaxed the requirement of equal mean and
variance. It is possible to fit the quasipoisson distribution in the multilevel
modeling context as well, though not using lme4. The developer of lme4
is not confident in the quasipoisson fitting algorithm, and has thus
removed this functionality from lme4, though alternative estimators for
overdispersed data are available using lme4. Instead, we would need to
use the glmmPQL function from the MASS package. In this case, we would
use the following syntax for the random intercept model with the
quasipoisson estimator.
library(MASS)
summary(model8.10<-glmmPQL(heart~trt+sex, random=~1|rehab,
family=quasipoisson, data=rehab_data))
Random effects:
Formula: ~1 | rehab
(Intercept) Residual
StdDev: 0.6568469 3.970552
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: heart ~ trt + sex
Value Std.Error DF t-value p-value
(Intercept) 1.4731283 0.10644847 888 13.838886 0.0000
trt −0.4984139 0.13231951 888 −3.766745 0.0002
sex 0.1409343 0.06240835 888 2.258261 0.0242
Correlation:
(Intr) trt
trt −0.468
sex −0.050 −0.065
summary(model8.11<-glmer.nb(heart~trt+sex+(1|rehab),
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
−0.4245 −0.4166 −0.4091 0.1201 10.7683
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 0.2084 0.4565
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.43389 0.15563 9.214 < 2e−16 ***
trt −0.53731 0.17476 −3.074 0.00211 **
The function call includes the standard model setup in R for the fixed effects
(trt, sex), with the random effect (intercept within rehab center in this
example) denoted in the same way as for the glmer-based models. In terms
of the output, after the
function call, we see the table of parameter estimates, standard errors, test
statistics, and p-values. These results are similar to those described above,
indicating the significant relationships between the frequency of cardiac
warning signs and both treatment and sex. The variance associated with the
random effect was estimated to be 0.2084. The findings with respect to the
fixed effects are essentially the same as those for the standard Poisson
regression model, with a statistically significant negative relationship
between treatment and the number of cardiac events, and a significant
positive relationship for sex. The 95% profile confidence intervals for the
fixed and random effects in Model 8.11 appear below.
confint(model8.11, method=c("profile"))
From these results, we can see that 0 is not included in any of the intervals,
meaning that they are all statistically significant.
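Negative binomial coefficients are interpreted just as in the Poisson model, so exponentiating the treatment estimate gives an incident rate ratio. A sketch using the reported estimate and standard error; the Wald interval below is a rough complement to the profile intervals used in the text, not a reproduction of them:

```r
b_trt <- -0.53731; se_trt <- 0.17476   # reported estimate and SE for trt

# Rate ratio: the new treatment is associated with roughly a 42%
# reduction in the expected number of cardiac warning signs
exp(b_trt)                                      # about 0.58

# 95% Wald interval on the rate-ratio scale
exp(b_trt + c(-1, 1) * qnorm(0.975) * se_trt)   # about (0.41, 0.82)
```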
As with the Poisson regression model, it is possible to fit a random
coefficients model for the negative binomial, using very similar R syntax as
that for glmer. In this case, we will fit a random coefficient for the trt
variable, as we did for the Poisson regression model.
summary(model8.12<-glmer.nb(heart~trt+sex+(trt|rehab),
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
−0.4272 −0.4180 −0.4126 0.1379 8.7318
Random effects:
Groups Name Variance Std.Dev. Corr
rehab (Intercept) 0.3763 0.6134
trt 0.1116 0.3340 -1.00
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3216 0.1711 7.723 1.13e-14 ***
trt −0.3663 0.2011 −1.822 0.0685 .
sex 0.2144 0.0932 2.301 0.0214 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variance estimate for the random trt effect is 0.1116, as compared to
the larger random intercept variance estimate of 0.3763. This result suggests
that the differences in the mean number of cardiac events across rehab
centers are greater than the cross center differences in the treatment effect
on the number of events. The 95% percentile bootstrap confidence intervals
for the fixed and random effects appear below. Note that the profile
confidence interval approach did not converge, and thus wasn’t used here.
The intervals for the random intercept (.sig01), the random coefficient
(.sig03), the fixed intercept, and sex all excluded 0, meaning that these
terms can be seen as statistically significant. However, intervals for the
correlation between the random effects (.sig02), and the fixed treatment
effect (trt), all included 0. Thus, we would conclude that there is not a
statistically significant relationship between treatment condition and the
number of cardiac events, when the random treatment effect is included in
the model. This conclusion based on the confidence interval mirrors the
result of the hypothesis test in the original output for Model 8.12. There is,
however, a difference in the treatment effect across the rehab centers, given
the statistically significant random coefficient effect.
As with other models that we have examined in this book, it is possible to
include level-2 independent variables, such as the number of hours the
centers are open, in the model and compare the relative fit using the relative
fit indices, as in Model 8.13.
summary(model8.13<-glmer.nb(heart~trt+sex+hours
+(1|rehab), data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
−0.4232 −0.4161 −0.4085 0.1299 10.5007
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 0.1331 0.3649
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4162 0.1507 9.394 < 2e-16 ***
As we have seen previously, the number of hours that the centers are open
is significantly positively related to the number of cardiac warning signs
over the 6-month period of the study. The 95% profile confidence intervals
for the random and fixed effects in Model 8.13 appear below. The interval
for the random intercept effect
(.sig01) includes 0, leading us to conclude that under the negative
binomial model when the number of hours for which the center is open
is included in the model, there is not a statistically significant difference
among the rehab centers in terms of the number of cardiac events.
confint(model8.13, method=c("profile"))
Now that we have examined the random and fixed effects for each of the
three negative binomial models, we can compare their fit with one another
in order to determine which is optimal given the set of data at hand. We can
formally compare these three models using the likelihood ratio test, as
discussed next. In addition, we will compare the AIC and BIC values to
provide further evidence regarding the relative model fit.
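The AIC and BIC values reported by anova() follow directly from the model log-likelihood. As a quick check of the arithmetic (sketched in Python rather than R, using the values reported for Model 8.11):

```python
import math

# Values reported for model8.11: log-likelihood, parameter count, sample size
log_lik = -1964.2
n_params = 5
n_obs = 1000

aic = -2 * log_lik + 2 * n_params                # penalty grows with parameter count
bic = -2 * log_lik + n_params * math.log(n_obs)  # heavier penalty as n grows

print(round(aic, 1))  # 3938.4, matching the anova() table
print(round(bic, 1))  # 3962.9
```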
anova(model8.11, model8.12)
Data: rehab_data
Models:
model8.11: heart ~ trt + sex + (1 | rehab)
anova(model8.11, model8.13)
Data: rehab_data
Models:
model8.11: heart ~ trt + sex + (1 | rehab)
model8.13: heart ~ trt + sex + hours + (1 | rehab)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
model8.11 5 3938.4 3962.9 −1964.2 3928.4
model8.13 6 3933.1 3962.5 −1960.5 3921.1 7.3128 1 0.006846 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First, the results of the likelihood ratio test show that the fit of Model 8.12 was
not significantly different from that of Model 8.11, meaning that inclusion of
the random treatment effect did not improve model fit. This finding is further
reinforced by the lower AIC and BIC values for Model 8.11. Next, we
compared the fit of models 8.11 and 8.13. Here, we see a statistically
significant difference in the model fit, with slightly lower AIC and BIC values
associated with Model 8.13, which included the number of hours that the
rehab centers were open. Thus, we would conclude that having this variable
in the model is associated with a better fit to the data. Finally, Models 8.12 and
8.13 are not nested within one another, and thus cannot be compared using
the likelihood ratio test. However, the AIC and BIC values for Model 8.13
were smaller than those of Model 8.12, suggesting that the inclusion of the
rehab center hours is more important to yielding a good fit to the data than is
inclusion of the random treatment condition effect.
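The likelihood ratio statistic in the anova() output can also be reproduced by hand from the reported deviances; the sketch below (Python rather than R, purely to show the arithmetic) uses the values reported for Models 8.11 and 8.13.

```python
import math

# Deviances reported by anova() for the nested pair
dev_reduced = 3928.4  # model8.11, without hours
dev_full = 3921.1     # model8.13, with hours; one extra parameter, so df = 1

chisq = dev_reduced - dev_full  # likelihood ratio statistic

# Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chisq / 2))

print(round(chisq, 1))    # 7.3
print(round(p_value, 4))  # about 0.0069; anova() reports 0.006846 from the unrounded statistic
```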
Summary
In this chapter, we learned that the generalized linear models featured in
Chapter 7, which accommodate categorical dependent variables, can be
easily extended to the multilevel context. Indeed, the basic concepts that we
learned in Chapter 2 regarding sources of variation and various types of
models can be easily extended for categorical outcomes. In addition, R
provides for easy fitting of such models through the glmer family of
functions in the lme4 package. Therefore, in many ways, this chapter represents a review of
material that by now should be familiar to us, even while applied in a
different scenario than we have seen up to now. Perhaps the most important
point to take away from this chapter is the notion that modeling multilevel
data in the context of generalized linear models is not radically different
from the normally distributed continuous dependent variable case, so that
the same types of interpretations can be made, and the same types of data
structure can be accommodated.
9
Bayesian Multilevel Modeling
library(MCMCglmm)
prime_time.nomiss<-na.omit(prime_time)
attach(prime_time.nomiss)
model9.1<-MCMCglmm(geread~gevocab, random=~school,
data=prime_time.nomiss)
plot(model9.1)
summary(model9.1)
The function call for MCMCglmm is fairly similar to what we have seen in
previous chapters. One important point to note is that MCMCglmm does not
accommodate the presence of missing data. Therefore, before conducting
the analysis, we needed to expunge all of the observations with missing
data. We created a dataset with no missing observations using the
command prime_time.nomiss<-na.omit(prime_time), which cre-
ated a new dataframe called prime_time.nomiss containing no missing
data. We then attached this dataframe and fit the multilevel model,
indicating the random effect with the random=~ statement. We subse-
quently requested a summary of the results and a plot of the relevant
graphs that will be used in determining whether the Bayesian model has
converged properly. It is important to note that by default, MCMCglmm uses
13,000 iterations of the MCMC algorithm, with a burn-in of 3000 and
thinning of 10. As we will see next, we can easily adjust these settings to
best suit our specific analysis problem.
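The posterior sample size follows from these three settings as (iterations − burn-in) / thinning; a quick check of the arithmetic (in Python rather than R):

```python
def retained_draws(n_iter, burn_in, thin):
    """Number of posterior draws kept after discarding burn-in and thinning."""
    return (n_iter - burn_in) // thin

# MCMCglmm defaults: 13,000 iterations, burn-in of 3,000, thinning of 10
print(retained_draws(13_000, 3_000, 10))    # 1000, the 'Sample size' in the output

# The alternative settings used for model9.1b later in the chapter
print(retained_draws(100_000, 10_000, 50))  # 1800
```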
When interpreting the results of the Bayesian analysis, we first want to
know whether we can be confident in the quality of the parameter estimates
for both the fixed and random effects. The plots relevant to this diagnosis
appear below.
[Trace plots (left) and posterior density plots (right) for the Model 9.1
parameters, including the intercept, the school variance, and the residual
(units).]
For each model parameter, we have the trace plot on the left, showing
the entire set of parameter estimates as a time series across the 13,000
iterations. On the right, we have a histogram of the distribution of
parameter estimates. Our purpose for examining these plots is to ascertain
to what extent the estimates have converged on a single value. As an
example, the first pair of graphs reflects the parameter estimates for the
intercept. For the trace, convergence is indicated when the time series plot
hovers around a single value on the y-axis and does not meander up and
down. In this case, it is clear that the trace plot for the intercept shows
convergence. This conclusion is reinforced by the histogram for the
estimate, which is clearly centered over a single mean value, with no
bimodal tendencies. We see similar results for the coefficient of vocabu-
lary, the random effect of school, and the residual. Given that the
parameter estimates appear to have successfully converged, we can
have confidence in the actual estimated values, which we will examine
shortly.
Prior to looking at the parameter estimates, we want to assess the
autocorrelation of the estimates in the time series for each parameter.
Our purpose here is to ensure that the rate of thinning (taking every
10th observation generated by the MCMC algorithm) that we used is
sufficient to ensure that any autocorrelation in the estimates is elimi-
nated. In order to obtain the autocorrelations for the random effects, we
use the command autocorr(model9.1$VCV), and obtain the fol-
lowing results.
, , school
school units
Lag 0 1.00000000 −0.05486644
Lag 10 −0.03926722 −0.03504799
Lag 50 −0.01636431 −0.04016879
Lag 100 −0.03545104 0.01987726
Lag 500 0.04274662 −0.05083669
, , units
school units
Lag 0 −0.0548664421 1.000000000
Lag 10 −0.0280445140 −0.006663408
Lag 50 −0.0098424151 0.017031804
Lag 100 0.0002654196 0.010154987
Lag 500 −0.0022835508 0.046769152
We read this table as follows: In the first section, we see results for the
random effect school. This output includes correlations involving
the school variance component estimates. Under the school column
are the actual autocorrelations for the school random effect estimate.
Under the units column are the cross-correlations between estimates for
the school random effect and the residual random effect, at different lags.
Thus, for example, the correlation between the estimates for school and
the residual with no lag is −0.0549. The correlation between the school
estimate ten lags prior and the current residual estimate is −0.035. In terms of
ascertaining whether our rate of thinning is sufficient, the more important
numbers are in the school column, where we see the correlation between a
given school effect estimate and the school effect estimate 10, 50, 100, and
500 estimates before. The autocorrelation at a lag value of 10, −0.0393, is
sufficiently small for us to have confidence in our thinning the results at 10.
We would reach a similar conclusion regarding the autocorrelation of the
residual (units), such that 10 appears to be a reasonable thinning value for
it as well. We can obtain the autocorrelations of the fixed effects using the
command autocorr(model9.1$Sol). Once again, it is clear that there is
essentially no autocorrelation as far out as a lag of 10, indicating that the
default thinning value of 10 is sufficient for both the intercept and the
vocabulary test score.
, , (Intercept)
(Intercept) gevocab
Lag 0 1.000000000 −0.757915532
Lag 10 −0.002544175 −0.013266125
Lag 50 −0.019405970 0.007370979
Lag 100 −0.054852949 0.029253018
, , gevocab
(Intercept) gevocab
Lag 0 −0.757915532 1.000000000
Lag 10 0.008583659 0.020942660
Lag 50 −0.001197203 −0.002538901
Lag 100 0.047596351 −0.022549594
Lag 500 −0.057219532 0.026075911
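The logic of this check can be illustrated with a small simulation, sketched in Python rather than R and using a synthetic chain (not MCMCglmm output): thinning a strongly autocorrelated chain leaves draws that are nearly uncorrelated.

```python
import random

def lag_autocorr(x, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

random.seed(1)
chain, value = [], 0.0
for _ in range(5000):
    value = 0.6 * value + random.gauss(0, 1)  # AR(1): neighboring draws correlate
    chain.append(value)

thinned = chain[::10]  # keep every 10th draw, as MCMCglmm does by default

print(round(lag_autocorr(chain, 1), 2))    # sizeable for the raw chain
print(round(lag_autocorr(thinned, 1), 2))  # near zero after thinning
```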
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 43074.14
G-structure: ~school
R-structure: ~units
We are first given information about the number of iterations, the thinning
interval, and the final number of MCMC values that were sampled (Sample
size) and used to estimate the model parameters. Next, we have the model
fit index, the DIC, which can be used for comparing various models and
selecting the one that provides optimal fit. The DIC is interpreted in much
the same fashion as the AIC and BIC, which we discussed in earlier
chapters, and for which smaller values indicate better model fit. We are then
provided with the posterior mean of the distribution for each of the random
effects, school and residual, which MCMCglmm refers to as units. The mean
variance estimate for the school random effect is 0.09962, with a 95%
credibility interval of 0.06991 to 0.1419. Remember that we interpret
credibility intervals in Bayesian modeling in much the same way that we
interpret confidence intervals in frequentist modeling. This result indicates
that reading achievement scores do differ across schools, because 0 is not in
the interval. Similarly, the residual variance also differs from 0. With regard
to the fixed effect of vocabulary score, which had a mean posterior value of
0.5131, we also conclude that the results are statistically significant, given
that 0 is not in its 95% credibility interval. We also have a p-value for this
effect, and the intercept, both of which are significant with values less than
0.05. The positive value of the posterior mean indicates that students with
higher vocabulary scores also had higher reading scores.
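A percentile-based credible interval is simply a pair of quantiles of the posterior draws. The sketch below (Python rather than R, using a toy set of draws rather than real MCMCglmm output) shows the computation:

```python
def credible_interval(draws, level=0.95):
    """Lower and upper percentile bounds of a posterior sample."""
    s = sorted(draws)
    tail = (1 - level) / 2
    lo = s[round(tail * (len(s) - 1))]
    hi = s[round((1 - tail) * (len(s) - 1))]
    return lo, hi

# Toy 'posterior': 1001 evenly spaced draws between 0 and 1
draws = [i / 1000 for i in range(1001)]
lo, hi = credible_interval(draws)
print(lo, hi)  # 0.025 0.975; an interval excluding 0 indicates significance
```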
In order to demonstrate how we can change the number of iterations, the
burn-in period, and the rate of thinning in R, we will reestimate Model 9.1
with 100,000 iterations, a burn-in of 10,000, and a thinning rate of 50. This
will yield 1,800 samples for the purposes of estimating the posterior
distribution for each model parameter. The R commands for fitting this
model, followed by the relevant output, appear below.
model9.1b<-MCMCglmm(geread~gevocab, random=~school,
data=prime_time.nomiss, nitt=100000, thin=50, burnin=10000)
plot(model9.1b)
summary(model9.1b)
[Trace plots (left) and posterior density plots (right) for the Model 9.1b
parameters.]
As with the initial model, all parameter estimates appear to have success-
fully converged. The results, in terms of the posterior means, are also very
similar to what we obtained using the default values for the number of
iterations, the burn-in period, and the thinning rate. This result is not
surprising, given that the diagnostic information for our initial model was
all very positive. Nonetheless, it was useful for us to see how the default
values can be changed if we need to do so.
Iterations = 10001:99951
Thinning interval = 50
Sample size = 1800
DIC: 43074.19
G-structure: ~school
R-structure: ~units
model9.2<-MCMCglmm(geread~gevocab+senroll, random=~school,
data=prime_time.nomiss)
plot(model9.2)
[Trace plots and posterior density plots for the Model 9.2 parameters.]
autocorr(model9.2$VCV)
, , school
school units
Lag 0 1.000000000 −0.05429139
Lag 10 −0.002457293 −0.07661475
Lag 50 −0.020781555 −0.01761532
Lag 100 −0.027670953 0.01655270
Lag 500 0.035838857 −0.03714127
, , units
school units
Lag 0 −0.05429139 1.000000000
Lag 10 0.03284220 −0.004188523
Lag 50 0.02396060 −0.043733590
Lag 100 −0.04543941 −0.017212479
Lag 500 −0.01812893 0.067148463
autocorr(model9.2$Sol)
, , (Intercept)
, , gevocab
, , senroll
The summary results for the model with 40,000 iterations and a thinning
rate of 100 appear below. It should be noted that the trace plots and
histograms of parameter estimates for Model 9.2 indicated that conver-
gence had been attained. From these results, we can see that the overall fit,
based on the DIC, is virtually identical to that of the model not including
senroll. In addition, the posterior mean estimate and associated 95%
credible interval for this parameter show that senroll was not statistically
significantly related to reading achievement, i.e., 0 is in the interval. Taken
together, we would conclude that school size does not contribute signifi-
cantly to the variation in reading achievement scores, nor to the overall fit of
the model.
summary(model9.2)
Iterations = 3001:39901
Thinning interval = 100
Sample size = 1700
DIC: 43074.86
G-structure: ~school
R-structure: ~units
model9.3<-MCMCglmm(geread~gevocab, random=~school+gevocab,
data=prime_time.nomiss)
plot(model9.3)
summary(model9.3)
[Trace plots (left) and posterior density plots (right) for the Model 9.3
parameters.]
autocorr(model9.3$VCV)
, , school
, , gevocab
, , units
autocorr(model9.3$Sol)
, , (Intercept)
(Intercept) gevocab
Lag 0 1.00000000 −0.86375013
Lag 10 −0.01675809 0.01808335
Lag 50 −0.01334607 0.03583885
Lag 100 0.02850369 −0.01102134
Lag 500 0.03392102 −0.04280691
, , gevocab
(Intercept) gevocab
Lag 0 −0.863750126 1.0000000000
Lag 10 0.008428317 0.0008246964
Lag 50 0.007928161 −0.0470879801
Lag 100 −0.029552813 0.0237866610
Lag 500 −0.029554289 0.0425010354
The trace plots and histograms, as well as the autocorrelations, indicate that
the parameter estimation has converged properly and that the thinning rate
appears to be satisfactory for removing autocorrelation from the estimate
values. The model results appear below. First, we should note that the DIC
for this random coefficients model is smaller than that of the random
intercepts only models above. In addition, the estimate of the random
coefficient for vocabulary is 0.2092, with a 95% credible interval of
0.135–0.3025. Because this interval does not include 0, we can conclude
that the random coefficient is indeed different from 0 in the population, and
that the relationship between reading achievement and vocabulary test
score varies from one school to another.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 42663.14
G-structure: ~school
~gevocab
R-structure: ~units
User-Defined Priors
Finally, we need to consider the situation in which the user would like to
provide her or his own prior distribution information, rather than rely on
the defaults established in MCMCglmm. To do so, we will make use of the
prior argument. In this example, we examine the case where the
researcher has informative priors for one of the model parameters. As an
example, let us assume that a number of studies in the literature report
finding a small but consistent positive relationship between reading
achievement and a measure of working memory. In order to incorporate
this informative prior into a model relating these two variables, while also
including the vocabulary score, and accommodating the random coefficient
for this variable, we would first need to define our prior, as discussed next.
Step one in this process is to create the covariance matrix (var) containing
the prior of the fixed effects in the model (intercept and memory). In this
case, we set the prior variances of the intercept and the coefficient for
memory to 1 and 0.1, respectively. We select a fairly small variance for
the working memory coefficient because we have much prior evidence
in the literature regarding the anticipated magnitude of this relationship. In
addition, we will need to set the priors for the error (R) and random
intercept terms (G). The inverse Wishart distribution variance structure is
used, and here we set the value at 1 with a certainty parameter of
nu=0.002.
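The resulting prior therefore has three pieces: B for the fixed effects, G for the random intercept, and R for the residual. As a hedged sketch of its structure (written as a Python dictionary purely for illustration; in R this would be a nested list passed to MCMCglmm's prior argument, and the prior means shown here are hypothetical, since the excerpt does not list them):

```python
# Mirror of MCMCglmm's documented prior structure: B (fixed), G (random), R (residual)
prior = {
    "B": {
        "mu": [0, 0.1],          # hypothetical prior means for intercept and memory slope
        "V": [[1.0, 0.0],        # prior variance of 1 for the intercept
              [0.0, 0.1]],       # small variance of 0.1 for the memory coefficient
    },
    "G": {"G1": {"V": 1, "nu": 0.002}},  # random intercept: inverse Wishart, nu = 0.002
    "R": {"V": 1, "nu": 0.002},          # residual: inverse Wishart, nu = 0.002
}

print(prior["B"]["V"][1][1])  # 0.1
```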
autocorr(model9.4$VCV)
autocorr(model9.4$Sol)
summary(model9.4)
autocorr(model9.4$VCV)
, , school
school units
Lag 0 1.000000000 −0.019548306
Lag 10 0.046970940 −0.001470008
Lag 50 −0.014670119 −0.051306845
Lag 100 0.020042317 0.013675599
Lag 500 −0.005250327 0.028681171
, , units
school units
Lag 0 −0.01954831 1.00000000
Lag 10 −0.01856012 0.03487098
Lag 50 0.05694637 0.01137949
[Trace plots (left) and posterior density plots (right) for the Model 9.4
parameters: the intercept, npamem, the school variance, and the residual
(units).]
autocorr(model9.4$Sol)
, , (Intercept)
(Intercept) npamem
Lag 0 1.00000000 −0.640686295
Lag 10 0.02871714 0.004803420
Lag 50 −0.03531602 0.011157302
Lag 100 0.01483541 −0.040542950
Lag 500 -0.01551577 0.006384573
, , npamem
(Intercept) npamem
Lag 0 −0.6406862955 1.000000000
Lag 10 −0.0335729209 −0.022385089
Lag 50 0.0229034652 −0.002681217
Lag 100 0.0007594231 0.008694124
Lag 500 0.0311681203 −0.015291965
The summary of the model fit results appears below. Of particular interest is
the coefficient for the fixed effect working memory (npamem). The posterior
mean is 0.01266, with a credible interval ranging from 0.01221 to 0.01447,
indicating that the relationship between working memory and reading
achievement is statistically significant. It is important to note, however, that
the estimate of this relationship for the current sample is well below that
reported in prior research, and which was incorporated into the prior
distribution. In this case, because the sample is so large, the effect of the
prior on the posterior distribution is very small. The impact of the prior
would be much greater were we working with a smaller sample.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 45908.05
G-structure: ~school
R-structure: ~units
As a point of comparison, we also fit the model using the default priors in
MCMCglmm to see what impact the informative priors had on the posterior
distribution. We will focus only on the coefficients for this demonstration,
given that they are the focus of the informative priors. For the default priors,
we obtained the following results.
mathfinal.nomiss<-na.omit(mathfinal)
model9.5<-MCMCglmm(score2~numsense, random=~school,
family="ordinal", data=mathfinal.nomiss)
plot(model9.5)
autocorr(model9.5$VCV)
autocorr(model9.5$Sol)
summary(model9.5)
The default prior parameters are used, and the family is defined as ordinal. In
other respects, the function call is identical to that for the continuous outcome
variables that were the focus of the earlier part of this chapter. The output
from R appears below. From the trace plots and histograms, we can see that
convergence was achieved for each of the model parameters, and the
autocorrelations show that our rate of thinning is sufficient.
, , school
school units
Lag 0 1.000000000 0.24070410
Lag 10 0.016565749 0.02285168
Lag 50 0.012622856 0.02073446
Lag 100 0.007855806 0.02231629
Lag 500 0.007233911 0.01822021
[Trace plots (left) and posterior density plots (right) for the Model 9.5
parameters, including the intercept, the predictor coefficient, the school
variance, and the residual (units).]
, , units
school units
Lag 0 0.24070410 1.00000000
Lag 10 0.02374442 0.00979023
Lag 50 0.02015865 0.00917857
Lag 100 0.01965188 0.00849276
Lag 500 0.01361470 0.00459030
, , (Intercept)
(Intercept) numsense
Lag 0 1.00000000 −0.09818969
Lag 10 0.00862290 −0.00878574
Lag 50 0.00688767 −0.00707115
Lag 100 0.00580816 −0.00603118
Lag 500 0.00300539 −0.00314349
, , numsense
(Intercept) numsense
Lag 0 −0.09818969 1.0000000
Lag 10 −0.00876214 0.00894084
Lag 50 −0.00704441 0.00723130
Lag 100 −0.00594502 0.00618679
Lag 500 −0.00315547 0.00328528
In terms of model parameter estimation results, the number sense score was
found to be statistically significantly related to whether or not a student received
a passing score on the state mathematics assessment. The posterior mean for the
coefficient is 0.04544, indicating that the higher an individual's number sense
score, the greater the likelihood that (s)he will pass the state assessment.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 6333.728
G-structure: ~school
R-structure: ~units
Exactly the same command sequence that was used here would also be used
to fit a model for an ordinal variable with more than two categories.
attach(heartdata)
model9.6<-MCMCglmm(heart~trt+sex, random=~rehab,
family="poisson", data=heartdata)
plot(model9.6)
autocorr(model9.6$VCV)
autocorr(model9.6$Sol)
summary(model9.6)
, , rehab
rehab units
Lag 0 1.000000000 −0.0117468496
Lag 10 0.004869176 −0.0067184848
Lag 50 0.000957586 0.0009480950
Lag 100 0.009502289 0.0004500062
Lag 500 −0.009067234 0.0028298115
, , units
rehab units
Lag 0 -0.00117468 1.00000000000
Lag 10 -0.00076201 0.00425938977
Lag 50 0.00013997 0.00065406398
Lag 100 0.00035229 0.00079448090
Lag 500 0.00024450 0.00011262469
, , (Intercept)
, , trt
, , sex
An examination of the trace plots and histograms shows that the parameter
estimation converged appropriately. In addition, the autocorrelations are
sufficiently small for each of the parameters so that we can have confidence
in our rate of thinning. Therefore, we can move to a discussion of the model
parameter estimates, which appear below.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 2735.293
G-structure: ~rehab
R-structure: ~units
In terms of the primary research question, the results indicate that the
frequency of cardiac risk signs was lower among those in the treatment
condition than those in the control, when accounting for the participants’
sex. In addition, there was a statistically significant difference in the rate of
risk symptoms between males and females. With respect to the random
effects, the variance in the outcome variable due to rehabilitation facility, as
well as the residual, were both significantly different from 0. The posterior
mean effect of the rehab facility was 0.5414, with a 95% credibility interval
of 0.1022 to 1.009. This result indicates that symptom frequency does differ
among the facilities.
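Because the Poisson model uses a log link, a coefficient b corresponds to a multiplicative effect of exp(b) on the expected symptom rate. A small sketch of that conversion (Python rather than R; the coefficient value is hypothetical, since the excerpted output omits the fixed effect estimates):

```python
import math

b_trt = -0.37  # hypothetical treatment coefficient on the log scale

rate_ratio = math.exp(b_trt)
print(round(rate_ratio, 2))  # 0.69: roughly 31% fewer expected events under treatment
```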
We may also be interested in examining a somewhat more complex
explanation of the impact of treatment on the rate of cardiac symptoms. For
instance, there is evidence from previous research that the number of hours
the facilities are open may impact the frequency of cardiac symptoms, by
providing more, or less, opportunity for patients to make use of their
services. In turn, if more participation in rehabilitation activities is
associated with the frequency of cardiac risk symptoms, we might expect
the number of hours that a facility is open to be related to the number of
cardiac symptoms its patients experience.
model9.7<-MCMCglmm(heart~trt+sex+hours, random=~rehab+trt,
family="poisson", data=heartdata)
plot(model9.7)
autocorr(model9.7$VCV)
autocorr(model9.7$Sol)
summary(model9.7)
, , rehab
, , trt
, , units
, , trt
, , sex
, , hours
The trace plots and histograms reveal that estimation converged for each of
the parameters estimated in the analysis, and the autocorrelations of
estimates are small. Thus, we can move on to the interpretation of the
parameter estimates. The results of the model fitting revealed several
interesting patterns. First, the random coefficient term for treatment
was statistically significant, given that the credible interval ranged between
5.421 and 7.607, and did not include 0. Thus, we can conclude that the impact
of treatment on the number of cardiac symptoms differs from one rehabilita-
tion center to the next. In addition, the variance in the outcome due to
rehabilitation center was different from 0, given that the confidence interval
for rehab was 0.1953–1.06. Finally, treatment and sex were negatively
statistically significantly related to the number of cardiac symptoms, as
they were for Model 9.6, and the centers’ hours of operation were not related
to the number of cardiac symptoms.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 2731.221
G-structure: ~rehab
~trt
R-structure: ~units
Summary
The material presented in Chapter 9 represents a marked departure from
that presented in the first eight chapters of the book. In particular, methods
presented in the earlier chapters were built upon a foundation of ML
estimation. Bayesian modeling, which is the focus of Chapter 9, leaves
likelihood-based analyses behind, and instead relies on MCMC to derive a
posterior distribution for model parameters. More fundamentally, however,
Bayesian statistics is radically different from likelihood-based frequentist
statistics in terms of the way in which population parameters are estimated.
In the latter, parameter estimates take single values, whereas Bayesian methods estimate
population parameters as distributions, so that the sample-based estimate of
the parameter is the posterior distribution obtained using MCMC, and not a
single value calculated from the sample.
Beyond the more theoretical differences between the methods described
in Chapter 9 and those presented earlier, there are also very real differences
in terms of application. Analysts using Bayesian statistics are presented
with what is, in many respects, a more flexible modeling paradigm that
does not rest on the standard assumptions that must be met for the
successful use of ML such as normality. At the same time, this greater
flexibility comes at the cost of greater complexity in estimating the model.
From a very practical viewpoint, consider how much more is involved
when conducting the analyses featured in Chapter 9 as compared to those
presented in the earlier chapters of this book.
10
Multilevel Latent Variable Models
As with the other data analysis paradigms that we have described in this
book (e.g., linear models), latent variable models can also be framed in the
multilevel context. For example, researchers may collect quality-of-life data
from cancer patients visiting several different clinics, with an interest in
modeling the factor structure of these scales. Given that the data were
collected from clustered groups of individuals (based on clinic), the standard
approaches to modeling them (e.g., EFA or CFA), which assume indepen-
dence of the observations, must be adapted to account for the additional
structure in the data. Similarly, researchers may wish to identify a small
number of latent classes in the population based on the quality of life scale
responses. Again, using the standard method for fitting LCA models that is
based on an assumption of local independence would be inappropriate in this
case, because the observations are not independent of one another. However,
multilevel LCA modeling of such data is an option and can be effected
relatively easily using R, as we will see below. The purpose of this chapter is
to describe the fitting of multilevel latent variable models, including factor
analysis and LCA. Though we endeavor to present a wide array of such
models, we also recognize that in one chapter we will only be able to touch on
the main points. The interested reader is encouraged to investigate the more
advanced aspects of multilevel latent variable modeling through the refer-
ences that appear next.
In the multilevel context, the total covariance matrix of the indicator
variables (Σ) is decomposed into within-cluster and between-cluster
components:

Σ = Σ_W + Σ_B (10.2)

where Σ_W and Σ_B are the within-cluster and between-cluster covariance
matrices, respectively. Under the common factor model, the covariance
matrix of the indicators can be expressed in terms of the factor model
parameters:

Σ = ΛΨΛ′ + Θ (10.3)

where Λ is the matrix of factor loadings, Ψ is the covariance matrix of the
factors, and Θ is the diagonal matrix of unique (error) variances.
Likewise, the within and between covariance matrices for the indicators
shown in (10.2) can also be described using the appropriate (within or
between) factor model parameters. For example, Σ_W is a function of within-
cluster loadings, factor variances and covariances, and unique variances.

Σ_W = Λ_W Ψ_W Λ_W′ + Θ_W (10.4)
The terms in (10.4) are identical to those in (10.3), except that they are at the
within-cluster level. Similarly, we can express the between-cluster covariance
matrix in terms of between-cluster factor model parameters.

Σ_B = Λ_B Ψ_B Λ_B′ + Θ_B (10.5)
Together, equations (10.4) and (10.5) imply that equation (10.1) can be
extended to the multilevel framework by combining the within and
between portions of the model as in (10.6).

y_ij = ν_B + Λ_B η_Bj + Λ_W η_Wij + ε_Bj + ε_Wij (10.6)

The terms in (10.6) are as defined previously, with j indicating cluster (level-2)
membership. We can see that the level-1 intercepts (ν_W) are set to 0 because the
value of y for individual i in cluster j is expressed as a deviation from
the overall cluster mean for that indicator. In addition, the multilevel FA
model has associated with it factor loadings at both levels of the data. In other
words, there will be factor loadings at the cluster level (Λ_B), as well as at the
individual level (Λ_W). Likewise, factor scores are also estimated at both
levels, as are the error terms.
One additional aspect of multilevel FA modeling that is implied by
equations (10.4) and (10.5) is the decomposition of the total factor variance
(ψ) into the portion due to variation between clusters (ψ_B), and the portion
due to variation within clusters (ψ_W). The intraclass correlation (ICC) for the
indicators can be calculated using these variances as

ICC = ψ_B / (ψ_B + ψ_W) (10.7)
model10.1a<-'
level: 1
V1~~0*V2
V1~~0*V3
V1~~0*V4
V1~~0*V5
V1~~0*V6
V1~~0*V7
V1~~0*V8
V1~~0*V9
V1~~0*V10
V1~~0*V11
V1~~0*V12
V2~~0*V3
V2~~0*V4
V2~~0*V5
V2~~0*V6
V2~~0*V7
V2~~0*V8
V2~~0*V9
V2~~0*V10
V2~~0*V11
V2~~0*V12
V3~~0*V4
V3~~0*V5
V3~~0*V6
V3~~0*V7
V3~~0*V8
V3~~0*V9
V3~~0*V10
V3~~0*V11
V3~~0*V12
V4~~0*V5
V4~~0*V6
V4~~0*V7
V4~~0*V8
V4~~0*V9
V4~~0*V10
V4~~0*V11
V4~~0*V12
V5~~0*V6
V5~~0*V7
V5~~0*V8
V5~~0*V9
V5~~0*V10
V5~~0*V11
V5~~0*V12
V6~~0*V7
V6~~0*V8
V6~~0*V9
V6~~0*V10
V6~~0*V11
V6~~0*V12
V7~~0*V8
V7~~0*V9
V7~~0*V10
V7~~0*V11
V7~~0*V12
V8~~0*V9
V8~~0*V10
V8~~0*V11
V8~~0*V12
V9~~0*V10
V9~~0*V11
V9~~0*V12
V10~~0*V11
V10~~0*V12
V11~~0*V12
level: 2
V1~~V2
V1~~V3
V1~~V4
V1~~V5
V1~~V6
V1~~V7
V1~~V8
V1~~V9
V1~~V10
V1~~V11
V1~~V12
V2~~V3
V2~~V4
V2~~V5
V2~~V6
V2~~V7
V2~~V8
V2~~V9
V2~~V10
V2~~V11
V2~~V12
V3~~V4
V3~~V5
V3~~V6
V3~~V7
V3~~V8
V3~~V9
V3~~V10
V3~~V11
V3~~V12
V4~~V5
V4~~V6
V4~~V7
V4~~V8
V4~~V9
V4~~V10
V4~~V11
V4~~V12
V5~~V6
V5~~V7
V5~~V8
V5~~V9
V5~~V10
V5~~V11
V5~~V12
V6~~V7
V6~~V8
V6~~V9
V6~~V10
V6~~V11
V6~~V12
V7~~V8
V7~~V9
V7~~V10
V7~~V11
V7~~V12
V8~~V9
V8~~V10
V8~~V11
V8~~V12
V9~~V10
V9~~V11
V9~~V12
V10~~V11
V10~~V12
V11~~V12
'
We then fit the model using the following command. The results are saved
in the object model10.1a.fit. The function needed to fit the data is sem.
We must indicate the model to be fit, the estimator to be used (maximum
likelihood in this case), the dataframe to use, and the clustering variable
(V13). We also force the model to converge for pedagogical purposes using
the command optim.force.converged=TRUE. We then use the summary
command to obtain the results. By including fit.measures=TRUE and
standardized=TRUE, we can obtain the model fit statistics and the
standardized model parameter estimates.
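The fitting and summary commands described here would take the following form, mirroring the call shown later for model10.1b (a sketch; the dataframe name mlm_sem is taken from that later example):

```r
# Fit the level-1 baseline model: level-1 covariances fixed to 0,
# level-2 covariances freely estimated. Assumes the lavaan package
# is installed and the data are in the dataframe mlm_sem.
library(lavaan)

model10.1a.fit <- sem(model = model10.1a, estimator = "ML",
                      data = mlm_sem, optim.force.converged = TRUE,
                      cluster = "V13")
summary(model10.1a.fit, fit.measures = TRUE, standardized = TRUE)
```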
Similarly, we need to obtain the Chi-square fit statistic for the baseline level-
2 model, which is done using the following commands where we freely
estimate the correlations at level 1 and constrain the level-2 correlations
to be 0.
model10.1b<-'
level: 1
V1~~V2
V1~~V3
V1~~V4
V1~~V5
V1~~V6
V1~~V7
V1~~V8
V1~~V9
V1~~V10
V1~~V11
V1~~V12
V2~~V3
V2~~V4
V2~~V5
V2~~V6
V2~~V7
V2~~V8
V2~~V9
V2~~V10
V2~~V11
V2~~V12
V3~~V4
V3~~V5
V3~~V6
V3~~V7
V3~~V8
V3~~V9
V3~~V10
V3~~V11
V3~~V12
V4~~V5
V4~~V6
V4~~V7
V4~~V8
V4~~V9
V4~~V10
V4~~V11
V4~~V12
V5~~V6
V5~~V7
V5~~V8
V5~~V9
V5~~V10
V5~~V11
V5~~V12
V6~~V7
V6~~V8
V6~~V9
V6~~V10
V6~~V11
V6~~V12
V7~~V8
V7~~V9
V7~~V10
V7~~V11
V7~~V12
V8~~V9
V8~~V10
V8~~V11
V8~~V12
V9~~V10
V9~~V11
V9~~V12
V10~~V11
V10~~V12
V11~~V12
level: 2
V1~~0*V2
V1~~0*V3
V1~~0*V4
V1~~0*V5
V1~~0*V6
V1~~0*V7
V1~~0*V8
V1~~0*V9
V1~~0*V10
V1~~0*V11
V1~~0*V12
V2~~0*V3
V2~~0*V4
V2~~0*V5
V2~~0*V6
V2~~0*V7
V2~~0*V8
V2~~0*V9
V2~~0*V10
V2~~0*V11
V2~~0*V12
V3~~0*V4
V3~~0*V5
V3~~0*V6
V3~~0*V7
V3~~0*V8
V3~~0*V9
V3~~0*V10
V3~~0*V11
V3~~0*V12
V4~~0*V5
V4~~0*V6
V4~~0*V7
V4~~0*V8
V4~~0*V9
V4~~0*V10
V4~~0*V11
V4~~0*V12
V5~~0*V6
V5~~0*V7
V5~~0*V8
V5~~0*V9
V5~~0*V10
V5~~0*V11
V5~~0*V12
V6~~0*V7
V6~~0*V8
V6~~0*V9
V6~~0*V10
V6~~0*V11
V6~~0*V12
V7~~0*V8
V7~~0*V9
V7~~0*V10
V7~~0*V11
V7~~0*V12
V8~~0*V9
V8~~0*V10
V8~~0*V11
V8~~0*V12
V9~~0*V10
V9~~0*V11
V9~~0*V12
V10~~0*V11
V10~~0*V12
V11~~0*V12
'
Continuing with Stapleton’s steps for obtaining fit statistics at each level, we
now must fit a model that is saturated at level 2 (thereby yielding a perfect
fit at that level), with the posited model appearing at level 1. This model is
fit using the following:
model10.2a<-'
level: 1
f1w=~V1+V2+V3+V4
f2w=~V5+V6+V7+V8
f3w=~V9+V10+V11+V12
level: 2
V1~~V2
V1~~V3
V1~~V4
V1~~V5
V1~~V6
V1~~V7
V1~~V8
V1~~V9
V1~~V10
V1~~V11
V1~~V12
V2~~V3
V2~~V4
V2~~V5
V2~~V6
V2~~V7
V2~~V8
V2~~V9
V2~~V10
V2~~V11
V2~~V12
V3~~V4
V3~~V5
V3~~V6
V3~~V7
V3~~V8
V3~~V9
V3~~V10
V3~~V11
V3~~V12
V4~~V5
V4~~V6
V4~~V7
V4~~V8
V4~~V9
V4~~V10
V4~~V11
V4~~V12
V5~~V6
V5~~V7
V5~~V8
V5~~V9
V5~~V10
V5~~V11
V5~~V12
V6~~V7
V6~~V8
V6~~V9
V6~~V10
V6~~V11
V6~~V12
V7~~V8
V7~~V9
V7~~V10
V7~~V11
V7~~V12
V8~~V9
V8~~V10
V8~~V11
V8~~V12
V9~~V10
V9~~V11
V9~~V12
V10~~V11
V10~~V12
V11~~V12
'
Using an equation described by Ryu and West (2009), we can calculate the Comparative Fit Index (CFI) for the level-1 portion of the model as

CFI(level 1) = 1 − max[χ²(level-2 saturated) − df(level-2 saturated), 0] / max[χ²(level-1 baseline) − df(level-1 baseline), 0] (10.8)

Thus, we would conclude that the fit of the model at level 1 is extremely good.
Likewise, we can calculate the level-2 CFI value in a similar fashion, by
obtaining the Chi-square goodness of fit statistic for the level-1 saturated
model.
model10.2b<-'
level: 1
V1~~V2
V1~~V3
V1~~V4
V1~~V5
V1~~V6
V1~~V7
V1~~V8
V1~~V9
V1~~V10
V1~~V11
V1~~V12
V2~~V3
V2~~V4
V2~~V5
V2~~V6
V2~~V7
V2~~V8
V2~~V9
V2~~V10
V2~~V11
V2~~V12
V3~~V4
V3~~V5
V3~~V6
V3~~V7
V3~~V8
V3~~V9
V3~~V10
V3~~V11
V3~~V12
V4~~V5
V4~~V6
V4~~V7
V4~~V8
V4~~V9
V4~~V10
V4~~V11
V4~~V12
V5~~V6
V5~~V7
V5~~V8
V5~~V9
V5~~V10
V5~~V11
V5~~V12
V6~~V7
V6~~V8
V6~~V9
V6~~V10
V6~~V11
V6~~V12
V7~~V8
V7~~V9
V7~~V10
V7~~V11
V7~~V12
V8~~V9
V8~~V10
V8~~V11
V8~~V12
V9~~V10
V9~~V11
V9~~V12
V10~~V11
V10~~V12
V11~~V12
level: 2
f1b=~V1+V2+V3+V4
f2b=~V5+V6+V7+V8
f3b=~V9+V10+V11+V12
'
model10.2b.fit <- sem(model = model10.2b, estimator="ML",
data = mlm_sem, optim.force.converged=TRUE, cluster = "V13")
summary(model10.2b.fit, fit.measures=TRUE, standardized=TRUE)
CFI(level 2) = 1 − max[χ²(level-1 saturated) − df(level-1 saturated), 0] / max[χ²(level-2 baseline) − df(level-2 baseline), 0] (10.9)
model10.3<-'
level: 1
f1w=~V1+V2+V3+V4
f2w=~V5+V6+V7+V8
f3w=~V9+V10+V11+V12
level: 2
f1b=~V1+V2+V3+V4
f2b=~V5+V6+V7+V8
f3b=~V9+V10+V11+V12
'
Much of this program is very similar to the more constrained versions that
we just examined. The observed indicators are labeled V1-V12, and the factors are defined at both the within-cluster level (f1w, f2w, and f3w) and the between-cluster level (f1b, f2b, and f3b).
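The fitting call for this model follows the same pattern as the earlier ones (a sketch, assuming the same mlm_sem dataframe and V13 clustering variable used throughout this example):

```r
# Fit the full two-level CFA, with three factors at each level
library(lavaan)

model10.3.fit <- sem(model = model10.3, estimator = "ML",
                     data = mlm_sem, cluster = "V13")
summary(model10.3.fit, fit.measures = TRUE, standardized = TRUE)
```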
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 1.000
P-value H_0: RMSEA >= 0.080 0.000
For the current example, the overall fit looks very good using common
guidelines (Kline, 2016), given that the Chi-square goodness of fit test is not
statistically significant (p = 0.991), the RMSEA is 0, and the CFI and TLI both
exceed 0.95. We have already seen from our prior analyses that the CFI
values at levels 1 and 2 were both quite high, indicating a good fit in each
case. As noted, the SRMR provides information about the fit of the model at
both levels, and from these results we can conclude that the fit at level 1 is
quite good (SRMR = 0.014), and that it is acceptable at level 2 (SRMR =
0.078). The fact that the fit for the level-2 portion of the model is not as good as
that for level 1 (based on SRMR) is not completely surprising given that the
number of level-2 units (60) is much smaller than the number of level-1 units
(2000). Indeed, researchers working with multilevel CFA models must be
cognizant of the level-2 sample size in particular, because in many applications it may be too small for the accurate estimation of model parameters.
Given that the model provides a good fit to the data, we can next
proceed to interpretation of the model parameters, in particular the factor
loadings. The parameter estimates appear for both levels of the data. The
columns are ordered as unstandardized loadings, standard error of the
loadings, test statistic for the null hypothesis that the loading = 0, the p-
value for the test, the loading when only the latent variable is standardized, and the loading when the latent variable and indicators are all standardized. In addition to the loadings, we have the covariances/correlations, the intercepts (set to 0), and the variances for the observed indicators
and the latent variables.
Level 1 [within]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1w =~
V1 1.000 1.035 0.725
V2 0.971 0.036 26.936 0.000 1.005 0.707
V3 0.938 0.036 26.280 0.000 0.971 0.696
V4 0.984 0.036 26.984 0.000 1.019 0.717
f2w =~
V5 1.000 0.960 0.688
V6 0.726 0.038 19.169 0.000 0.697 0.561
V7 0.660 0.036 18.294 0.000 0.633 0.531
V8 0.682 0.037 18.501 0.000 0.655 0.545
f3w =~
V9 1.000 0.958 0.693
V10 0.380 0.037 10.260 0.000 0.364 0.347
V11 0.317 0.036 8.908 0.000 0.304 0.290
V12 0.530 0.045 11.785 0.000 0.508 0.458
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1w ~~
f2w 0.615 0.039 15.761 0.000 0.619 0.619
f3w 0.231 0.034 6.748 0.000 0.233 0.233
f2w ~~
f3w 0.493 0.038 12.823 0.000 0.536 0.536
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.000 0.000 0.000
.V2 0.000 0.000 0.000
.V3 0.000 0.000 0.000
.V4 0.000 0.000 0.000
.V5 0.000 0.000 0.000
.V6 0.000 0.000 0.000
.V7 0.000 0.000 0.000
.V8 0.000 0.000 0.000
.V9 0.000 0.000 0.000
.V10 0.000 0.000 0.000
.V11 0.000 0.000 0.000
.V12 0.000 0.000 0.000
f1w 0.000 0.000 0.000
f2w 0.000 0.000 0.000
f3w 0.000 0.000 0.000
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.967 0.042 23.154 0.000 0.967 0.474
.V2 1.013 0.042 23.935 0.000 1.013 0.501
.V3 1.003 0.041 24.402 0.000 1.003 0.516
.V4 0.983 0.042 23.554 0.000 0.983 0.486
.V5 1.024 0.049 21.001 0.000 1.024 0.526
.V6 1.054 0.040 26.109 0.000 1.054 0.685
.V7 1.019 0.038 26.787 0.000 1.019 0.718
.V8 1.017 0.038 26.462 0.000 1.017 0.703
.V9 0.991 0.076 12.990 0.000 0.991 0.519
.V10 0.970 0.034 28.512 0.000 0.970 0.880
.V11 1.004 0.034 29.355 0.000 1.004 0.916
.V12 0.972 0.038 25.436 0.000 0.972 0.790
Level 2 [V13]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1b =~
V1 1.000 0.598 1.009
V2 1.057 0.053 20.127 0.000 0.632 1.004
V3 0.971 0.052 18.783 0.000 0.580 1.000
V4 1.042 0.049 21.180 0.000 0.623 1.010
f2b =~
V5 1.000 0.748 1.000
V6 0.675 0.036 18.610 0.000 0.505 1.009
V7 0.638 0.043 14.663 0.000 0.477 0.968
V8 0.678 0.033 20.581 0.000 0.507 1.022
f3b =~
V9 1.000 0.693 1.015
V10 0.423 0.038 11.233 0.000 0.293 0.988
V11 0.241 0.038 6.380 0.000 0.167 0.916
V12 0.676 0.046 14.796 0.000 0.468 0.979
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1b ~~
f2b 0.203 0.069 2.915 0.004 0.453 0.453
f3b 0.323 0.073 4.400 0.000 0.780 0.780
f2b ~~
f3b 0.333 0.086 3.885 0.000 0.643 0.643
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.119 0.083 1.429 0.153 0.119 0.200
.V2 0.114 0.087 1.304 0.192 0.114 0.181
.V3 0.146 0.081 1.797 0.072 0.146 0.252
.V4 0.117 0.086 1.357 0.175 0.117 0.189
.V5 −0.030 0.102 −0.300 0.764 −0.030 −0.041
.V6 −0.013 0.070 −0.184 0.854 −0.013 −0.026
.V7 −0.025 0.069 −0.359 0.720 −0.025 −0.050
.V8 −0.002 0.069 −0.034 0.973 −0.002 −0.005
.V9 0.023 0.094 0.242 0.809 0.023 0.033
.V10 0.026 0.045 0.574 0.566 0.026 0.087
.V11 −0.003 0.033 −0.098 0.922 −0.003 −0.018
.V12 0.060 0.067 0.904 0.366 0.060 0.126
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 −0.006 0.005 −1.240 0.215 −0.006 −0.018
.V2 −0.003 0.007 −0.407 0.684 −0.003 −0.007
.V3 0.000 0.006 0.032 0.975 0.000 0.001
.V4 −0.008 0.006 −1.399 0.162 −0.008 −0.020
.V5 −0.000 0.009 −0.013 0.989 −0.000 −0.000
.V6 −0.004 0.006 −0.788 0.431 −0.004 −0.018
.V7 0.015 0.010 1.558 0.119 0.015 0.062
.V8 -0.011 0.005 −2.157 0.031 −0.011 −0.044
.V9 −0.014 0.013 −1.051 0.293 −0.014 −0.029
.V10 0.002 0.006 0.360 0.719 0.002 0.025
.V11 0.005 0.007 0.797 0.425 0.005 0.161
.V12 0.009 0.010 0.928 0.354 0.009 0.041
f1b 0.358 0.075 4.740 0.000 1.000 1.000
f2b 0.560 0.113 4.955 0.000 1.000 1.000
f3b 0.480 0.097 4.950 0.000 1.000 1.000
First, we can focus on the factor loadings at each level. An assumption is made
that the covariance structure is equivalent across clusters at level 2. In this
example, that would mean that we assume that the covariance matrix for the
Quality of Life subscales is the same from one treatment center to the next.
Thus, there exists only one set of factor loadings at level 1. At this level, the factor loadings are all statistically significant, meaning that they are different from 0; i.e., each of the indicators is related to the latent variable that it was hypothesized to measure. In addition, it appears that, descriptively, the loadings for V10, V11, and V12 were somewhat smaller than the loadings for the other indicators in the model. At level 1, all of the factors were significantly correlated with one another, with the smallest correlation being between f1w and f3w (0.233) and the largest being between f1w and f2w (0.619).
The Between Level Results reflect the latent structure among center-level
means of the subscales. In other words, does the hypothesized three-factor
structure hold when the indicators are the means of the quality of life
subscales for each of the centers? Based upon the results above, the answer
would appear to be yes. All of the standardized loadings are near 1, and all
are statistically significant. It should be noted that the level-2 residual
variances for these indicators were very small, and in one case even
negative. These low values are a consequence of trying to estimate the
factor structure using a relatively small level-2 sample (N = 60), combined
with a strong underlying factor structure, which results in very large factor loadings and correspondingly small error variances at level 2.
model10.4<-'
level: 1
f1w=~V1+a*V2+b*V3+c*V4
f2w=~V5+d*V6+e*V7+g*V8
f3w=~V9+h*V10+i*V11+j*V12
level: 2
f1b=~V1+a*V2+b*V3+c*V4
f2b=~V5+d*V6+e*V7+g*V8
f3b=~V9+ h*V10+i*V11+j*V12
'
The only difference between this model and model10.3 is that here we have constrained the factor loadings for common variables to be equal across levels. We do this by using the same letter (e.g., a, b, c) for the corresponding variables at level 1 and level 2. Thus, the loading for V2 on f1w is constrained to be equal to the loading for V2 on f1b because of the a on each. We can see the equality of the loadings reflected in the resulting output.
Level 1 [within]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1w =~
V1 1.000 1.018 0.717
V2 (a) 1.002 0.029 34.101 0.000 1.019 0.713
V3 (b) 0.951 0.029 33.124 0.000 0.968 0.695
V4 (c) 1.007 0.029 35.229 0.000 1.025 0.719
f2w =~
V5 1.000 0.970 0.693
V6 (d) 0.700 0.026 27.068 0.000 0.679 0.550
V7 (e) 0.650 0.027 23.957 0.000 0.631 0.530
V8 (g) 0.682 0.024 28.030 0.000 0.661 0.549
f3w =~
V9 1.000 0.921 0.669
V10 (h) 0.406 0.026 15.632 0.000 0.373 0.355
V11 (i) 0.290 0.025 11.571 0.000 0.267 0.257
V12 (j) 0.600 0.034 17.744 0.000 0.553 0.494
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1w ~~
f2w 0.611 0.036 16.780 0.000 0.619 0.619
f3w 0.221 0.032 6.848 0.000 0.236 0.236
f2w ~~
f3w 0.482 0.036 13.336 0.000 0.540 0.540
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.000 0.000 0.000
.V2 0.000 0.000 0.000
.V3 0.000 0.000 0.000
.V4 0.000 0.000 0.000
.V5 0.000 0.000 0.000
.V6 0.000 0.000 0.000
.V7 0.000 0.000 0.000
.V8 0.000 0.000 0.000
.V9 0.000 0.000 0.000
.V10 0.000 0.000 0.000
.V11 0.000 0.000 0.000
.V12 0.000 0.000 0.000
f1w 0.000 0.000 0.000
f2w 0.000 0.000 0.000
f3w 0.000 0.000 0.000
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.977 0.041 23.982 0.000 0.977 0.485
.V2 1.004 0.041 24.238 0.000 1.004 0.492
.V3 1.004 0.040 24.863 0.000 1.004 0.517
.V4 0.979 0.041 24.030 0.000 0.979 0.483
.V5 1.016 0.047 21.847 0.000 1.016 0.519
.V6 1.062 0.039 26.893 0.000 1.062 0.697
.V7 1.020 0.037 27.230 0.000 1.020 0.719
Level 2 [V13]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1b =~
V1 1.000 0.613 1.009
V2 (a) 1.002 0.029 34.101 0.000 0.614 1.002
V3 (b) 0.951 0.029 33.124 0.000 0.583 1.000
V4 (c) 1.007 0.029 35.229 0.000 0.617 1.010
f2b =~
V5 1.000 0.737 0.999
V6 (d) 0.700 0.026 27.068 0.000 0.516 1.009
V7 (e) 0.650 0.027 23.957 0.000 0.479 0.969
V8 (g) 0.682 0.024 28.030 0.000 0.502 1.022
f3b =~
V9 1.000 0.703 1.024
V10 (h) 0.406 0.026 15.632 0.000 0.285 0.978
V11 (i) 0.290 0.025 11.571 0.000 0.204 0.935
V12 (j) 0.600 0.034 17.744 0.000 0.422 0.957
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1b ~~
f2b 0.203 0.070 2.904 0.004 0.449 0.449
f3b 0.333 0.075 4.440 0.000 0.773 0.773
f2b ~~
f3b 0.330 0.085 3.895 0.000 0.637 0.637
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.118 0.085 1.393 0.164 0.118 0.194
.V2 0.114 0.085 1.335 0.182 0.114 0.186
.V3 0.146 0.082 1.786 0.074 0.146 0.250
.V4 0.116 0.085 1.367 0.172 0.116 0.191
.V5 −0.030 0.100 −0.302 0.763 −0.030 −0.041
.V6 −0.014 0.072 −0.196 0.844 −0.014 −0.028
.V7 −0.025 0.069 −0.363 0.716 −0.025 −0.051
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 −0.007 0.005 −1.282 0.200 −0.007 −0.018
.V2 −0.001 0.007 −0.202 0.840 −0.001 −0.004
.V3 −0.000 0.006 −0.020 0.984 −0.000 −0.000
.V4 −0.007 0.006 −1.320 0.187 −0.007 −0.020
.V5 0.001 0.009 0.131 0.896 0.001 0.002
.V6 −0.005 0.006 −0.815 0.415 −0.005 −0.018
.V7 0.015 0.010 1.542 0.123 0.015 0.061
.V8 −0.011 0.005 −2.167 0.030 −0.011 −0.045
.V9 −0.023 0.012 −1.834 0.067 −0.023 −0.048
.V10 0.004 0.006 0.588 0.557 0.004 0.044
.V11 0.006 0.007 0.859 0.390 0.006 0.125
.V12 0.016 0.011 1.508 0.132 0.016 0.084
f1b 0.376 0.077 4.893 0.000 1.000 1.000
f2b 0.543 0.108 5.007 0.000 1.000 1.000
f3b 0.494 0.097 5.092 0.000 1.000 1.000
The ICC (proportion of the latent variance accounted for at level 2) for factor 1 is then calculated as

ICC_F1 = 0.376 / (0.376 + 1.036) = 0.376 / 1.413 = 0.268

Therefore, we can conclude that approximately 27% of the variation in the first latent variable is due to between treatment center variance. Similar calculations for the other two factors yielded ICCs of 0.367 and 0.343, respectively.
With structural equation modeling (SEM), we can examine relationships among latent variables in much the same way that we do with observed variables using regression
models. And, just as we can extend single-level regression models to the
multilevel context, so can we employ SEM in the multilevel context. Indeed,
many of the key ideas that we described in chapter 3 for multilevel regression
models can be applied in the latent variable context with multilevel SEM. In
this section, we will describe the basics of fitting multilevel SEMs using the
lavaan package in R. We should note here that as of September 2023,
lavaan allows for a random intercept model, but not random slopes.
The single-level SEM is written as

η = Bη + ζ (10.10)

where η is the vector of latent variables, B is the matrix of structural coefficients relating the latent variables to one another, and ζ is the vector of structural disturbances. In the multilevel context, this structural model is specified at each level of the data:

η_Wic = B_W η_Wic + ζ_Wic (10.11)

η_Bc = α + B_B η_Bc + ζ_Bc (10.12)

where the W and B subscripts denote the within-cluster and between-cluster portions of the model, i indexes individuals, and c indexes clusters.
model10.5<-'
level: 1
f1w=~V1+V2+V3+V4
f2w=~V5+V6+V7+V8
f3w=~V9+V10+V11+V12
f1w~f2w+f3w
level: 2
f1b=~V1+V2+V3+V4
f2b=~V5+V6+V7+V8
f3b=~V9+V10+V11+V12
f1b~f2b+f3b
V1~~0*V1
V5~~0*V5
V9~~0*V9
'
This program resembles those used in the CFA examples above, with the primary addition being the inclusion of the structural portion of the model at each level (f1w~f2w+f3w and f1b~f2b+f3b).
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 1.000
P-value H_0: RMSEA >= 0.080 0.000
These results are very similar to those of the full CFA model (Model 10.3),
and indicate a very good fit. Again, it is important to keep in mind that
except for the SRMR, these indices are dominated by the level-1 portion of
the model because its sample size is so much larger (2,000 versus 60).
Finally, we will consider the model parameter estimates. In particular, we
are interested in the structural coefficients, as opposed to the factor loadings,
which were our primary concern in the context of CFA. Indeed, once we have
established that the individual measurement models fit the data, we are
typically not very interested in the factor loading estimates. For this reason,
we only include the standardized structural values for the model next.
Level 1 [within]:
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1w ~
f2w 0.748 0.053 14.246 0.000 0.693 0.693
f3w −0.149 0.047 −3.162 0.002 −0.138 −0.138
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f2w ~~
f3w 0.493 0.038 12.848 0.000 0.535 0.535
Level 2 [V13]:
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f1b ~
f2b −0.066 0.105 −0.632 0.528 −0.083 −0.083
f3b 0.727 0.121 5.993 0.000 0.833 0.833
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
f2b ~~
f3b 0.329 0.085 3.866 0.000 0.644 0.644
Based upon these results, we can conclude that at level 1, latent variable 2 has a statistically significant positive relationship with latent variable 1, whereas latent variable 3 has a statistically significant negative relationship with latent variable 1. In other words, individuals with higher values on the second factor also have higher values on the first factor, whereas those with higher values on the third factor have lower values on the first factor. At level 2, factor 2 is not related to factor 1 (p = 0.528), but factor 3 has a statistically significant positive association with factor 1. This means that when the average factor 3 score at a center was higher, the average factor 1 score was higher as well.
model10.6<-'
level: 1
iw=~1*V1+1*V2+1*V3+1*V4+1*V5+1*V6
sw=~0*V1+1*V2+2*V3+3*V4+4*V5+5*V6
V1~~a*V1
V2~~a*V2
V3~~a*V3
V4~~a*V4
V5~~a*V5
V6~~a*V6
level: 2
ib=~1*V1+1*V2+1*V3+1*V4+1*V5+1*V6
sb=~0*V1+1*V2+2*V3+3*V4+4*V5+5*V6
V1~~0*V1
V2~~0*V2
V3~~0*V3
V4~~0*V4
V5~~0*V5
V6~~0*V6
'
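The growth model can then be fit in the same fashion as the earlier models (a sketch; the dataframe name mlm_growth is hypothetical, and the clustering variable V7 is taken from the output that follows):

```r
# Fit the two-level linear growth model. mlm_growth is a
# hypothetical dataframe name; V7 is the clustering variable.
library(lavaan)

model10.6.fit <- sem(model = model10.6, estimator = "ML",
                     data = mlm_growth, cluster = "V7")
summary(model10.6.fit, fit.measures = TRUE, standardized = TRUE)
```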
Level 1 [within]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
iw =~
V1 1.000 1.052 0.830
V2 1.000 1.052 0.693
V3 1.000 1.052 0.497
V4 1.000 1.052 0.370
V5 1.000 1.052 0.290
V6 1.000 1.052 0.237
sw =~
V1 0.000 0.000 0.000
V2 1.000 0.855 0.563
V3 2.000 1.709 0.808
V4 3.000 2.564 0.901
V5 4.000 3.418 0.942
V6 5.000 4.273 0.962
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
iw ~~
sw −0.014 0.040 −0.340 0.734 −0.015 −0.015
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.000 0.000 0.000
.V2 0.000 0.000 0.000
.V3 0.000 0.000 0.000
.V4 0.000 0.000 0.000
.V5 0.000 0.000 0.000
Variances:
Level 2 [V7]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ib =~
V1 1.000 0.478 1.000
V2 1.000 0.478 0.556
V3 1.000 0.478 0.331
V4 1.000 0.478 0.231
V5 1.000 0.478 0.177
V6 1.000 0.478 0.143
sb =~
V1 0.000 0.000 0.000
V2 1.000 0.647 0.753
V3 2.000 1.294 0.896
V4 3.000 1.941 0.939
V5 4.000 2.588 0.958
V6 5.000 3.235 0.969
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ib ~~
sb 0.046 0.054 0.849 0.396 0.149 0.149
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.000 0.000 0.000
.V2 0.000 0.000 0.000
.V3 0.000 0.000 0.000
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.000 0.000 0.000
.V2 0.000 0.000 0.000
.V3 0.000 0.000 0.000
.V4 0.000 0.000 0.000
.V5 0.000 0.000 0.000
.V6 0.000 0.000 0.000
ib 0.228 0.066 3.449 0.001 1.000 1.000
sb 0.419 0.090 4.657 0.000 1.000 1.000
These results indicate that the average standardized growth was 1.061
points per year, and the mean standardized starting score was 0.791, both of
which were statistically significantly different from 0. In addition, there was
not a significant relationship between the starting score and growth over
time, with the sample correlation between these two parameters estimated
to be 0.149. Finally, there was statistically significant variation in the degree of growth among individuals within schools (σ̂²_SW = 0.730), as well as significant variance in the starting values within schools (σ̂²_IW = 1.106). In other words, not all of the students started at the same level, and not all of them had scores change at the same rate over time.
The single-level two-parameter logistic (2PL) model is written as

P(x_j = 1 | θ_i, a_j, b_j) = e^(a_j(θ_i − b_j)) / (1 + e^(a_j(θ_i − b_j))) (10.13)

where x_j is the response to item j, θ_i is the latent trait for individual i, a_j is the discrimination parameter for item j, and b_j is the difficulty parameter for item j. In the multilevel context, the model becomes

P(x_j = 1 | θ_ic, a_j, b_j) = e^(a_j(θ_ic − b_j)) / (1 + e^(a_j(θ_ic − b_j))) (10.14)
The terms in (10.14) are identical to those in (10.13), with the exception of the latent trait being measured. In the single-level analysis, θ_i is simply the latent trait for individual i. However, in the multilevel context, the latent trait becomes θ_ic, the latent trait for individual i in cluster c. As described by Kamata and Vaughn, θ_ic can be decomposed into two components: the mean level of the latent trait for cluster c, and the deviation from this mean by individual i in cluster c. In addition, researchers can include covariates of the latent trait in this model, and the covariates can occur at level 1, level 2, or both.
covdata<-as.factor(mlm_irt[,31])
covdata.df<-data.frame(covdata)
names(covdata.df)<-"group"
mlm_irt.df<-data.frame(mlm_irt)
Prior to fitting the model, we need to organize the data so that we can then
define the group membership in order to appropriately account for the
multilevel structure of the data. First, we create a data frame in R that
includes only the group variable (covdata), and we name the column
"group". We then ensure that the mlm_irt matrix of item responses is also
a dataframe, which the mixedmirt function requires. We can now fit the
multilevel 2PL model. The results of the model are saved in an output object
called mixed.2pl. In the function call, we first identify the dataframe and
item responses (columns 1–30). We define the covariate dataframe, which in
this case includes the level-2 grouping variable (covdata.df). The model
is unidimensional, which we indicate with 1, and we specify the item type
as 2PL. The fixed effects are the items and the random effect is the group.
Finally, we request the results including the standard errors, using the coef
command from the mirt library.
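Based on this description, and on the Rasch call shown later in the chapter, the fitting command would look roughly as follows (a sketch, assuming the mirt package is loaded):

```r
# Multilevel 2PL model: items are fixed effects, with a random
# intercept for the level-2 grouping variable.
library(mirt)

mixed.2pl <- mixedmirt(mlm_irt.df[, 1:30], covdata = covdata.df,
                       model = 1, itemtype = '2PL',
                       fixed = ~ 0 + items, random = ~ 1 | group)
```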
First, we examine the variance associated with the random effects, Theta
(level-1) and COV_group (level-2).
summary(mixed.2pl)
--------------
RANDOM EFFECT COVARIANCE(S):
Correlations on upper diagonal
$Theta
F1
F1 1
$group
COV_group
COV_group 0.167
The proportion of variance in the latent trait that is accounted for by the level-2 grouping (i.e., the ICC for the latent trait) is therefore 0.167 / (1 + 0.167) = 0.167 / 1.167 = 0.14.
Next, we can extract the information indices that reflect the fit of the
model. If we were interested in fitting alternative models to the data, such
as a 1PL or Rasch, we could compare the fit of the various models using
these information indices. As in other contexts, smaller values of these
statistics reflect a better fit.
extract.mirt(mixed.2pl,"SABIC")
[1] 61749.16
extract.mirt(mixed.2pl,"BIC")
[1] 61942.95
extract.mirt(mixed.2pl,"AIC")
[1] 61614.91
mirt::coef(mixed.2pl, printSE=TRUE)
$Item_1
a1 d g u
par 0.979 -0.007 -999 999
SE 0.083 0.033 NA NA
$Item_2
a1 d g u
par 0.746 -0.018 -999 999
SE 0.073 0.032 NA NA
$Item_3
a1 d g u
par 0.866 −0.095 −999 999
SE 0.077 0.032 NA NA
$Item_4
a1 d g u
par 0.783 −0.103 −999 999
SE 0.074 0.033 NA NA
$Item_5
a1 d g u
par 0.930 −0.079 −999 999
SE 0.077 0.032 NA NA
$Item_6
a1 d g u
par 0.903 −0.160 −999 999
SE 0.080 0.033 NA NA
$Item_7
a1 d g u
par 0.905 −0.042 −999 999
SE 0.076 0.032 NA NA
$Item_8
a1 d g u
par 0.871 −0.071 −999 999
SE 0.075 0.032 NA NA
$Item_9
a1 d g u
par 0.751 −0.120 −999 999
SE 0.072 0.032 NA NA
$Item_10
a1 d g u
par 0.804 −0.031 −999 999
SE 0.078 0.032 NA NA
$Item_11
a1 d g u
par 0.677 −0.097 −999 999
SE 0.067 0.032 NA NA
$Item_12
a1 d g u
par 0.760 −0.155 −999 999
SE 0.072 0.032 NA NA
$Item_13
a1 d g u
par 0.872 −0.152 −999 999
SE 0.077 0.033 NA NA
$Item_14
a1 d g u
par 0.829 −0.014 −999 999
SE 0.073 0.032 NA NA
$Item_15
a1 d g u
par 0.680 −0.134 −999 999
SE 0.068 0.032 NA NA
$Item_16
a1 d g u
par 0.667 −0.145 −999 999
SE 0.068 0.032 NA NA
$Item_17
a1 d g u
par 0.810 −0.040 −999 999
SE 0.072 0.032 NA NA
$Item_18
a1 d g u
par 0.779 −0.124 −999 999
SE 0.069 0.031 NA NA
$Item_19
a1 d g u
par 0.840 −0.124 −999 999
SE 0.076 0.032 NA NA
$Item_20
a1 d g u
par 0.986 −0.078 −999 999
SE 0.081 0.034 NA NA
$Item_21
a1 d g u
par 0.706 −0.164 −999 999
SE 0.070 0.033 NA NA
$Item_22
a1 d g u
par 0.716 −0.055 −999 999
SE 0.070 0.032 NA NA
$Item_23
a1 d g u
par 0.894 −0.105 −999 999
SE 0.076 0.033 NA NA
$Item_24
a1 d g u
par 0.876 −0.059 −999 999
SE 0.076 0.032 NA NA
$Item_25
a1 d g u
par 0.823 −0.058 −999 999
SE 0.075 0.033 NA NA
$Item_26
a1 d g u
par 0.776 −0.194 −999 999
SE 0.071 0.032 NA NA
$Item_27
a1 d g u
par 0.842 −0.067 −999 999
SE 0.074 0.031 NA NA
$Item_28
a1 d g u
par 0.907 −0.060 −999 999
SE 0.077 0.032 NA NA
$Item_29
a1 d g u
par 0.821 −0.087 −999 999
SE 0.074 0.032 NA NA
$Item_30
a1 d g u
par 0.788 −0.013 −999 999
SE 0.076 0.032 NA NA
$GroupPars
MEAN_1 COV_11
par 0 1
SE NA NA
$group
COV_group_group
par 0.167
SE 0.031
The item discrimination parameters appear in the column labeled a1, with
item difficulty appearing in the d column. The item parameter estimate
appears in the par row and the standard errors in the SE row. Thus, for
example, the item discrimination and difficulty estimates for item 1 are
0.979 and −0.007, respectively.
We can extract the random effects (level-1 theta estimates and level-2
mean theta by group) using the randef command in R. Once we have
these estimates, we can examine them using histograms (Figures 10.1
and 10.2).
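The extraction and plotting steps can be sketched as follows (the object name eff.2pl matches the axis label in Figure 10.1; the element name group for the level-2 effects is an assumption based on the model specification):

```r
# Extract empirical estimates of the random effects from the
# fitted mixedmirt model
library(mirt)

eff.2pl <- randef(mixed.2pl)

# Level-1 latent trait estimates (Figure 10.1)
hist(eff.2pl$Theta, xlab = "eff.2pl$Theta", main = "")

# Level-2 group-level estimates (Figure 10.2); the element name
# "group" is assumed from the random effect specification
hist(eff.2pl$group, xlab = "eff.2pl$group", main = "")
```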
FIGURE 10.1
Histogram showing the distribution of level-1 Latent trait estimates.
FIGURE 10.2
Histogram showing the distribution of level-2 Latent trait estimates.
mixed.rasch<-mixedmirt(mlm_irt[,1:30], covdata=covdata.df, 1,
itemtype='Rasch', fixed=~0+items, random=~1|group)
The model fit statistics for the multilevel Rasch model appear next.
extract.mirt(mixed.rasch,"SABIC")
[1] 61658.45
extract.mirt(mixed.rasch,"BIC")
[1] 61760.11
extract.mirt(mixed.rasch,"AIC")
[1] 61588.02
These values are lower than those from the multilevel 2PL model, suggesting
that the multilevel Rasch model provides a better fit to the data than does the
2PL, once model complexity is taken into account.
The latent class model expresses the probability of observing a particular pattern of responses to the indicators X_1, X_2, X_3, …, X_j in terms of the latent class proportions and the class-specific item response probabilities:

P(X_1 = x_1, X_2 = x_2, …, X_j = x_j) = Σ_{t=1}^{T} π_t^Y ∏_j π_{jt}^(X_j|Y) (10.17)

where

π_t^Y = probability that a randomly selected individual will be in class t of latent variable Y

π_{jt}^(X_j|Y) = probability that a member of latent class t will provide a particular response to observed indicator j
where T is the total number of latent classes. In the multilevel formulation, the probability that individual i in cluster j is in level-1 latent class t, conditional on cluster j belonging to level-2 class m, can be written as

P(C_ij = t | CB_j = m) = exp(γ_tm) / Σ_{r=1}^{T} exp(γ_rm) (10.19)
where γ_tm is the logistic intercept parameter associated with level-1 class t for clusters in level-2 class m.
Estimating MLCA in R
Given that the nonparametric model was shown to provide better latent class recovery in many situations, it will be the focus of our application with the multilevLCA library in R. For this example, we will use data from
the Response to Educational Disruption Survey (REDS). Specifically, we are
interested in identifying latent classes of students in Denmark based upon
their responses to dichotomous survey items regarding school support
during the COVID-19 pandemic. Responses were coded as 1 (Yes) and 0
(No) for each of the support items. The individual items appear in Table 10.1.
For the current example, there were 1,534 students from 75 schools
included in the analysis. We will treat school as the clustering variable. The
R commands used to fit the model for three level-1 and two level-2 latent
classes appear next.
denmark_lca_multi_3_2
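The object printed above would be created by a call of roughly the following form (a sketch based on the description that follows; multiLCA() is the main fitting function in the multilevLCA package, and the vector Y of item names is an assumption):

```r
# Fit a multilevel LCA with three level-1 classes (iT) and two
# level-2 classes (iM), with schools as the higher-level units.
# The item-name vector Y and the column range are assumptions.
library(multilevLCA)

Y <- colnames(mlm_lca)[1:6]   # names of the dichotomous support items

denmark_lca_multi_3_2 <- multiLCA(data = mlm_lca, Y = Y,
                                  iT = 3, id_high = "IDSCHOOL",
                                  iM = 2)
denmark_lca_multi_3_2
```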
FIGURE 10.3
Multilevel latent class models
Note: There will be T-1 latent class means, where T = number of latent classes. Latent class 1
will always have a mean of 0.
TABLE 10.1
COVID-19 Support Items
Item
We set the number of level-1 latent classes to iT=3 and the number of level-2
classes to iM=2. We then fit the multilevel LCA model and save the results in an
output object called denmark_lca_multi_3_2. The function call requires
that we include the data set name (mlm_lca), the names of the variables
used to create the latent classes (Y), the number of level-1 (iT) and level-2
(iM) classes, and the name of the variable denoting level-2 membership
(IDSCHOOL). The final line requests the output be printed out, and those
results appear next.
CALL:
NULL
SPECIFICATION:
Multilevel LC model
ESTIMATION DETAILS:
First, we see that the model converged. Had it not converged, the message
to that effect would have appeared here.
GROUP PROPORTIONS:
P(G1) 0.8268
P(G2) 0.1732
CLASS PROPORTIONS:
G1 G2
P(C1|G) 0.4445 0.2650
Next, we see the probability of cases in each class at levels 1 and 2. We see
that for the level-2 structure, approximately 83% of schools were in the first
group, with approximately 17% of the schools belonging to the second
group. The next table displays the proportion of level-1 units (students) in
each of the level-2 groups. Thus, group 1 included 44% of students in latent
class 1, 43% in latent class 2, and 13% in latent class 3. Group 2 included 27%
from class 1, 19% from class 2, and 55% from latent class 3.
The latent class item response probabilities for a response of 1 by latent
class appear in the following table. From these results, we can see that
individuals in latent class 1 were the most likely to respond Yes to each of
the support during the pandemic items. In contrast, respondents in class 3
were the least likely to respond Yes, and response probabilities for those in
class 2 fell in between those in the other latent classes.
RESPONSE PROBABILITIES:
C1 C2 C3
P(IS1G21A|C) 0.9786 0.8267 0.0283
P(IS1G21B|C) 0.9796 0.7944 0.0324
P(IS1G21C|C) 0.9837 0.7466 0.0085
P(IS1G21D|C) 0.9065 0.3543 0.0193
P(IS1G21E|C) 0.9953 0.5403 0.0082
P(IS1G21F|C) 0.9834 0.6749 0.0094
P(IS1G21G|C) 0.8389 0.1998 0.0000
P(IS1G21H|C) 0.6595 0.1984 0.0000
---------------------------
The model fit statistics appear next. These can be used to compare the fit of
multiple models and select the optimal solution. In addition to the AIC,
there are also BIC, ICLBIC, and R2 entropy values for level 1 (low) and level
2 (high). The information indices (AIC, BIC, ICLBIC) are interpreted in the
same manner as for other models, i.e., lower values indicate a better-fitting
model, after a penalty for complexity is taken into account. The R2 entropy
value reflects the improvement in the prediction of class membership when
using the observed variables (e.g., item responses) as compared to not using
them (i.e., random assignment). Thus, values closer to 1 indicate that the
model is better able to classify individuals. In practice, the R2 entropy
statistics at level 1 and level 2 can be interpreted as indicators of the
separation of latent classes at each level, with higher values meaning greater
class separation. These statistics can be used to compare models as part of
the strategy for optimal model selection.
AIC 10802.0871
BIClow 10956.8205
BIChigh 10869.2943
ICLBIClow 11450.7194
ICLBIChigh 10881.6994
R2entrlow 0.8445
R2entrhigh 0.8206
The model fit statistics for the model including the covariates appear next.
The information indices for this model are all lower than those for the model excluding the
covariates, suggesting that it provides a better fit to the data. On the other
hand, the R2 entropy values are lower for the covariate model, indicating
that the latter yields less certain class membership than the former.
However, given the work by Varriale and Vermunt (2010), the BIC is
more likely to yield accurate results for identifying the optimal model. Thus,
we would conclude that including the covariates resulted in a better-fitting
model than did the simple multilevel latent class model.
AIC 5645.308
BIClow 5816.6921
BIChigh 5706.8598
ICLBIClow 6088.5998
ICLBIChigh 5720.8033
R2entrlow 0.7976
R2entrhigh 0.6168
P(G1) 0.8268
P(G2) 0.1732
G1 G2
P(C1|G) 0.4232 0.7336
P(C2|G) 0.5131 0.2007
P(C3|G) 0.0636 0.0657
RESPONSE PROBABILITIES:
C1 C2 C3
P(IS1G21A|C) 0.9786 0.8267 0.0283
P(IS1G21B|C) 0.9796 0.7944 0.0324
P(IS1G21C|C) 0.9837 0.7466 0.0085
P(IS1G21D|C) 0.9065 0.3543 0.0193
P(IS1G21E|C) 0.9953 0.5403 0.0082
P(IS1G21F|C) 0.9834 0.6749 0.0094
P(IS1G21G|C) 0.8389 0.1998 0.0000
P(IS1G21H|C) 0.6595 0.1984 0.0000
Next, we can examine the logistic regression results relating each of the
covariates with the latent class variables at each level. First, we see the
results for the level-2 covariates. Neither of them is statistically significantly
related to group membership (p-values all exceeded 0.05).
---------------------------
For the level-1 covariates, the relationships are conditional upon level-2
latent class membership. For individuals whose school was placed in
level-2 latent class 1 (G1), home language status was negatively related
to latent class 2 versus 1. Given that language is coded as 1 (home
language is same as test language) or 0 (home language differs from test
language), a negative coefficient indicates that those coded as 1 had a
lower probability of being in latent class 2 versus latent class 1. In other
words, those whose home language was the same as that of the test were
less likely to be in latent class 2 (those who reported somewhat less
school support during the pandemic as compared to those in class 1).
There was not a significant relationship between either level-1 covariate
and being in class 3 versus 1, when the school was in level-2 class 1. For
students whose school was in level-2 class 2 (G2), home language was
significantly related to class membership. Specifically, students whose
home language matched that of the test were less likely to be in level-1
class 2 vis-à-vis level-1 class 1. In contrast, home language matching
that of the test was positively associated with being in level-1 class 3 as
compared to class 1.
Summary
In this chapter, we have extended the modeling of latent variables to the
multilevel context. In many instances, these models resembled the ones that
we used for observed variables such as regression (Chapter 3). In other
cases, however, we are working with completely new ways of thinking
about and modeling our data. In particular, multilevel LCA presents us
with some very interesting, potentially informative, and definitely chal-
lenging models. Just as with single-level latent variable modeling, of
paramount importance is the linking of theory to the model and data that
we use. Given the complexity of the methods discussed in this chapter, it
can become very easy for the researcher to drift away from the solid footing
of established theory into uncharted and deep waters, aided by models that
are difficult to fit and to understand. Our goal in this chapter was only to
introduce the interested reader to the many options available in the latent
variable modeling context. The full breadth of such topics is well beyond
the purview of a single chapter and indeed could easily be contained within
an entire book. However, we hope that this work provides researchers with
a starting point from which they can embark on their exploration of the finer
details of multilevel models and methods for latent variables that are of
particular interest to them.
11
Additional Modeling Frameworks
for Multilevel Data
$$y_i = \beta_1 x_{1i} + \sum_{j} C_j \gamma_j + \varepsilon_{ij} \qquad (11.1)$$

where $C_j$ is a dummy indicator denoting membership in level-2 unit j, and $\gamma_j$ is the fixed effect (intercept) associated with that unit.
library(AER)
library(lme4)
library(lmerTest)
data(Fatalities)
Fatalities$fatalities_per_mile<-Fatalities$fatal/
Fatalities$milestot
Fatalities$young_rate<-Fatalities$youngdrivers/
Fatalities$milestot
# Model fit producing the summary output below; the object name is illustrative
mlm.fatalities<-lmer(fatalities_per_mile~young_rate+(1|state),
data=Fatalities)
summary(mlm.fatalities)
Scaled residuals:
Min 1Q Median 3Q Max
−3.5137 −0.4438 −0.0317 0.4158 3.9695
Random effects:
Groups Name Variance Std.Dev.
state (Intercept) 4.835e−05 0.006953
Residual 6.221e−06 0.002494
Number of obs: 336, groups: state, 48
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.093e−02 1.148e-03 5.547e+01 18.235 <2e−16 ***
young_rate 4.260e+02 4.606e+01 3.004e+02 9.249 <2e−16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
young_rate −0.470
The coefficient estimate for the relationship between the youth driving and
fatality rates is 426 and is statistically significant. Thus, we can conclude that
for an increase of 1 young driver per mile, there is an increase of 426
fatalities per mile.
Next, let’s fit the FEM using the following R code:
fem.fatalities<-lm(fatalities_per_mile~young_rate+state-1,
data=Fatalities)
We use the lm function to fit the model, just as we did for regression in
Chapter 1. The core portion of the model call is fatalities_per_mile~
young_rate + state – 1. By including the state variable, we account for
the level-2 structure of the data and by including −1, we exclude the overall
intercept. We can obtain the results using the summary command.
summary(fem.fatalities)
Call:
lm(formula = fatalities_per_mile ~ young_rate + state − 1,
data = Fatalities)
Residuals:
Min 1Q Median 3Q Max
-0.0074236 -0.0011486 -0.0001027 0.0011260 0.0097668
Coefficients:
Estimate Std. Error t value Pr(>|t|)
young_rate 5.515e+02 5.246e+01 10.513 < 2e−16 ***
The coefficient linking the youth driving rate with the fatality rate is 551.5
and is statistically significant. This relationship is qualitatively similar to
that for the MLM, reflecting that more young drivers per mile were
associated with a higher fatality rate per mile. The intercepts for each state
also appear in the output and reflect the mean fatality rate per mile when
the young driver per mile rate is 0. Of course, this is a situation that is unlikely
to actually occur in practice. Finally, the model R2 is 0.99, suggesting that
together the state and the youth driver rate per mile account for 99% of the
variance in the fatality rate per mile.
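The dummy-variable FEM used above is numerically equivalent to regressing on cluster-demeaned data (the "within" estimator). A minimal base-R demonstration with simulated data (all names and values illustrative):

```r
# Simulate 10 clusters of 20 observations with cluster-specific intercepts
set.seed(1)
cluster <- factor(rep(1:10, each = 20))
x <- rnorm(200)
y <- 0.5 * x + rep(rnorm(10, sd = 2), each = 20) + rnorm(200)

# FEM via cluster dummies, excluding the overall intercept (as in the text)
b_dummy <- coef(lm(y ~ x + cluster - 1))["x"]

# The same slope via within-cluster demeaning of x and y
xw <- x - ave(x, cluster)
yw <- y - ave(y, cluster)
b_within <- coef(lm(yw ~ xw - 1))["xw"]

all.equal(unname(b_dummy), unname(b_within))  # TRUE
```

This equivalence is why the FEM is often described as using only within-cluster variation to estimate the slope.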
$$U(\beta) = \sum_{j=1}^{J} \frac{\partial \mu_j}{\partial \beta} V_j^{-1}\left(y_j - X_j \beta\right) \qquad (11.2)$$

where $\mu_j$ is the mean vector and $V_j$ is the working covariance matrix for cluster j. The empirical ("sandwich") covariance estimator of the coefficient estimates is

$$\left[X'\hat{W}X\right]^{-1}\left[\sum_{i=1}^{n} X'(y - \hat{\mu})(y - \hat{\mu})'X\right]\left[X'\hat{W}X\right]^{-1} \qquad (11.3)$$
where
X = matrix of predictors
y = value of response variable
ˆ = estimated mean of response variable
Ŵ = diagonal matrix of final weights, if applicable
$$V_I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Here, we can see that there are no correlations among the measurements
within the same level-2 unit. The exchangeable covariance matrix is

$$V_E = \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}$$

The unstructured matrix allows each pair of measurements to have its own correlation:

$$V_U = \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{bmatrix}$$

Finally, the first-order autoregressive (AR1) structure assumes that correlations decline as measurements become further separated in time:

$$V_{AR1} = \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix}$$
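Three of these working correlation structures are easy to build directly in base R, which can help in seeing how they differ (rho values are illustrative):

```r
# Working correlation structures for n = 3 time points
rho <- 0.5

V_I <- diag(3)                              # independence
V_E <- matrix(rho, 3, 3); diag(V_E) <- 1    # exchangeable: one common rho
V_AR <- rho ^ abs(outer(1:3, 1:3, "-"))     # AR1: rho^|t - s|
V_AR
```

Note how the AR1 structure reproduces the pattern above: adjacent time points correlate at rho, while points two steps apart correlate at rho squared.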
Selection of the optimal covariance structure can be done using the Quasi
Information Criterion (QIC). Although GEE does not produce a true
likelihood, given that it is estimated using a least squares criterion, an
analog to the common information criteria (e.g., AIC, BIC) has been
developed for use with GEE (Hardin & Hilbe, 2003; Pan, 2001). As with
other information indices, lower values of the QIC suggest a better fit to the
data. Per the recommendations of Ekstrom et al. (2022), the correlation
information criterion (CIC) may be more robust than QIC to misspecifica-
tion of the model. In practice, the researcher will fit models with different
covariance structures and then compare them using the QIC statistics. The
model/covariance combination reflecting the lowest value of QIC or CIC
provides the best fit to the data. QIC takes the form:
$$QIC = -2Q(\hat{\beta}; I) + 2\,\mathrm{trace}\left(\hat{\Omega}_I^{-1}\hat{V}_R\right) \qquad (11.4)$$

where $Q(\hat{\beta}; I)$ is the quasi-likelihood computed under the independence working correlation structure, $\hat{\Omega}_I$ is the model-based covariance matrix of the coefficient estimates under independence, and $\hat{V}_R$ is the robust (sandwich) covariance estimate. The CIC is simply the trace term from (11.4):

$$CIC = \mathrm{trace}\left(\hat{\Omega}_I^{-1}\hat{V}_R\right) \qquad (11.5)$$
library(geepack)
exch.fatalities<-geeglm(fatalities_per_mile~young_rate,
id=state, corstr="exchangeable", family=gaussian,
data=Fatalities)
summary(exch.fatalities)
Call:
geeglm(formula = fatalities_per_mile ~ young_rate, family =
gaussian, data = Fatalities, id = state, corstr =
"exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 2.096e−02 1.089e−03 370.86 < 2e−16 ***
young_rate 4.232e+02 9.239e+01 20.98 4.65e−06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimate Std.err
(Intercept) 5.302e−05 9.822e−06
Link = identity
The function call for geeglm is straightforward. First, we define the model
much as we do with regression and MLMs. We need to indicate the
distribution family of the outcome variable, in this case normal/Gaussian,
which variable denotes the level-2 units (state), and the name of the
dataframe containing these variables. Finally, we need to indicate which
correlation structure (corstr) we would like to use, exchangeable in this case.
The regression coefficient is estimated as 423.2, meaning that for an increase of
1 young driver per mile, there is an increase of 423.2 fatalities per mile. This
result is very similar to that of the MLM, which we saw above. The common
correlation estimate is labeled alpha and has a value of 0.8829, suggesting that
fatality rates per mile are quite similar within the same state across years.
In order to fit GEE with an unstructured correlation matrix, we use the
following command, in particular, corstr="unstructured". All other
aspects of the function call remain the same as before.
unstr.fatalities<-geeglm
(fatalities_per_mile~young_rate, id=state, corstr=
"unstructured", family=gaussian, data=Fatalities)
summary(unstr.fatalities)
Call:
geeglm(formula = fatalities_per_mile ~ young_rate, family =
gaussian, data = Fatalities, id = state, corstr =
"unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 2.18e−02 8.05e−02 0.07 0.79
young_rate 2.87e+02 7.91e+03 0.00 0.97
Estimate Std.err
(Intercept) 4.38e−05 0.00042
Link = identity
When we allow the correlation structure to vary across time, the estimate of
the relationship between the youth driving rate and fatality rate (287) is
smaller than for the exchangeable correlation structure or MLM. In addition,
note that the correlation estimates themselves appear to be somewhat
unstable, as are the standard errors. This instability may call into question
the overall suitability of these estimates. It seems likely that the data may
not support estimation of the unstructured correlation matrix, which can be
a problem associated with this framework. Not infrequently, in practice it
may be difficult for the algorithm to converge on stable parameter estimates
with an unstructured covariance matrix. In such cases, the researcher may be
better served using a somewhat simpler approach such as an exchangeable
structure.
Finally, given that the data consist of repeated measurements over
consecutive years, we might consider using the AR1 structure. This model
can be fit by using corstr="ar1", as below.
ar1.fatalities<-geeglm(fatalities_per_mile~young_rate,
id=state, corstr="ar1", family=gaussian, data=Fatalities)
summary(ar1.fatalities)
Call:
geeglm(formula = fatalities_per_mile ~ young_rate, family =
gaussian, data = Fatalities, id = state, corstr = "ar1")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 2.13e−02 1.98e−03 115.58 <2e−16 ***
young_rate 4.24e+02 1.62e+02 6.82 0.009 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimate Std.err
(Intercept) 5.32e−05 1.52e−05
Link = identity
QIC(unstr.fatalities, exch.fatalities)
QIC QICu Quasi Lik CIC params QICC
unstr.fatalities 144095.8 4.01 −0.00736 72047.9 2 144141.8
exch.fatalities 37.4 4.02 −0.00891 18.7 2 37.9
QIC(ar1.fatalities, exch.fatalities)
QIC QICu Quasi Lik CIC params QICC
ar1.fatalities 76.2 4.02 −0.00893 38.1 2 76.7
exch.fatalities 37.4 4.02 −0.00891 18.7 2 37.9
QIC(unstr.fatalities, ar1.fatalities)
QIC QICu Quasi Lik CIC params QICC
unstr.fatalities 144095.8 4.01 −0.00736 72047.9 2 144141.8
ar1.fatalities 76.2 4.02 −0.00893 38.1 2 76.7
Across these comparisons, the exchangeable structure yields the lowest QIC
and CIC values and would therefore be selected as the preferred correlation
structure for these data.
$$m_i = a_0 + a_1 x_{1i} + e_i \qquad (11.6)$$

where $m_i$ is the mediator and $x_{1i}$ is the predictor for individual i. The corresponding model for the outcome is

$$y_i = b_0 + b_1 m_i + c_1 x_{1i} + e_i \qquad (11.7)$$

where $b_1$ is the coefficient for the mediator and $c_1$ is the direct effect of the predictor on the outcome $y_i$.
Parameter estimates from Models (11.6) and (11.7) can be used to calculate
the indirect relationship between x and y, as well as the total relationship
between the two variables. The indirect effect is expressed as
IND = a1 b1 (11.8)
Researchers have shown that the indirect effect does not follow a normal
distribution (MacKinnon et al., 2002) and that the bootstrap offers a viable
alternative for obtaining confidence intervals for this effect (Falk et al., 2023).
Many of the principles and structures present in the mediation model are
inherent in the multilevel context as well. For example, the models in (11.6)
and (11.7) can be extended to the multilevel context as

$$m_{ij} = a_{0j} + a_{1j} x_{1ij} + e_{ij} \qquad (11.10)$$

$$y_{ij} = b_{0j} + b_{1j} m_{ij} + c_{1j} x_{1ij} + \varepsilon_{ij} \qquad (11.11)$$
The terms in (11.10) and (11.11) are analogous to those in (11.6) and (11.7),
with the j subscript used to denote the level-2 unit. These models rest on the
assumption that the level-1 errors are normally distributed and uncorre-
lated with the independent variables and random effects. Likewise, the
level-2 errors are also assumed to be normally distributed and independent
of the other terms in the model.
The random effects in the multilevel mediation model (intercepts and
slopes for the mediator and outcome variable models) are assumed to
follow the multivariate normal distribution and are indexed as:
u0j = Random intercept effect for the mediation model of level-2 unit j
u1j = Random slope effect for the mediation model of level-2 unit j
v0j = Random intercept effect for the outcome variable model of level-2
unit j
v1j = Random slope effect of the predictor for the outcome variable
model of level-2 unit j
v2j = Random slope effect of the mediator for the outcome variable
model of level-2 unit j
The indirect effect of the predictor on the outcome through the mediator is
then written as

$$IND = a_1 b_1 + \sigma_{u_{a1j} v_{b1j}} \qquad (11.12)$$

where
$\sigma_{u_{a1j} v_{b1j}}$ = covariance between the random effects for the predictor on
the mediator and the mediator on the outcome.
When the random coefficients are not included in the model, $\sigma_{u_{a1j} v_{b1j}} = 0$ and
the indirect effect takes the same form as in the single-level case.
Were the researcher to use a standard approach to mediation modeling by
fitting models (11.10) and (11.11) separately and then using (11.12) to
calculate the indirect effect, $\sigma_{u_{a1j} v_{b1j}}$ would not be available, given that the
two models were fit separately. Bauer et al. (2006) proposed a methodology
to address this issue involving the restructuring of the data such that a new
dependent variable $z_{ij}$ is created with $y_{ij}$ and $m_{ij}$ alternating with one
another. In addition, two new variables ($s_{y_{ij}}$ and $s_{m_{ij}}$) are created to indicate
whether the outcome variable is y or m. Models (11.10) and (11.11) can then
be combined as in (11.13).

$$z_{ij} = a_{0j} s_{m_{ij}} + a_{1j}\left(s_{m_{ij}} x_{1ij}\right) + b_{0j} s_{y_{ij}} + b_{1j}\left(s_{y_{ij}} m_{1ij}\right) + c_{1j}\left(s_{y_{ij}} x_{1ij}\right) + e_{z_{ij}} \qquad (11.13)$$
This model allows for the estimation of the full set of fixed and random effects
(intercepts and slopes) associated with the mediation model. The reader
interested in a more detailed description of this approach to restructuring the
data and estimating the multilevel mediation model is referred to Bauer et al.
(2006). In R, this restructuring and estimation can be done using the
multilevelmediation library as described below.
When considering mediation models in the context of multilevel data, it is
also important to denote the levels of each variable in the analysis. For
example, if the outcome, mediator, and predictor variables are all at level 1,
the resulting model is said to involve 1-1-1 mediation. When the predictor is
measured at level 2 and both the mediator and outcome are level-1 variables,
the resulting model is 2-1-1, whereas when both the predictor and mediator
are at level 2 and the outcome is at level 1, the model is denoted as 2-2-1.
Currently, R software is limited to fitting the 1-1-1 multilevel mediation
model, which will therefore be the focus of this chapter.
As is true for the single-level mediation model, because it cannot be
assumed to follow a normal distribution, inference for the indirect effect
should be conducted using the bootstrap (Shrout & Bolger, 2002). However,
in the context of multilevel data, bootstrap sampling is more complicated
than in the single level case. The key complication comes in the decision of
whether to resample at level 1, level 2, or both levels 1 and 2. When data are
resampled at level 2, all of the level-1 units within the selected level-2 unit
will be included in the analysis. Conversely, when data are resampled at
level-1 only, the level-2 units associated with the selected cases are carried
into the analysis but not sampled themselves. Finally, it is possible to draw
bootstrap samples at both level 2 and level 1, in which case first level-2 units
are selected and then resampling of level-1 units within the selected level-2
units are resampled with replacement. The choice of which approach to use
for a given problem should be based upon how the initial sampling was
carried out. For example, if the study design involved randomly sampling
level-2 units (e.g., schools) and then all individuals within the selected
clusters were included in the study (e.g., all students in the school
participated), then bootstrapping at level 2 only would be appropriate. If
random sampling was carried out at both levels (e.g., schools are randomly
sampled and then students within randomly selected schools are randomly
sampled), bootstrapping at level 1 and level 2 should be used. Finally, if
individuals at level 1 were randomly selected and level-2 membership is a
feature of those individuals, but not a basis for sampling, the researcher can
use bootstrapping at level 1.
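The level-2 resampling scheme described above can be sketched in a few lines of base R: clusters are drawn with replacement, and every level-1 row belonging to a drawn cluster is kept (function and variable names are illustrative):

```r
# Minimal sketch of one level-2 (cluster) bootstrap resample
cluster_boot <- function(data, id) {
  ids <- unique(data[[id]])
  picked <- sample(ids, size = length(ids), replace = TRUE)
  # duplicate clusters that happen to be drawn more than once
  do.call(rbind, lapply(picked, function(g) data[data[[id]] == g, ]))
}

set.seed(42)
d <- data.frame(state = rep(c("a", "b", "c"), each = 2), y = 1:6)
nrow(cluster_boot(d, "state"))  # always 6: three drawn clusters of two rows
```

With balanced clusters, the resampled data set always has the same number of rows as the original; what varies is which clusters appear and how often.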
In order to demonstrate the fitting of multilevel mediation models using R,
we will refer back to the traffic fatalities data that have been the primary focus
so far in this chapter. In this case, we will examine a model in which the
outcome variable is the fatality rate, the predictor is the tax on a case of beer,
and the mediator is the income per capita. Each of these variables was
measured at level 1, meaning that we are interested in fitting a 1-1-1
mediation model. Prior to using the correct multilevel approach, let’s first
consider the naïve single-level mediation model, ignoring the level-2 variable
(state), which we can do using the process macro (https://
processmacro.org/index.html). Prior to fitting the model, we will create a
subset of data containing only the variables of interest and then standardize
them.
Fatalities.subset<-
Fatalities[,c("beertax","fatalities_per_mile","income","state")]
Fatalities.subset$beertax.z<-scale(Fatalities.subset$beertax)
Fatalities.subset$fatalities.z<-
scale(Fatalities.subset$fatalities_per_mile)
Fatalities.subset$income.z<-scale(Fatalities.subset$income)
We can now fit the model using the process macro. To do so, we need to
specify the dataframe being used, the outcome (y), predictor (x), and
mediator variables (m), and the model that we wish to fit. Model 4
corresponds to the partially mediated model in Figure 11.1.
process(data=Fatalities.subset, y="fatalities.z", x="beertax.z",
m="income.z", model=4)
[Path diagram: beer tax predicting income, which in turn predicts fatalities, with a direct path from beer tax to fatalities.]
FIGURE 11.1
Partial mediation model.
*******************************************************
Model : 4
Y : fatalities.z
X : beertax.z
M : income.z
*******************************************************
Outcome Variable: income.z
Model Summary:
R R-sq MSE F df1 df2 p
0.3975 0.1580 0.8445 62.6838 1.0000 334.0000 0.0000
Model:
coeff se t p LLCI ULCI
constant −0.0000 0.0501 −0.0000 1.0000 −0.0986 0.0986
beertax.z −0.3975 0.0502 −7.9173 0.0000 −0.4963 −0.2988
*******************************************************
Outcome Variable: fatalities.z
Model Summary:
R R-sq MSE F df1 df2 p
0.5547 0.3077 0.6964 74.0115 2.0000 333.0000 0.0000
Model:
coeff se t p LLCI ULCI
constant −0.0000 0.0455 −0.0000 1.0000 −0.0896 0.0896
beertax.z 0.0495 0.0497 0.9965 0.3197 −0.0482 0.1473
income.z −0.5332 0.0497 −10.7303 0.0000 −0.6309 −0.4354
Direct effect of X on Y:
effect se t p LLCI ULCI
0.0495 0.0497 0.9965 0.3197 −0.0482 0.1473
Indirect effect(s) of X on Y:
Effect BootSE BootLLCI BootULCI
income.z 0.2119 0.0320 0.1565 0.2821
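As a quick check on the single-level results, the indirect effect reported by process is simply the product of the two path estimates shown above:

```r
# Path estimates taken from the process output above
a1 <- -0.3975   # beertax.z -> income.z
b1 <- -0.5332   # income.z -> fatalities.z
a1 * b1         # approx 0.2119, the reported indirect effect
```

The two negative paths multiply to a positive indirect effect, matching the bootstrap point estimate of 0.2119.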
library(multilevelmediation)
library(boot)
Fatalities.modmed.z<-modmed.mlm(Fatalities.subset, L2ID=
"state", X="beertax.z", Y="fatalities.z", M="income.z",
random.a=FALSE, random.b=FALSE, random.cprime=FALSE,
control=list(opt="REML"))
We will use the function modmed.mlm, and save the output in the object
Fatalities.modmed.z. First, we must indicate the name of the dataframe
that contains the variables of interest. We then define the variable that denotes
level-2 membership using the L2ID subcommand, followed by the predictor
(X), the outcome (Y), and the mediator (M). Next, we must indicate whether we
want to include random effects for the slopes relating the predictor to the
mediator (random.a), the mediator to the outcome (random.b), and the
predictor to the outcome (random.cprime). In this example, we have set
these to FALSE, meaning that the only random effects are for the intercepts.
We can obtain the output using the summary command.
summary(Fatalities.modmed.z$model)
Formula: ~0 + Sm + Sy | L2id
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | Sm
Parameter estimates:
0 1
1.000 0.808
Fixed effects: as.formula(fixed.formula)
Value Std.Error DF t-value p-value
Sm 0.000 0.1267 620 0.00 1.000
Sy 0.000 0.1103 620 0.00 1.000
SmX −0.552 0.1027 620 −5.37 0.000
SyX 0.158 0.1024 620 1.54 0.124
SyM −0.285 0.0626 620 −4.56 0.000
Correlation:
Sm Sy SmX SyX
Sy −0.323
SmX 0.000 0.000
SyX 0.000 0.000 −0.233
SyM 0.000 0.000 −0.015 0.300
However, the standard errors for the correctly specified multilevel
model are larger than those for the naïve model, which is to be expected, per
our discussion in Chapter 2.
The random intercept standard deviations appear under the Random
effects heading. These values are 0.867 for the mediator model intercept,
0.744 for the outcome model intercept, and 0.461 for the level-1 (residual)
standard deviation. The correlation between the random intercept effects is
−0.336, meaning that states with higher mean incomes across years tended
to have lower mean fatality rates.
As was the case for single-level mediation, the indirect effect cannot be
assumed to follow a normal distribution, thereby making the standard
hypothesis tests inappropriate. Therefore, we will use
the bootstrap to make inferences about this term, as described above. In this
case, we will use the level-2 bootstrap in order to retain all of the years for a
given state. If we were to use some type of level-1 bootstrapping, the
temporal nature of the data would be disrupted when only some years were
selected for a given state. The R command needed to conduct the level-2
bootstrap appears below.
##LEVEL-2 BOOTSTRAP##
boot.Fatalities_L2.z <- boot(Fatalities.subset,
statistic = boot.modmed.mlm, R = 1000,
L2ID="state", X="beertax.z", Y="fatalities.z", M="income.z",
random.a = FALSE, random.b = FALSE, random.cprime = FALSE,
type = "all", boot.lvl = "2",
control=list(opt="REML")
)
We will use the boot command from the boot library and save the results
in the object boot.Fatalities_L2.z. We need to specify the dataframe
and the statistic that we would like to compute, which is the full model
(boot.modmed.mlm). We requested R=1,000 bootstrap samples, and then need to
specify the cluster identifier (L2ID), and the predictor (X), outcome (Y), and
mediator (M) variables, as well as whether we want any random effects, as
was described above for the modmed.mlm function. Finally, we indicate the
level of the bootstrap (boot.lvl), in this case "2". This value would be "1" for
the level-1 bootstrap and "both" for the combined level-1 and level-2 bootstrap.
We can extract the effects and confidence intervals, as below.
extract.boot.modmed.mlm(boot.Fatalities_L2.z, type =
"indirect")
$CI
2.5% 97.5%
0.0617 0.3054
$est
[1] 0.157
$est
[1] -0.552
$est
[1] -0.285
extract.boot.modmed.mlm(boot.Fatalities_L2.z, type =
"cprime")
$CI
2.5% 97.5%
−0.081 0.527
$est
[1] 0.158
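As with the single-level model, the indirect effect point estimate can be verified by hand. Because no random slopes were included, the covariance term in (11.12) is zero, so the indirect effect is just the product of the a and b paths from the model output:

```r
# Fixed effects from the modmed.mlm summary above
a_path <- -0.552   # SmX: beertax.z -> income.z
b_path <- -0.285   # SyM: income.z -> fatalities.z
a_path * b_path    # approx 0.157, matching the bootstrap point estimate
```

The bootstrap is still needed for the confidence interval, but the point estimate follows directly from the fixed effects.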
Multilevel Lasso
In some research contexts, the number of variables that can be measured (p)
approaches, or even exceeds, the number of individuals on whom such
measurements are made. In these cases, the lasso estimator minimizes
$$e^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\hat{\beta}_j| \qquad (11.14)$$

where $\lambda$ is the tuning parameter governing how strongly the regression coefficients $\hat{\beta}_j$ are shrunken toward 0.
In summary, the goal of the lasso estimator is to eliminate from the model
those independent variables that contribute very little to the explanation of
the dependent variable, by setting their $\hat{\beta}$ values to 0, while at the same time
retaining independent variables that are important in explaining y. The
optimal $\lambda$ value is specific to each data analysis problem. A number of
approaches for identifying it have been recommended, including the use of
cross-validation to minimize the mean squared error (Tibshirani, 1996), or the
selection of the $\lambda$ that minimizes the Bayesian information criterion (BIC). This
latter approach was recommended by Schelldorfer et al. (2011), who showed
that it works well in many cases. Zhao and Yu (2006) also found the use of the
BIC for this purpose to be quite effective. With this approach, several values
of $\lambda$ are used, and the BIC values for the models are compared. The model
with the smallest BIC is then selected as being optimal.
Schelldorfer, Bühlmann, and van de Geer (2011) described an extension of
the lasso estimator that can be applied to multilevel models. The multilevel
lasso (MLL) utilizes the lasso penalty function, with additional terms to
account for the variance components associated with multilevel models.
The MLL estimator minimizes the following function:
$$Q_\lambda(\beta, \tau^2, \sigma^2) = \frac{1}{2}\ln|V| + \frac{1}{2}(y - \hat{y})'V^{-1}(y - \hat{y}) + \lambda \sum_{j=1}^{p} |\hat{\beta}_j| \qquad (11.15)$$

where V is the covariance matrix of the response implied by the random effects in the model.
From (11.15), we can see that model parameter estimates are obtained with
penalization applied to the level-1 coefficients; otherwise, the estimator
works similarly to the single-level lasso. In order to conduct inference for the MLL
model parameters, standard errors must be estimated. However, the MLL
algorithm currently does not provide standard error estimates, meaning
that inference is not possible. Therefore, interpretation of results from
analyses using MLL will focus on which coefficients are not shrunken to 0,
as we will see in the example below.
A key aspect of successfully using the lasso involves the identification of the
optimal tuning parameter, lambda, which is generally recommended to
be done based on minimizing the BIC (Hastie, Tibshirani, & Wainwright,
2015). The following R code creates a set of 100 potential lambda values,
fits the lasso model using each of these, and then identifies the one that
yields the lowest BIC value.
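The elided search loop might look like the following sketch (assuming the glmmLasso package and the Orthodont data from nlme are available; the grid endpoints are illustrative and will not necessarily reproduce the exact values reported below):

```r
# Hedged sketch of a BIC-based lambda search for glmmLasso
library(glmmLasso)
library(nlme)
data(Orthodont)

# 100 candidate lambda values on a log scale (illustrative grid)
lambdas <- exp(seq(log(0.01), log(100), length.out = 100))

bics <- sapply(lambdas, function(lam) {
  fit <- glmmLasso(distance ~ age + Sex, rnd = list(Subject = ~1),
                   lambda = lam, data = Orthodont)
  fit$bic   # BIC stored on the fitted glmmLasso object
})

lambdas[which.min(bics)]  # lambda yielding the smallest BIC
```

The selected lambda is then plugged into the final glmmLasso call, as shown in the text.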
The smallest BIC value (503.2942140) was associated with a lambda value of
0.0869749.
Once we identify the optimal tuning parameter value, we can then fit the
model. The syntax that we use here generally matches the general structure
demonstrated in Chapter 3. Notice that the major difference from the examples
in the earlier chapter is that we include the optimal value of the tuning
parameter (lambda=0.0869749), based on our search.
library(glmmLasso)
model11.2<-glmmLasso(distance ~ age + Sex,
rnd=list(Subject=~1), lambda=0.0869749, data=Orthodont)
summary(model11.2)
The resulting output for the fixed and random effects appears below.
Call:
glmmLasso(fix = distance ~ age + Sex, rnd = list(Subject = ~1),
data = Orthodont, switch.NR = TRUE, final.re = TRUE, lambda =
0.0869749)
Fixed Effects:
Coefficients:
Estimate StdErr z.value p.value
(Intercept) 17.706713 0.374833 47.2390 < 2.2e−16 ***
Random Effects:
StdDev:
Subject 1.871619
The fixed effects parameter estimates are essentially the same as those for the
standard REML-based model, as are the hypothesis test results. These results
are not a surprise, given the small optimal value of the tuning parameter.
Thus, we would conclude that essentially no shrinkage was necessary in this
case. Similarly, the standard deviation of the random intercept was also quite
close to that for the standard estimator.
By way of demonstration, let’s apply a much larger value of lambda so
that we can see how greater shrinkage impacts the parameter estimates. For
this example, we will use a tuning parameter value of 50.
Call:
glmmLasso(fix = distance ~ age + Sex, rnd = list(Subject = ~1),
data = Orthodont, lambda = 50)
Fixed Effects:
Coefficients:
Estimate StdErr z.value p.value
(Intercept) 17.706713 0.374833 47.2389 < 2.2e−16 ***
age 0.660185 0.043033 15.3413 < 2.2e−16 ***
SexFemale −2.321023 0.762861 −3.0425 0.002346 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Random Effects:
StdDev:
Subject
Subject 0.2970685
Notice that all of the parameter estimates are reduced with this greater
penalty, with the random intercept standard deviation being particularly
impacted.
Additional Modeling Frameworks for Multilevel Data 277
where

Given that there are p dependent variables in the model, the random effect
variances in a univariate multilevel model (i.e., the variances of U_c and R_{ic})
become the covariance matrices for these model terms:

T = \mathrm{cov}(U_c)

\[
y_{icp} = \sum_{s=1}^{m} \beta_{0sp} d_{sicp} + \sum_{s=1}^{m} \beta_{1sp} x_{1sic} d_{sicp} + \sum_{s=1}^{m} U_{spc} d_{sicp} + \sum_{s=1}^{m} R_{sicp} d_{sicp} \tag{11.18}
\]
This model yields hypothesis testing results for each of the response
variables, accounting for the presence of the others in the data. In order to
test the multivariate null hypothesis of group mean equality across all of
the response variables, we can fit a null multivariate model to the data in
which the independent variables are not included. The fit of this null
model can then be compared with that of the full model including the
independent variable(s) of interest; when the independent variable is
categorical, this comparison tests the null hypothesis of no multivariate
group mean differences. The comparison can be carried out using a
likelihood ratio test. If the resulting p-value is below the chosen
threshold (e.g., α = 0.05), then we
would reject the null hypothesis of no group mean differences, because the
fit of the full model including the group effect was better than that of
the null model. The reader interested in applying this model using R is
encouraged to read the excellent discussion and example provided in
Chapter 16 of Snijders and Bosker (2012).
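The likelihood ratio comparison itself is simple arithmetic on the two models' log-likelihoods. Here is a small sketch with hypothetical log-likelihood values; the closed-form chi-square tail probability used below is valid only for an even number of degrees of freedom:

```python
import math

def lr_test(loglik_null, loglik_full, df):
    """Likelihood ratio test comparing nested models.

    df must be even here, so the chi-square survival function has the
    closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    lr = 2.0 * (loglik_full - loglik_null)   # LR statistic
    half = lr / 2.0
    k = df // 2
    p = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return lr, p

# Hypothetical log-likelihoods for a null and a full multivariate model
# that differ by two parameters (one group effect per response variable).
lr, p = lr_test(loglik_null=-1520.4, loglik_full=-1511.7, df=2)
```

A p-value below the chosen α would lead us to prefer the full model, i.e., to reject the multivariate null hypothesis of group mean equality.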
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \qquad (11.19)

where
y = dependent variable
x = independent variable
\beta_j = coefficient for model term j
The cubic spline then fits different versions of this model between each pair of
adjacent knots. The more knots in the GAM, the more piecewise polynomials
that will be estimated, and the more potential detail about the relationship
between x and y will be revealed. GAMs take these splines and apply them to
a set of one or more predictor variables as in (11.20).
y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i \qquad (11.20)

where f_j is the smoothing (spline) function applied to predictor j, and \varepsilon_i is the random error term for subject i. Estimation proceeds by minimizing the penalized sum of squares (PSS):

PSS = \sum_{i=1}^{N} \left\{ y_i - \beta_0 - \sum_{j=1}^{p} f_j(x_{ij}) \right\}^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j \qquad (11.21)
Here, y_i is the value of the response variable for subject i, and \lambda_j is a tuning
parameter for variable j such that \lambda_j \geq 0. The researcher can use \lambda_j to control
the degree of smoothing that is applied to the model. A value of 0 results in
an unpenalized function and relatively less smoothing, whereas values
approaching ∞ result in an extremely smoothed (i.e., linear) function relating
the outcome to the predictors. The GAM algorithm works in an iterative
fashion, beginning by setting \beta_0 to the mean of Y. Subsequently, a
smoothing function is applied to each of the independent variables in turn,
minimizing the PSS. The iterative process continues until the smoothing
functions for the various predictor variables stabilize, at which point the final
model parameter estimates are obtained. Based upon empirical research, a
recommended value for \lambda_j is 1.4 (Wood, 2006), and as such it will be used
here.
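The backfitting idea just described can be sketched with a deliberately crude running-mean smoother standing in for the penalized splines (hypothetical data; real GAM software such as mgcv or gamm4 uses penalized regression splines, not a moving average):

```python
def moving_average_smooth(x, r, window=3):
    """Crude running-mean smoother standing in for a penalized spline."""
    n = len(r)
    order = sorted(range(n), key=lambda i: x[i])
    fitted = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - window)
        hi = min(n, rank + window + 1)
        neighbors = [r[order[k]] for k in range(lo, hi)]
        fitted[i] = sum(neighbors) / len(neighbors)
    return fitted

def backfit(y, predictors, n_iter=20):
    """Backfitting: set beta0 to mean(y), then smooth each predictor
    against the partial residuals in turn until the functions stabilize."""
    n = len(y)
    beta0 = sum(y) / n
    f = [[0.0] * n for _ in predictors]
    for _ in range(n_iter):
        for j, xj in enumerate(predictors):
            partial = [y[i] - beta0
                       - sum(f[k][i] for k in range(len(predictors)) if k != j)
                       for i in range(n)]
            f[j] = moving_average_smooth(xj, partial)
            mean_fj = sum(f[j]) / n          # center each smooth at 0
            f[j] = [v - mean_fj for v in f[j]]
    return beta0, f
```

With a single predictor the loop reduces to one smoothing pass, but with several predictors each smooth is repeatedly updated against the residuals left by the others, which is the iterative process described above.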
GAM rests on the assumption that the model errors are independent of
one another. However, when data are sampled from a clustered design,
such as measurements made longitudinally for the same individual, this
assumption is unlikely to hold, resulting in estimation problems, particu-
larly with respect to the standard error (Wang, 1998). In such situations, an
alternative approach to modeling the data, which accounts for the clustering
of data points is necessary. The generalized additive mixed model (GAMM)
accounts for the presence of clustering in the form of random effects (Wang,
1998). GAMM takes the form:
y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + Z_i b + \varepsilon_i \qquad (11.22)

where Z_i is the design matrix of the random effects for observation i and b is
the vector of random effects. The other model terms in (11.22) are the same as
in (11.20). GAMM is fit in the same fashion as GAM, with an effort to
minimize the PSS, and with \lambda_j used to control the degree of smoothing.
library(gamm4)
Model11.4<-gamm4(geread~s(npaverb), family=gaussian,
random=~(1|school), data=prime_time)
summary(Model11.4$mer)
Scaled residuals:
Min 1Q Median 3Q Max
−2.3364 −0.6007 −0.1976 0.3110 4.7359
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1032 0.3213
Xr s(npaverb) 2.0885 1.4452
Residual 3.8646 1.9659
Number of obs: 10765, groups: school, 163; Xr, 8
Fixed effects:
Estimate Std. Error t value
X(Intercept) 4.34067 0.03215 135.00
Xs(npaverb)Fx1 2.30471 0.25779 8.94
Here, we see estimates for variance associated with the school random
effect, along with the residual. The smoother also has a random component,
though it is not the same as a random slope for the linear portion of the
model, which we will see below. The fixed effects in this portion of
the output correspond to those that we would see in a linear model,
so that X(Intercept) is a standard intercept term, and Xs(npaverb)Fx1
is the estimate of a linear relationship between npaverb and geread. Here
we see that the t-value for this linear effect is 8.94, suggesting that there may
be a significant linear relationship between the two variables.
The nonlinear smoothed portion of the relationship can be obtained using
the following command.
summary(Model11.4$gam)
Family: gaussian
Link function: identity
Formula:
geread ~ s(npaverb)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.34067 0.03215 135 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.28
lmer.REML = 45286 Scale est. = 3.8646 n = 10765
When reading these results, note that the value for the intercept fixed effect
and its standard error are identical to the results from the mer output. Of
more interest when using GAM is the nature of the smoothed nonlinear
term, which appears as s(npaverb). We can see that this term is
statistically significant, with a p-value well below 0.05, meaning that there
is a nonlinear relationship between npaverb and geread. Also note that
the adjusted R2 is 0.28, meaning that approximately 28% of the variance in
the reading score is associated with the model. In order to characterize the
nature of the relationship between the reading and verbal scores, we can
examine a plot of the GAM function.
plot(Model11.4$gam)
[Figure: estimated smooth s(npaverb, 6.03) plotted against npaverb (0–100), with dashed 95% confidence bands.]
The plot includes the estimated smoothed relationship between the two
variables, represented by the solid line, and the 95% confidence interval of the
curve, which appears as the dashed lines. The relationship between the two
variables is positive, such that higher verbal scores are associated with higher
reading scores. However, this relationship is not strictly linear, as we see a
stronger relationship between the two variables for those with relatively low
verbal scores, as well as for those with relatively higher verbal scores, when
compared with individuals whose verbal scores lie in the midrange.
Just as with linear multilevel models, it is also possible to include random
coefficients for the independent variables. The syntax for doing so, along
with the resulting output, appears below.
Model11.5<-gamm4(geread~s(npaverb), family=gaussian,
random=~(npaverb|school), data=prime_time)
summary(Model11.5$mer)
Linear mixed model fit by REML ['lmerMod']
Scaled residuals:
Min 1Q Median 3Q Max
−2.8114 −0.5995 −0.2001 0.3159 4.8829
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 2.414e−02 0.1554
npaverb 8.101e−05 0.0090 −1.00
Xr s(npaverb) 1.606e+00 1.2672
Residual 3.805e+00 1.9507
Number of obs: 10765, groups: school, 163; Xr, 8
Fixed effects:
Estimate Std. Error t value
X(Intercept) 4.32243 0.03308 130.671
Xs(npaverb)Fx1 2.18184 0.24110 9.049
The variance for the random effect of npaverb is smaller than the variance
terms for the other model terms, indicating its relative lack of import in this
case. We would interpret this result as indicating that the linear portion of
the relationship between npaverb and geread is relatively similar across
the schools.
summary(Model11.5$gam)
Family: gaussian
Formula:
geread ~ s(npaverb)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.32243 0.03308 130.7 <2e−16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.28
lmer.REML = 45167 Scale est. = 3.8052 n = 10765
The smoothed component of the model is essentially the same for the random
coefficient model as it was for the random-intercept-only model. This latter
result is further demonstrated in the plot below, which appears very
similar to that for the random intercept model.
plot(Model11.5$gam)
[Figure: estimated smooth s(npaverb, 5.74) plotted against npaverb (0–100), with dashed 95% confidence bands.]
In this section, we have seen how GAMs can be incorporated into the
multilevel modeling framework. We elected to use the gamm4 library, but the
reader should also be aware of the mgcv library, which can also be used to fit
multilevel GAMs. In either case, our focus is typically on the smoothed
portion of the model, as it provides us with information about any nonlinear
relationships between the predictors and the dependent variable. It is also
worth noting that should a relationship be strictly linear in nature, this will
be reflected by the GAM in the form of a strictly linear smoothed function,
evident in the plot. Finally, there are a large number of smoothing
parameters that we can use to improve the fit of the model and home in
on a more accurate picture of the relationships among our variables. For
the examples appearing in this chapter, we used the default settings for
gamm4, but you should be aware that in practice it may be beneficial to
examine a number of different settings in order to obtain a more accurate
picture of the true model structure.
Summary
Our goal in this chapter was to provide the reader with a set of tools that
can be used to address a variety of research problems that researchers
frequently face in practice, and for which the standard modeling
techniques discussed in the earlier chapters may not be sufficient. For
example, in Chapter 11, we learned about two alternatives to the standard
MLMs that offer advantages in cases with small samples at level 2 (FEM)
and that can provide the researcher with information about the underlying
correlation structure (GEE). We then saw how mediation models, which
allow researchers to test complex hypotheses about the interrelationships
among a set of variables, can be applied to multilevel data structures in
which the predictor, mediator, and outcome variables are all measured at
level 1.
Next, we turned our attention to models designed for high-dimensional
situations, in which the number of independent variables approaches (or
even surpasses) the number of observations in the data set. Standard
estimation algorithms will frequently yield biased parameter estimates and
inaccurate standard errors in such cases. Penalized estimators such as the
lasso can be used to reduce the dimensionality of the data statistically,
rather than through an arbitrary selection by a researcher of predictors to
retain. We also saw that multilevel regression models can be easily
extended to situations in which there are multiple dependent variables,
using a multivariate multilevel framework. In addition to multivariate data
structures and high dimensionality, we learned about models that are
appropriate for situations in which the relationships between the indepen-
dent and dependent variables are not linear in nature. There are a number
of possible solutions for such a scenario, with our focus being on a spline-
based solution in the form of the GAM. This modeling strategy provides the
data analyst with a set of tools for selecting the optimal solution for a given
dataset, and for characterizing the nonlinearity present in the data both with
coefficient estimates and graphically. Thus, the purpose of this chapter was
to provide the researcher with extensions of the multilevel modeling
approach that can be applied to a wide set of research situations.
12
Advanced Issues in Multilevel Modeling
The purpose of this chapter is to introduce a wide array of topics in the area
of multilevel analysis that do not fit neatly in any of the other chapters in
the book. We refer to these as advanced issues because they represent
extensions, of one kind or another, on the standard multilevel modeling
framework that we have discussed heretofore. In this chapter, we will
describe how the estimation of multilevel model parameters can be adjusted
for the presence of outliers using robust or rank-based methods. As we will
see, such approaches provide the researcher with powerful tools for handling
situations when data do not conform to the distributional assumptions
underlying the models that we discussed in Chapters 3 and 4. We will finish
out the chapter with a discussion of predicting level-2 outcome variables
with level-1 independent variables, and with a description of approaches
for power analysis/sample size determination in the context of multilevel
modeling.
D_i = \frac{\sum_{j=1}^{N} (e_j - e_{ji})^2}{k \, MS_r} \qquad (12.1)

where
e_j = residual for observation j based on the full dataset
e_{ji} = residual for observation j when observation i is deleted
k = number of parameters in the model
MS_r = residual mean square.
There are no hard and fast rules for how large Di should be in order for us to
conclude that it represents an outlying observation. Fox (2016) recommends
that the data analyst flag observations that have Di values that are unusual
when compared to the rest of those in the dataset, and this is the approach
that we will recommend as well.
Another influence diagnostic, which is closely related to Di , is DFFITS.
This statistic compares the predicted value for individual i when the
full dataset is used ( ŷi ), against the prediction for individual i when
individual j is dropped from the data ( ŷij ). For individual j, DFFITS is
calculated as
DFFITS_j = \frac{\hat{y}_i - \hat{y}_{ij}}{\sqrt{MSE_j \, h_j}} \qquad (12.2)

where
MSE_j = mean squared error for the model with individual j deleted
h_j = leverage value for individual j.
As was the case with Di , there are no hard and fast rules about how large
DFFITSj should be in order to flag observation j as an outlier. Rather, we
examine the set of DFFITSj and focus on those that are unusually large (in
absolute value) when compared to the others. One final outlier detection
tool for single-level models that we will discuss here is the COVRATIO,
which measures the impact of an observation on the precision of model
estimates. For individual i, this statistic is calculated as
COVRATIO_i = \frac{1}{(1 - h_i)\left[\dfrac{n - k - 2 + E_i^2}{n - k - 1}\right]^{k+1}} \qquad (12.3)
where
E_i = studentized residual for observation i
h_i = leverage value for observation i
k = number of predictors in the model
n = total sample size.
Fox (2016) states that COVRATIO values greater than 1 improve the precision
of the model estimates, whereas those with values less than 1 decrease the
precision of the estimate. Clearly, it is preferable for observations to increase
model precision, rather than decrease it.
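To make the leave-one-out logic behind these diagnostics concrete, here is a small sketch that computes DFFITS for a simple single-predictor regression by actually refitting with each case deleted (hypothetical data; real software uses algebraic shortcuts rather than brute-force refitting):

```python
import math

def ols_fit(x, y):
    """Simple-regression OLS: returns (b0, b1, residuals, leverages, mse)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    b1 = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    b0 = ybar - b1 * xbar
    resid = [y[i] - (b0 + b1 * x[i]) for i in range(n)]
    lev = [1 / n + (x[i] - xbar) ** 2 / sxx for i in range(n)]
    mse = sum(e * e for e in resid) / (n - 2)
    return b0, b1, resid, lev, mse

def dffits(x, y, j):
    """DFFITS for case j: change in its fitted value when j is deleted,
    scaled by the deleted-case standard error."""
    b0, b1, _, lev, _ = ols_fit(x, y)
    xd = x[:j] + x[j + 1:]
    yd = y[:j] + y[j + 1:]
    b0d, b1d, _, _, msed = ols_fit(xd, yd)
    yhat_full = b0 + b1 * x[j]
    yhat_del = b0d + b1d * x[j]
    return (yhat_full - yhat_del) / math.sqrt(msed * lev[j])

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 12.0]   # last point is an outlier
vals = [abs(dffits(x, y, j)) for j in range(len(x))]
```

The outlying final case produces a DFFITS value that stands far apart from the others, which is exactly the "unusual relative to the rest" pattern we look for when flagging observations.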
H_{i,\mathrm{Fixed}} = X_i \left( \sum_{i=1}^{N} X_i' V_i^{-1} X_i \right)^{-1} X_i' V_i^{-1} \qquad (12.4)

where X_i is the matrix of fixed effects predictor values for unit i and V_i is the
covariance matrix of the responses for unit i. Note that in (12.4), subject refers
to the level-2 grouping variable. The leverage values based on the random
effects for the subjects in the sample can be expressed in an analogous
fashion, yielding H_{i,\mathrm{Random}}.
The actual fixed and random effects leverage values correspond to the
diagonal elements of Hi , Fixed and Hi , Random , respectively. The multilevel
analog of Cook's Di can then be calculated using the following equation:
D_{Mi} = \frac{1}{m \, s_e^2} \, r_i' (I - H_{i,\mathrm{Fixed}})^{-1} V_i^{-1} X_i \left( \sum_{i=1}^{N} X_i' V_i^{-1} X_i \right)^{-1} X_i' V_i^{-1} (I - H_{i,\mathrm{Fixed}})^{-1} r_i \qquad (12.6)

where r_i is the vector of residuals for unit i, m is the number of model
parameters, and s_e^2 is the residual variance estimate.
The influence of each unit on the variance component estimates can be
assessed with the relative variance change (RVC) statistic:

RVC_i = \frac{\hat{\theta}_{(i)}}{\hat{\theta}} - 1 \qquad (12.7)
where \hat{\theta} is the variance component estimate based on the full sample and \hat{\theta}_{(i)} is the estimate with unit i deleted.
When RVC is close to 0, the observation does not have much influence on
the variance component estimate, i.e., is not likely to be an outlier. As with
the other statistics for identifying potential outliers, there is not a single
agreed-upon cut-value for identifying outliers using the RVC. Rather,
observations with values that are unusual when compared to the full sample
warrant special attention in this regard.
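The RVC computation can be sketched with the sample variance standing in for a model variance component (hypothetical scores; in practice the variance components come from refitting the multilevel model with each unit deleted):

```python
def variance(vals):
    """Unbiased sample variance, standing in for a model variance component."""
    n = len(vals)
    mean = sum(vals) / n
    return sum((v - mean) ** 2 for v in vals) / (n - 1)

def rvc(vals):
    """Relative variance change for each case: variance with case i deleted,
    divided by the full-sample variance, minus 1 (values near 0 mean
    the case has little influence on the variance estimate)."""
    full = variance(vals)
    out = []
    for i in range(len(vals)):
        deleted = vals[:i] + vals[i + 1:]
        out.append(variance(deleted) / full - 1.0)
    return out

scores = [21.0, 22.5, 20.8, 23.1, 21.9, 35.0]   # hypothetical; last is extreme
changes = rvc(scores)
```

Deleting the extreme final case collapses the variance estimate, so its RVC is far from 0, while the ordinary cases barely move it.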
We can see that there are some individual measurements at both ends of the
distance scale that are separated from the main body of measurements. It's
important to remember that this graph does not organize the observations
by the individual on whom the measurements were made, whereas in the
context of multilevel data, we are primarily interested in examining the data
at that level. Nonetheless, this type of exploration does provide us with
some initial insights into what we might expect to find with respect to
outliers moving forward.
Next, we can examine a boxplot of the measurements by subject gender
(Figure 12.2).
FIGURE 12.1
Histogram showing the distance, measured in millimeters, from the center of the pituitary to
the pterygomaxillary fissure.
The distribution of distance is fairly similar for males and females, though
there is one measurement for a male subject that is quite small when compared
with the other male distances, and indeed is small when compared to the
female measurements as well. For this sample, typical male distance measure-
ments are somewhat larger than those for females. Finally, we can examine the
relationship between age and distance using a scatterplot (Figure 12.3).
FIGURE 12.2
Boxplot of the measurements of distance by sex.
Here, we can see that there are some relatively low distance measure-
ments at ages 8, 12, and 14, and relatively large measurements at ages 10
and 12.
As we noted above, outlier detection in the multilevel modeling context is
focused on level 2, which is the child being measured. In order to calculate the
various statistics described above, we must first fit the standard multilevel
model, which we do with lmer. We are interested in the relationships between
age, sex, and the distance measure.
FIGURE 12.3
Scatterplot showing the relationship between age and distance.
Below is the portion of the output showing the random and fixed effects
estimates.
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 3.267 1.807
Residual 2.049 1.432
Number of obs: 108, groups: Subject, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 17.70671 0.83392 99.35237 21.233 < 2e−16 ***
age 0.66019 0.06161 80.00000 10.716 < 2e−16 ***
SexFemale −2.32102 0.76142 25.00000 −3.048 0.00538 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results indicate that measurements made on the children when they
were older were larger, and measurements made on females were smaller.
In order to obtain the influence statistics for this problem, we will use the
HLMdiag package in R. The following commands calculate the outlier
diagnostics described above. The hlm_influence function deletes each
subject in the sample in turn, refits model12.1.lmer, and saves the
resulting influence diagnostics in the output object Model12.1.influence,
which we then print.
library(HLMdiag)
Model12.1.influence <- hlm_influence(model = model12.1.lmer,
level = "Subject")
print(Model12.1.influence, n = 27)
[Figure: dot plot of the Cook's D values by subject, with subjects 13, 26, 27, and 10 standing apart from the rest.]
Based on this graph, it appears that subjects M13, F10, F11, and M10 all
have Cook's D values that are unusual when compared to the rest of the
sample.
[Figure: dot plot of the MDFFITS values by subject, with subjects 13, 27, 26, and 10 standing apart from the rest.]
The same four observations had unusually large MDFFITS values, further
suggesting that they may be potential outliers. Finally, with regard to the
fixed effects, we can examine the COVRATIO to see whether any of the
observations decrease model estimate precision.
Recall that observations with values less than 1 are associated with decreases
in estimate precision. None of the observations in this sample had values less
than 1.
Potential outliers with regard to the random effects can also be identified
using this simple graphical approach. First, we need to use the
hlm_influence function with the approx.=FALSE option, which forces a
full refitting of the model for each deleted unit.
dotplot_diag(Model12.1.influence2$rvc.sigma2, name =
"rvc", cutoff = "internal")
As with Cook's D and MDFFITS, observations with unusual RVC values are
identified as potential outliers, meaning that M9 and M13 would be likely
candidates with regard to the error variance. In terms of the random intercept
variance component, which appears below, M10, F10, and F11 were the
potential outlying observations.
Taken together, these results indicate that there are some potential outliers
in the sample. Four individuals, M13, F10, F11, and M10 all were associated
with statistics, indicating that they have outsized impacts on the fixed
effects parameter estimates and standard errors. For the random error
effect, M9 and M13 were both flagged as potential outliers, whereas for the
random intercept variance, M10, F10, and F11 were signaled as potential
outliers. Notice that there is quite a bit of overlap in these results, with F10,
F11, M10, and M13 showing up in multiple ways as potential outliers.
Having identified these values, we must next decide how we would like to
handle them. As we have already discussed, it is recommended that, if
possible, an analysis strategy designed to appropriately model data with
outliers be used, rather than the researcher simply removing them
(Staudenmayer et al., 2009). Indeed, in the current example, if we were to
remove the four subjects who were identified as potential outliers using
multiple statistics, we would reduce our already somewhat small sample size
from 27 down to 23. Thus, modeling the data using the full sample would
seem to be the more attractive option. In the following section, we will discuss
estimators that can be used for modeling multilevel data when outliers may
be present. After discussing these from a theoretical perspective, we will then
demonstrate how to employ them using R.
\begin{pmatrix} y_i \\ U_{0j} \end{pmatrix} \sim N\left[ \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_j \Psi Z_j' + \Lambda_j & Z_j \Psi \\ \Psi Z_j' & \Psi \end{pmatrix} \right] \qquad (12.8)

where \Psi is the covariance matrix of the random effects and \Lambda_j is the covariance matrix of the level-1 errors for unit j. A heavy-tailed alternative replaces the normal distribution in (12.8) with a multivariate t distribution with \nu degrees of freedom:

\begin{pmatrix} y_i \\ U_{0j} \end{pmatrix} \sim t\left[ \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_j \Psi Z_j' + \Lambda_j & Z_j \Psi \\ \Psi Z_j' & \Psi \end{pmatrix}, \nu \right] \qquad (12.9)

where the model terms are as defined in (12.8). A further option is a rank-based
estimator, joint ranking (JR; Kloke et al., 2009). In JR, the raw scores of the
dependent variable are
replaced with their ranks based on a nondecreasing score function such as
the Wilcoxon (Wilcoxon, 1945). Assuming a common marginal distribution
of level-1 errors (\varepsilon_{ij}) across level-2 units, estimation of the fixed effects
(\beta_1, \beta_{00}) is based on Jaekel's (1972) dispersion function:
\hat{\beta} = \operatorname{argmin}_{\beta} \, \lVert Y - X\beta \rVert \qquad (12.10)

where
Y = dependent variable
X = matrix of independent variable values
\hat{\beta} = vector of fixed effects estimates for the model

and the dispersion pseudo-norm is

\lVert Y - X\hat{\beta} \rVert = \sum_{i=1}^{N} a[R(y_{ij} - \hat{y}_{ij})](y_{ij} - \hat{y}_{ij})

with a(\cdot) the score function and R(\cdot) the rank of the residual.
than 50 level-2 units, JR_SE yielded somewhat larger standard errors than
was the case for JR_CS, thereby leading to more conservative inference with
regard to the statistical significance of the parameter estimates. Kloke and
McKean (2013) also found that when the exchangeability assumption was
violated, JR_CS standard error estimates were inflated, also reducing the
power for inference regarding these parameters. Given this combination of
results, Kloke and McKean recommended that JR_SE be used as the default
method for estimating model parameter standard errors. However, when
the level-2 sample size is small, researchers should use JR_CS, unless they
know that the exchangeability assumption has been violated.
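The dispersion-minimization idea in (12.10) can be sketched for a single slope coefficient with Wilcoxon scores, where the rank-based estimate largely ignores an outlying case (hypothetical data; jrfit uses a proper optimizer and handles intercepts and clustering, unlike the crude grid search here):

```python
import math

def wilcoxon_scores(n):
    """Wilcoxon score function a(i) = sqrt(12) * (i/(n+1) - 0.5)."""
    return [math.sqrt(12.0) * ((i + 1) / (n + 1) - 0.5) for i in range(n)]

def jaekel_dispersion(beta, x, y):
    """Jaekel's dispersion: sum of a[R(e_i)] * e_i over residuals e = y - beta*x."""
    resid = [y[i] - beta * x[i] for i in range(len(x))]
    n = len(resid)
    a = wilcoxon_scores(n)
    order = sorted(range(n), key=lambda i: resid[i])
    ranks = [0] * n
    for r, i in enumerate(order):
        ranks[i] = r
    return sum(a[ranks[i]] * resid[i] for i in range(n))

# Grid search for the slope minimizing the dispersion (rank-based dispersion
# is invariant to location shifts, so no intercept is needed here).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 30.0]   # hypothetical data; last case is an outlier
grid = [i * 0.01 for i in range(0, 501)]
beta_hat = min(grid, key=lambda b: jaekel_dispersion(b, x, y))
```

Despite the wildly discrepant final observation, the dispersion-minimizing slope stays close to the trend of the remaining points, which is the robustness property motivating the JR approach.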
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 3.267 1.807
Residual 2.049 1.432
Number of obs: 108, groups: Subject, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 17.70671 0.83392 99.35237 21.233 < 2e−16 ***
age 0.66019 0.06161 80.00000 10.716 < 2e−16 ***
SexFemale −2.32102 0.76142 25.00000 −3.048 0.00538 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
When we ignore the potential outliers, we would conclude that age has a
statistically significant positive relationship with distance, and that, on
average, females have smaller distance measures than do males. In addition,
the variance component associated with subject is somewhat larger than
that associated with error, indicating that there is a nontrivial degree of
difference in distance measurements among individuals.
Now let's fit this model using an approach based on the ranks of the data,
rather than the raw data themselves. In order to do this, we will need to install
the jrfit library from GitHub. The commands for this installation appear below.
install.packages("devtools")
library(devtools)
install_github("kloke/jrfit")
We would then use the library command to load jrfit. After doing this, we
will need to create a matrix (X) that contains the independent variables of
interest, age and sex.
library(jrfit)
library(quantreg)
X<-cbind(Orthodont$age, Orthodont$Sex)
We are now ready to fit the model. Recall that there are two approaches for
estimating standard errors in the context of the rank-based approach, one
based on an assumption that the covariance matrix of within-subject errors is
compound symmetric, and the other using the sandwich estimator. We will
employ both for this example. First, we will fit the model with the compound
symmetry approach to standard error estimation.
Model12.2.cs<-jrfit(X, Orthodont$distance,
Orthodont$Subject, var.type='cs')
summary(Model12.2.cs)
Coefficients:
Estimate Std. Error t-value p.value
X1 20.163737 1.421843 14.1814 < 2.2e−16 ***
X2 0.625000 0.063482 9.8454 < 2.2e−16 ***
X3 -2.163741 0.812548 −2.6629 0.008979 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that in these results, X1 denotes the intercept, X2 subject age, and X3
subject Sex. These results are quite similar to those using the standard
multilevel model assuming that the data are normally distributed. The
results for the sandwich estimator standard errors appear below.
Model12.2.sandwich<-jrfit(X, Orthodont$distance,
Orthodont$Subject, var.type='sandwich')
summary(Model12.2.sandwich)
Coefficients:
Estimate Std. Error t-value p.value
The standard errors produced using the sandwich estimator were some-
what larger than those assuming compound symmetry, though in the final
analysis, the results concerning relationships between the independent
variables and the response were qualitatively the same. In summary, then,
the standard errors yielded by the rank-based approach were somewhat
larger than those produced by the standard model, though they did not
differ substantially in this example. In addition, the results of the three
models all yielded the same overall findings.
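The intuition for why sandwich-type standard errors exceed naive ones under clustering can be sketched with the simplest possible case, the standard error of a mean (hypothetical data; this is a stylized cluster-robust calculation, not jrfit's estimator):

```python
def mean_se_iid(vals):
    """Naive SE of the mean, treating all observations as independent."""
    n = len(vals)
    m = sum(vals) / n
    var = sum((v - m) ** 2 for v in vals) / (n - 1)
    return (var / n) ** 0.5

def mean_se_cluster(clusters):
    """Cluster-robust (sandwich-style) SE of the mean: square the
    cluster-level score totals instead of the individual scores,
    with a small-sample factor g/(g-1)."""
    vals = [v for c in clusters for v in c]
    n = len(vals)
    m = sum(vals) / n
    g = len(clusters)
    meat = sum(sum(v - m for v in c) ** 2 for c in clusters)
    return (g / (g - 1)) ** 0.5 * meat ** 0.5 / n

# Two hypothetical clusters whose members share a strong common shift.
clusters = [[5.0, 5.1, 5.2, 4.9], [9.0, 9.1, 8.9, 9.2]]
naive = mean_se_iid([v for c in clusters for v in c])
robust = mean_se_cluster(clusters)
```

Because members of a cluster move together, the cluster-level score totals are large, and the sandwich-style standard error correctly comes out larger than the naive one, mirroring the pattern seen in the jrfit output above.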
In order to fit the robust estimators, we will need to use the robustlmm
library in R. This package utilizes an approach based on M-estimators
(Huber, 1964) and the Design Adaptive Scale method described by Koller
and Stahel (2011). In order to fit the model based on the t distribution, the
following command is used.
library(robustlmm)
Model12.3<-rlmer(distance ~ age + Sex + (1 | Subject),
Orthodont)
summary(Model12.3)
Scaled residuals:
Min 1Q Median 3Q Max
−5.7208 −0.5593 0.0054 0.5171 5.2646
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 2.868 1.694
Residual 1.230 1.109
Number of obs: 108, groups: Subject, 27
Fixed effects:
Estimate Std. Error t value
(Intercept) 18.10603 0.70622 25.638
age 0.60831 0.04895 12.427
SexFemale −2.10075 0.71588 −2.934
The results from the robust estimator were generally similar to those for the
standard REML-based approach. The standard errors for the robust model
were slightly smaller than for the standard method, and the t-values all
exceeded 2, leading us to conclude that the relationships for both indepen-
dent variables differed from 0. Thus, we would conclude that when children
were older, they had larger distance measurements, and females had lower
mean values than did males.
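The M-estimation idea underlying such robust fits can be illustrated with a Huber-weighted location estimate computed by iterative reweighting (hypothetical data; rlmer applies this machinery to all parameters of the mixed model, with a more careful treatment of scale):

```python
def huber_weight(r, k=1.345):
    """Huber weight: 1 for small scaled residuals, k/|r| beyond the constant."""
    ar = abs(r)
    return 1.0 if ar <= k else k / ar

def m_estimate_location(vals, k=1.345, n_iter=50):
    """Iteratively reweighted location estimate with Huber weights,
    using a MAD-based value as a fixed scale estimate."""
    mu = sorted(vals)[len(vals) // 2]                # start at the median
    abs_dev = sorted(abs(v - mu) for v in vals)
    scale = abs_dev[len(vals) // 2] / 0.6745 or 1.0  # MAD-based scale
    for _ in range(n_iter):
        w = [huber_weight((v - mu) / scale, k) for v in vals]
        mu = sum(wi * vi for wi, vi in zip(w, vals)) / sum(w)
    return mu

data = [22.1, 23.4, 21.8, 22.9, 23.0, 22.4, 40.0]   # hypothetical; one outlier
robust_mu = m_estimate_location(data)
plain_mean = sum(data) / len(data)
```

The outlying value receives a small weight and barely moves the robust estimate, whereas it drags the ordinary mean well above the bulk of the data, which is the behavior that motivates M-estimation for multilevel models with outliers.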
Given these results, the reader may reasonably ask which of these
methods should be used. There has not been a great deal of
research comparing these various methods with one another. However,
one study (Finch, 2017) did compare the rank-based, heavy-tailed,
and standard multilevel models with one another, in the presence of
outliers. Results of this simulation work showed that the rank-based
approaches yielded the least biased parameter estimates, and had
smaller standard errors than did the heavy-tailed approaches.
Certainly, those standard error results are echoed in the current
example, where the standard errors for the heavy-tailed approaches
were more than twice the size of those from the rank-based method.
Given these simulation results, it would seem that the rank-based results
may be the best to use when outliers are present, at least until further
empirical work demonstrates otherwise.
x_{ig} = \xi_g + \delta_{ig} \qquad (12.11)
where \xi_g is the latent group-level (macro) score for group g and \delta_{ig} is the deviation of individual i from it.
The relationship between the level-2 dependent variable and the latent
level-2 predictor variable can then be written as
y_g = \beta_0 + \beta_1 \xi_g + \varepsilon_g \qquad (12.12)

where
\beta_0 = intercept.
In the estimation of this model, the best linear unbiased predictors (BLUPs)
of the group means must be used in the regression model, rather than the
standard unadjusted means, in order for the regression coefficient estimates
to be unbiased. Croon and van Veldhoven provide a detailed description of
how this model can be fit as a structural equation model. We will not delve
into these details here, but do refer the interested reader to this article. Of
key import here is that we see the possibility of linking these individual-
level variables with a group-level outcome variable, using the model
described above.
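The BLUP adjustment of the group means can be sketched as shrinkage of each observed group mean toward the grand mean (hypothetical data; in practice tau2 and sigma2 would come from a fitted null multilevel model, and the MicroMacroMultilevel library handles this internally):

```python
def blup_group_means(groups, tau2, sigma2):
    """Shrunken (BLUP-style) group means: each observed group mean is pulled
    toward the grand mean by a reliability factor tau2 / (tau2 + sigma2/n_g)."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    blups = []
    for g in groups:
        n_g = len(g)
        gbar = sum(g) / n_g
        lam = tau2 / (tau2 + sigma2 / n_g)   # shrinkage weight, rises with n_g
        blups.append(grand + lam * (gbar - grand))
    return blups

# Hypothetical groups; the third has only one member, so its mean is
# least reliable and is shrunk hardest toward the grand mean.
groups = [[4.0, 5.0, 6.0], [9.0, 10.0, 11.0], [5.0]]
blups = blup_group_means(groups, tau2=2.0, sigma2=3.0)
```

Using these shrunken means (rather than the raw group means) as predictors is what keeps the level-2 regression coefficients unbiased in the micro-macro model.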
The micro-macro model can be fit in R using the MicroMacroMultilevel
library. The following R code demonstrates first how the variables are
appropriately structured for use with the library, followed by how the
model is fit and subsequently summarized.
micromacro.summary(model.output)
Call:
micromacro.lm( y.prime_time ~ BLUP.prime_time.nomiss.
gemath + BLUP.prime_time.nomiss.gelang + z.mean, ...)
Residuals:
Min 1Q Median 3Q Max
−12.25487 −2.442717 −0.0003431752 2.850811 12.2095
Coefficients:
Estimate Uncorrected S.E. Corrected S.E.
df t
(Intercept) −190.52742279 30.98161003 28.24900466
156 −6.7445712
BLUP.prime_time.nomiss.gemath −0.01019005 0.02838026 0.02303021
156 −0.4424645
BLUP.prime_time.nomiss.gelang 0.01604510 0.02405509 0.01940692
156 0.8267722
z.mean 3.04344765 0.32214229 0.29422655
156 10.3438919
Pr(>|t|) r
(Intercept) 2.824611e−10 0.47514746
BLUP.prime_time.nomiss.gemath 6.587660e−01 0.03540331
BLUP.prime_time.nomiss.gelang 4.096290e−01 0.06605020
z.mean 2.012448e−19 0.63783643
---
Residual standard error: 4.29361 on 156 degrees of freedom
Multiple R-squared: 0.3649649621, Adjusted R-squared: 0.3527527498
F-statistic: 29.88525 on 3 and 156 DF, p-value: 0
From these results, we can see that neither level-1 independent variable was
statistically significantly related to the dependent variable, whereas the
level-2 predictor, average daily attendance, was positively associated with
the school mean cognitive skills index score. The proportion of variance in the
cognitive index explained by the model (the adjusted R-squared) was 0.353, or 35.3%.
library(simr)
x1 <- 1:20
cluster <- letters[1:25]
The preceding code specifies the level-1 (x1) and level-2 (cluster) sample
sizes, and places them in the object sim_data using the expand.grid
function, which is part of base R. We then specify the fixed effects for the
intercept (1) and the slope (0.5) for our particular problem. The variance of
the random intercept term is then set at 0.25 for this example, and the
covariance matrix of the random effects is specified in the object V2. The
standard deviation for the error term is 1. Finally, the makeLmer command
is used to generate our model, which appears below. This output is useful
for checking that everything we think we specified is actually specified
correctly, which is the case here.
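Collecting the steps just described, the full setup can be sketched as follows. The text names the objects sim_data, V2, and sim_model1; the names for the fixed-effects vector (b) and the residual standard deviation (s) are assumptions.

```r
# Combine the level-1 and level-2 sizes defined above:
# 20 individuals in each of 25 clusters = 500 rows
sim_data <- expand.grid(x1 = x1, cluster = cluster)

# Fixed effects: intercept = 1, slope = 0.5
b <- c(1, 0.5)

# Variance of the random intercept term
V2 <- 0.25

# Residual standard deviation
s <- 1

# Construct the lmer model object from these parameters using simr
sim_model1 <- makeLmer(y ~ x1 + (1 | cluster), fixef = b,
                       VarCorr = V2, sigma = s, data = sim_data)
sim_model1
```

Printing sim_model1 displays the specified fixed effects and variance components for checking.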
In order to generate the simulated data and obtain an estimate of the power
for detecting the coefficient of 0.5 given the sample size parameters we have
put in place, we will use the powerSim command from the simr library.
We must specify the model and the number of simulations that we would
like to use. In this case, we request 100 simulations. The results of these
simulations appear below.
powerSim(sim_model1, nsim=100)
Power for predictor ‘x1’, (95% confidence interval):
100.0% (96.38, 100.0)
Time elapsed: 0 h 0 m 20 s
These results indicate that for our sample of 500 individuals nested within
25 clusters, the power to reject the null hypothesis of no relationship
between the independent and dependent variables, when the population
coefficient is 0.5, is estimated at 100%. In other words, we are almost certain
to obtain a statistically significant result with these sample sizes.
How would these power results change if our sample size was only 100
(10 individuals nested within each of 10 clusters)? The following R code will
help us to answer this question.
x1 <- 1:10
cluster <- letters[1:10]
sim_data <- expand.grid(x1 = x1, cluster = cluster)
sim_model1 <- makeLmer(y ~ x1 + (1 | cluster), fixef = c(1, 0.5),
                       VarCorr = 0.25, sigma = 1, data = sim_data)
powerSim(sim_model1, nsim=100)
Time elapsed: 0 h 0 m 13 s
Clearly, even for a sample size of only 100, the power remains at 100% for
the slope fixed effect.
FIGURE 12.4
Power Curve using simr Function.
Rather than simulate the data for one sample size at a time, it is also possible
using simr to simulate the data for multiple sample sizes, and then plot the
results in a power curve. The following R code will take the simulated model
described above (sim_model1), and do just this (Figure 12.4).
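A sketch of such code, assuming the powerCurve function from the simr library (the nsim value here is an assumption; the settings used to produce the published figure may differ):

```r
# Estimate power at a series of level-1 sample sizes and plot the
# resulting curve; larger nsim values give more stable estimates.
pc1 <- powerCurve(sim_model1, along = "x1", nsim = 100)
print(pc1)
plot(pc1)
```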
This curve demonstrates that, given the parameters we have specified for our
model, the power for identifying the coefficient of interest as being different
from 0 will exceed 0.8 when the number of individuals per cluster is four or
more. There are a number of settings for these functions that the user can
adjust to tailor the resulting plot to their particular needs, and we
encourage the interested reader to investigate those options in the software
documentation. We hope that this introduction has provided the tools
necessary to do so.
Advanced Issues in Multilevel Modeling 317
Summary
Our focus in Chapter 12 was on a variety of extensions to multilevel
modeling. We first described how researchers go about identifying potential
outliers in both the single-level and multilevel data contexts. We then examined
multilevel models for use with unusual or difficult data situations, particu-
larly when the normality of errors cannot be assumed, such as in the presence
of outliers. We saw that there are several options available to the researcher in
such instances, including estimators based on heavy-tailed distributions, as
well as rank-based estimators. In practice, we might use multiple such
options and compare their results in order to develop a sense as to the nature
of the model parameter estimates. We concluded the chapter with a
discussion of the micro-macro modeling problem, in which level-1 variables
are to serve as predictors of level-2 outcome variables, and with a review of
how simulation can be used to conduct a power analysis in the multilevel
modeling context. Both of these problems can be addressed using libraries in
R, as we have demonstrated in this chapter.
References
Agresti, A. (2002). Categorical Data Analysis. Hoboken, NJ: John Wiley & Sons.
Aiken, L.S. & West, S.G. (1991). Multiple Regression: Testing and Interpreting
Interactions. Thousand Oaks, CA: Sage.
Anscombe, F.J. (1973). Graphs in Statistical Analysis. American Statistician, 27(1),
17–21.
Bickel, R. (2007). Multilevel Analysis for Applied Research: It’s Just Regression! New
York: Guilford Press.
Bijmolt, T.H.A., Paas, L.J., & Vermunt, J.K. (2004). Country and Consumer
Segmentation: Multi-level Latent Class Analysis of Financial Product
Ownership. International Journal of Research in Marketing, 21, 323–340.
Breslow, N. & Clayton, D.G. (1993). Approximate Inference in Generalized Linear
Mixed Models. Journal of the American Statistical Association, 88, 9–25.
Bryk, A.S. & Raudenbush, S.W. (2002). Hierarchical Linear Models. Newbury Park,
CA: Sage.
Buhlmann, P. & van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods,
Theory and Applications. Berlin, Germany: Springer-Verlag.
Crawley, M.J. (2013). The R Book. West Sussex, UK: John Wiley & Sons, Ltd.
Croon, M.A. & van Veldhoven, M.J. (2007). Predicting Group-Level Outcome
Variables from Variables Measured at the Individual Level: A Latent
Variable Multilevel Model. Psychological Methods, 12(1), 45–57.
de Leeuw, J. & Meijer, E. (2008). Handbook of Multilevel Analysis. New York: Springer.
Dillane, D. (2005). Deletion Diagnostics for the Linear Mixed Model. Ph.D. Thesis,
Trinity College, Dublin.
Falk, C.F., Vogel, T.A., Hammami, S., & Miočević, M. (2023). Multilevel Mediation
Analysis in R: A Comparison of Bootstrap and Bayesian Approaches. Behavior
Research Methods.
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Los Angeles: Sage.
Finch, W.H. (2017). Multilevel Modeling in the Presence of Outliers: A Comparison
of Robust Estimation Methods. Psicologica, 38, 57–92.
Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models. Thousand
Oaks, CA: Sage.
Gavin, M. B. & Hofmann, D. A. (2002). Using Hierarchical Linear Modeling to
Investigate the Moderating Influence of Leadership Climate. The Leadership
Quarterly, 13, 15–33.
Hardin, J. & Hilbe, J. (2003). Generalized Estimating Equations. Boca Raton, FL:
Chapman & Hall/CRC.
Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. London, UK: Chapman
and Hall.
Hofmann, D.A. (2007). Issues in Multilevel Research: Theory Development,
Measurement, and Analysis. In S.G. Rogelberg (Ed.). Handbook of Research
Methods in Industrial and Organizational Psychology (pp. 247–274). Malden, MA:
Blackwell.
Hogg, R.V. & Tanis, E.A. (1996). Probability and Statistical Inference. New York: Prentice
Hall.
Huber, P.J. (1964). Robust Estimation of a Location Parameter. The Annals of
Mathematical Statistics, 35, 73–101.
Huber, P.J. (1967). The Behavior of Maximum Likelihood Estimates under
Nonstandard Conditions. In: Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability, University of California Press.
Berkeley. Vol. 1, 221–223.
Hox, J. (2002). Multilevel Analysis: Techniques and Applications. Mahwah, NJ: Erlbaum.
Iversen, G. (1991). Contextual Analysis. Newbury Park, CA: Sage.
Jaeckel, L.A. (1972). Estimating Regression Coefficients by Minimizing the Dispersion
of Residuals. Annals of Mathematical Statistics, 43, 1449–1458.
Kline, R.B. (2016). Principles and Practice of Structural Equation Modeling (4th ed.).
Guilford Press.
Kloke, J. & McKean, J.W. (2013). Small Sample Properties of JR Estimators. Paper
presented at the annual meeting of the American Statistical Association,
Montreal, QC, August.
Kloke, J.D., McKean, J.W., & Rashid, M. (2009). Rank-Based Estimation and
Associated Inferences for Linear Models with Cluster Correlated Errors.
Journal of the American Statistical Association, 104, 384–390.
Koller, M. (2016). robustlmm: An R Package for Robust Estimation of Linear Mixed-
Effects Models. Journal of Statistical Software, 75.
Kreft, I.G.G. & de Leeuw, J. (1998). Introducing Multilevel Modeling. Thousand Oaks,
CA: Sage.
Kreft, I.G.G., de Leeuw, J., & Aiken, L. (1995). The Effect of Different Forms of Centering
in Hierarchical Linear Models. Multivariate Behavioral Research, 30, 1–22.
Kruschke, J.K. (2011). Doing Bayesian Data Analysis. Amsterdam, Netherlands:
Elsevier.
Lange, K.L., Little, R.J.A., & Taylor, J.M.G. (1989). Robust Statistical Modeling Using
the t Distribution. Journal of the American Statistical Association, 84, 881–896.
Liang, K.-Y. & Zeger, S.L. (1986). Longitudinal Data Analysis Using Generalized
Linear Models. Biometrika, 73, 13–22.
Liu, Q. & Pierce, D.A. (1994). A Note on Gauss-Hermite Quadrature. Biometrika, 81,
624–629.
Lynch, S.M. (2010). Introduction to Applied Bayesian Statistics and Estimation for Social
Scientists. New York: Springer.
MacKinnon, D.P., Lockwood, C.M., Hoffman, J.M., West, S.G., & Sheets, V. (2002). A
Comparison of Methods to Test Mediation and Other Intervening Variable Effects.
Psychological Methods, 7, 83–104.
McDonald, R.P. (1999). Test Theory: A Unified Treatment. Lawrence Erlbaum Associates
Publishers.
McNeish, D. & Kelley, K. (2019). Fixed Effects Models versus Mixed Effects Models
for Clustered Data: Reviewing the Approaches, Disentangling the Differences,
and Making Recommendations. Psychological Methods, 24, 20–35.
McNeish, D. & Wentzel, K.R. (2017). Accommodating Small Sample Sizes in Three-
Level Models When the Third Level is Incidental. Multivariate Behavioral
Research, 52, 200–215.
Mehta, P.D. & Neale, M.C. (2005). People Are Variables Too: Multilevel Structural
Equations Modeling. Psychological Methods, 10, 259–284.
Muthén, B.O. (1989). Latent Variable Modeling in Heterogeneous Populations.
Psychometrika, 54, 557–585.
Muthén, B. & Asparouhov, T. (2009). Multilevel Regression Mixture Analysis.
Journal of the Royal Statistical Society Series A: Statistics in Society, 172, 639–657.
Pan, W. (2001). Akaike’s Information Criterion in Generalized Estimating Equations.
Biometrics, 57, 120–125.
Pinheiro, J., Liu, C., & Wu, Y.N. (2001). Efficient Algorithms for Robust Estimation in
Linear Mixed-Effects Models Using the Multivariate t Distribution. Journal of
Computational and Graphical Statistics, 10, 249–276.
Potthoff, R.F. & Roy, S.N. (1964). A Generalized Multivariate Analysis of Variance Model
Useful Especially for Growth Curve Problems. Biometrika, 51(3–4), 313–326.
R Development Core Team (2012). R: A Language and Environment for Statistical
Computing. Vienna, Austria: R Foundation for Statistical Computing.
Rogers, W.H. & Tukey, J.W. (1972). Understanding Some Long-Tailed Symmetrical
Distributions. Statistica Neerlandica, 26(3), 211–226.
Ryu, E. & West, S.G. (2009). Level-Specific Evaluation of Model Fit in Multilevel
Structural Equation Modeling. Structural Equation Modeling: A Multidisciplinary
Journal, 16, 583–601.
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. New York: Springer.
Schall, R. (1991). Estimation in Generalized Linear Models with Random Effects.
Biometrika, 78, 719–727.
Schelldorfer, J., Buhlmann, P., & van de Geer, S. (2011). Estimation for High-
Dimensional Linear Mixed-Effects Models Using l1-Penalization. Scandinavian
Journal of Statistics, 38, 197–214.
Shrout, P.E. & Bolger, N. (2002). Mediation in Experimental and Nonexperimental
Studies: New Procedures and Recommendations. Psychological Methods, 7,
422–445.
Snijders, T. & Bosker, R. (1999). Multilevel Analysis: An Introduction to Basic and
Advanced Multilevel Modeling, 1st edition. Thousand Oaks, CA: Sage.
Snijders, T. & Bosker, R. (2012). Multilevel Analysis: An Introduction to Basic and
Advanced Multilevel Modeling, 2nd edition. Thousand Oaks, CA: Sage.
Song, P.X.-K., Zhang, P., & Qu, A. (2007). Maximum Likelihood Inference in Robust
Linear Mixed-Effects Models Using Multivariate t Distributions. Statistica
Sinica, 17, 929–943.
Staudenmayer, J., Lake, E.E., & Wand, M.P. (2009). Robustness for General Design
Mixed Models Using the t-Distribution. Statistical Modeling, 9, 235–255.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the
Royal Statistical Society, Series B, 58, 267–288.
Tong, X. & Zhang, Z. (2012). Diagnostics of Robust Growth Curve Modeling Using
Student’s t Distribution. Multivariate Behavioral Research, 47(4), 493–518.
Tu, Y.-K., Gunnell, D., & Gilthorpe, M.S. (2008). Simpson’s Paradox, Lord’s Paradox,
and Suppression Effects are the Same Phenomenon – The Reversal Paradox.
Emerging Themes in Epidemiology, 5(2), 1–9.
Tukey, J.W. (1949). Comparing Individual Means in the Analysis of Variance.
Biometrics, 5(2), 99–114.
Vermunt, J.K. (2003). Multilevel Latent Class Models. Sociological Methodology, 33,
213–239.
Wang, Y. (1998). Mixed Effects Smoothing Spline Analysis of Variance. Journal of the
Royal Statistical Society B, 60(1), 159–174.
Wang, J. & Genton, M.G. (2006). The Multivariate Skew-Slash Distribution. Journal of
Statistical Planning and Inference, 136, 209–220.
Welsh, A.H. & Richardson, A.M. (1997). Approaches to the Robust Estimation of
Mixed Models. In G. Maddala and C.R. Rao (Eds.). Handbook of Statistics (vol.
15, pp. 343–384). Amsterdam: Elsevier Science B.V.
Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics
Bulletin, 1(6), 80–83.
Wolfinger, R. & O'Connell, M. (1993). Generalized Linear Mixed Models: A Pseudo-
Likelihood Approach. Journal of Statistical Computation and Simulation, 48,
233–243.
Wood, S.N. (2006). Generalized Additive Models: An Introduction with R. New York:
Chapman and Hall/CRC.
Wooldridge, J. (2004). Fixed Effects and Related Estimators for Correlated Random
Coefficient and Treatment Effect Panel Data Models. East Lansing: Department of
Economics, Michigan State University.
Yuan, K.-H. & Bentler, P.M. (1998). Structural Equation Modeling with Robust
Covariances. Sociological Methodology, 28, 363–396.
Yuan, K.-H., Bentler, P.M., & Chan, W. (2004). Structural Equation Modeling with
Heavy Tailed Distributions. Psychometrika, 69(3), 421–436.
Zhao, P. & Yu, B. (2006). On Model Selection Consistency of Lasso. Journal of Machine
Learning Research, 7, 2541–2563.