Data with different levels of aggregation

Hi all,

I want to fit a dataset that contains information on both the individual and the group level. The (binary) outcome and some input variables are on the individual level, some other input variables (including the one I’m most interested in, let’s call it “x”) are only available on the group level (as an average). I’m wondering what kind of model would be appropriate in that case.

I was thinking about running a logistic regression and modeling “x” on the individual level as a latent variable. In that case I would use the group average to construct an informative prior for “x”. I guess an alternative approach would be to use a binomial likelihood and aggregate the individual-level data I have. However, this feels like throwing away some important information. I’m not sure at all what’s best practice here and if there’s another solution I haven’t thought of.

Thanks in advance!

I think this will be very model/aggregation specific. As you said, ideally you don’t want to (and need not) discard any data from a different level of detail, but you also don’t want to double count to the extent that the data is redundant across levels.


Thank you! Maybe more specifically: I have disease data, with test results (testing positive/negative for the disease) for each individual. I also know which center people were tested at, their age, their gender, and the postcode area they live in. For the postcode areas I have a bunch of information, such as average annual air pollution. I expect quite a bit of variation between test centers, so I would like to include them in the model (as random effects). I’m most interested in the association between the outcome and the variables which I only have available at the postcode level.

The approach you outline is standard in hierarchical models. If there are group-level covariates, you can use them to model the group-level parameters in the same way that individual-level covariates are used to model the individual-level parameters.

To borrow an example style from Gelman and Hill’s regression book, consider the 50 states in the U.S. We might assume there’s a state-level effect on voting Republican, plus an individual effect. Let’s keep it simple and only model the expectations hierarchically, not the error terms.

\alpha_i \sim \textrm{normal}(\mu, \sigma) for states i \in \{ 0, \ldots, 49 \}.

This is an intercept-only model. But suppose we have a state-level covariate, like log average income; call that x^\text{inc}_i. Then we can have

\alpha_i \sim \textrm{normal}(\mu + \lambda \cdot x^\text{inc}_i, \sigma).

If the distribution’s constrained, say to positive values or probabilities, we can transform just like in a GLM.
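In Stan (which I’ll use for sketches throughout; variable names like x_inc and lambda are just my inventions), the group-level regression is nothing more than a prior on the state effects. Here’s a prior-only sketch, before any individual-level data enters:

```stan
data {
  int<lower=1> S;        // number of states, e.g., 50
  vector[S] x_inc;       // state-level log average income
}
parameters {
  vector[S] alpha;       // latent state-level effects
  real mu;               // global mean
  real lambda;           // coefficient on state income
  real<lower=0> sigma;   // between-state scale
}
model {
  // the group-level covariate models the group-level parameters
  alpha ~ normal(mu + lambda * x_inc, sigma);
}
```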

Hi, thanks a lot! I’m a bit confused, are you including an individual effect in your example? I guess \alpha_i is the outcome for each state? In my case, I have the outcome on the individual level but some of the covariates on the group-level. Sorry if I’m a bit slow here.

No, this is just the state-level effect. Usually in a generalized linear model you’d add in other effects. So if there’s something like personal income z_n^\text{inc} for individuals n, and if we assume the state of individual n is given by i = \text{state}[n] \in \{ 0, \ldots, 49 \}, then the linear predictor becomes \alpha_{\text{state}[n]} + \beta \cdot z_n^\text{inc}. You could also introduce varying effects for individuals, say \gamma_n, and add those to get the linear predictor \alpha_{\text{state}[n]} + \beta \cdot z_n^\text{inc} + \gamma_n. Now you have a mixed-effects model, and if there’s a non-linear link to your observations y_n (e.g., log for Poisson count regressions or log odds for logistic regressions), you have a full non-linear mixed-effects model.
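Putting the two levels together as a sketch (binary outcome with a logit link; note Stan indexes from 1 rather than 0, and the names are still mine):

```stan
data {
  int<lower=1> N;                        // number of individuals
  int<lower=1> S;                        // number of states
  array[N] int<lower=1, upper=S> state;  // state of individual n (1-based)
  vector[N] z_inc;                       // personal income
  vector[S] x_inc;                       // state average income
  array[N] int<lower=0, upper=1> y;      // binary outcome
}
parameters {
  vector[S] alpha;                       // state effects
  real mu;                               // global mean
  real lambda;                           // state-income coefficient
  real beta;                             // personal-income coefficient
  real<lower=0> sigma;                   // between-state scale
}
model {
  alpha ~ normal(mu + lambda * x_inc, sigma);        // state level
  y ~ bernoulli_logit(alpha[state] + beta * z_inc);  // individual level
}
```

I’ve left the individual varying effects \gamma_n out of the sketch; they’d just be one more vector[N] parameter with a normal(0, sigma_gamma) prior, added into the linear predictor.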

Ok, thanks! My problem is that I have (using your example) average income on the state level but I have the outcome and some other predictors on the individual level. I don’t really want to aggregate the data to have the outcome on the state level as well, but instead keep all the individual data and somehow include the aggregated (state) data in the same model. I understand that it may make sense to include the state itself to account for a state-level effect. But it does seem wrong to include the average income of the state in a regression with an outcome on the individual level (or am I wrong here?). So afaik this model wouldn’t be the right way to approach the problem:

y_j \sim \textrm{normal}(\alpha_{\text{state}[j]} + \beta_1 \cdot x^\text{inc}_{\text{state}[j]} + \beta_2 \cdot x^\text{something}_j, \sigma) with j \in \{1, \ldots, n\}

where n is the number of individuals in my data set.

I wouldn’t say there’s a right or wrong, but doing what I suggested is the standard approach to dealing with hierarchical effects. The idea is that there are additive effects from the state someone lives in, so they just go into the regression.

It may help to consider a simple nested hierarchy, like just people nested within states. In that case, you can think of the individual random effects being centered around the state random effects, so that’d be

\gamma_n \sim \textrm{normal}(\alpha_{\text{state}[n]}, \sigma^\gamma)

You’ll see that this is equivalent to using predictor \gamma_n + \alpha_{\text{state}[n]} if you take

\gamma_n \sim \textrm{normal}(0, \sigma^\gamma).
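To spell out the equivalence: if \gamma_n \sim \textrm{normal}(0, \sigma^\gamma) and we define \gamma^*_n = \alpha_{\text{state}[n]} + \gamma_n, then because a normal shifted by a constant is a normal with shifted location, \gamma^*_n \sim \textrm{normal}(\alpha_{\text{state}[n]}, \sigma^\gamma), which is exactly the centered version above with \gamma^*_n playing the role of \gamma_n.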

Models typically get formulated in the latter way because it’s easier to generalize to cross-cutting hierarchies, such as when you care not just about the 50 states, but also age ranges, ethnicity, religion, sex, etc. In the more complex case, things aren’t cleanly nested.

I would highly recommend Gelman and Hill’s multilevel regression book. It’s unfortunately all in R, but it’s a great introduction to how to formulate these kinds of models. If you have a stronger math and stats background, I’d recommend chapter 15 of Gelman et al.’s Bayesian Data Analysis (free from the BDA home page), which covers similar ground much more quickly. For example, equation (15.2) has an example very much like the one I gave above. Also check out section 15.2 for a varying slope and intercept model (what I showed above was only a varying intercept model). Section 16.4 does a GLM version for modeling police stops, where there are also group-level effects.

Thank you (also for the recommendations)! I had a look at BDA and maybe section 9.4 Hierarchical decision analysis for home radon matches my problem best (at least in terms of the data structure).

So I think what I’ll do is include the state (or county, or geographical area) as a random effect, with this effect itself modeled as a linear combination, like so:

y_j \sim \textrm{normal}(\alpha_{\text{state}[j]} + \beta_1 \cdot x^\text{something}_j, \sigma) with j \in \{1, \ldots, n\}

\alpha_\text{state} \sim \textrm{normal}(\gamma_0 + \gamma_1 \cdot x^\text{inc}, \tau)

Does that make sense? Then we just have to make sure not to interpret the estimated effect of income as the effect of an individual’s income on the test result, but rather as the effect of living in a poor or wealthy state.

Still wondering about the latent variable approach, though, and whether it would be appropriate as well. In that case we could actually be estimating the effect of an individual’s income on the test result. But I guess we would need a good prior for the variation of income within each state.

Almost. The x^\text{inc} is presumably a state-level covariate (average state income), so it should be subscripted with a state index. I think you meant

\alpha_s \sim \text{normal}(\gamma_0 + \gamma_1 \cdot x^\text{inc}_s, \tau),

where s \in \{ 0, \ldots, 49 \} is the state index.
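In Stan, that model looks like the following (a sketch with my own names, 1-based state indices, and the continuous outcome as you wrote it; for your binary test results you’d swap the last sampling statement for a bernoulli_logit):

```stan
data {
  int<lower=1> N;                        // individuals
  int<lower=1> S;                        // states
  array[N] int<lower=1, upper=S> state;  // state of individual j
  vector[N] x_something;                 // individual-level predictor
  vector[S] x_inc;                       // state average income
  vector[N] y;                           // individual-level outcome
}
parameters {
  vector[S] alpha;                       // state intercepts
  real gamma0;                           // state-level intercept
  real gamma1;                           // effect of state average income
  real beta1;                            // individual-level coefficient
  real<lower=0> sigma;                   // individual-level scale
  real<lower=0> tau;                     // between-state scale
}
model {
  alpha ~ normal(gamma0 + gamma1 * x_inc, tau);           // state level
  y ~ normal(alpha[state] + beta1 * x_something, sigma);  // individual level
}
```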

If x^\text{something}_j is personal income for person j, then \beta_1 is the personal income effect, adjusted for average state income. Regression coefficients can only ever be interpreted in the context of which other adjustments are or are not being done.

The collective effect is just the linear predictor, where the expected value of the random variable Y_j is modeled as

\mathbb{E}[Y_j] = \gamma_0 + \gamma_1 \cdot x^\text{inc}_{\text{state}[j]} + \beta_1 \cdot x_j^\text{something}.
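The middle step is just taking expectations through the state-level regression: \mathbb{E}[Y_j] = \mathbb{E}[\alpha_{\text{state}[j]}] + \beta_1 \cdot x_j^\text{something}, and \mathbb{E}[\alpha_{\text{state}[j]}] = \gamma_0 + \gamma_1 \cdot x^\text{inc}_{\text{state}[j]}.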

It helps to unfold things this way to check identifiability; this model is good in that there’s only one global intercept.

If you also adjusted for some other group-level effect, you wouldn’t want to include the intercept there too. That’s why it’s more common to see varying effects parameterized without intercepts, e.g.,

\phi_s \sim \textrm{normal}(\gamma_1 \cdot x^\text{inc}_s, \tau)

y_j \sim \textrm{normal}(\delta + \beta_1 \cdot x^\text{something}_j + \phi_s, \sigma)

The relation to the previous parameterization is that \delta = \gamma_0 and \alpha_s = \phi_s + \delta.

What’s that approach, exactly? The coefficients \alpha, \gamma, and \beta are all latent in the sense of being unobserved.

Ok, got it, thanks!

The latent variable approach (not sure if that’s the appropriate term) would be to assume that personal incomes for people living in state A are distributed around the average income for that state. So personal income would be treated as an unobserved variable, and during model fitting this variable would be sampled from a distribution informed by the state’s average income and the within-state variation in income. The sampled values could then be used in the likelihood calculation, where personal income serves as a predictor variable. Would that make sense?
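Something like this, say, in Stan (very much a sketch; z_inc, omega, and the placeholder prior are all things I’m making up):

```stan
data {
  int<lower=1> N;                        // individuals
  int<lower=1> S;                        // states
  array[N] int<lower=1, upper=S> state;  // state of individual n
  vector[S] x_inc;                       // observed state average income
  array[N] int<lower=0, upper=1> y;      // binary test result
}
parameters {
  vector[N] z_inc;                       // latent personal income
  real<lower=0> omega;                   // within-state income spread
  real delta;                            // intercept
  real beta;                             // personal-income coefficient
}
model {
  // latent personal incomes centered on the observed state average;
  // omega needs an informative prior in practice, since it's weakly
  // identified from binary outcomes alone
  omega ~ normal(0, 1);                  // placeholder prior
  z_inc ~ normal(x_inc[state], omega);
  y ~ bernoulli_logit(delta + beta * z_inc);
}
```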

Oops, I forgot a subscript. We have

y_j \sim \textrm{normal}(\delta + \beta_1 \cdot x^\text{something}_j + \phi_{\text{state}[j]}, \sigma)
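As Stan code, the difference from the earlier sketch is just where the intercept lives (again, the names are mine):

```stan
data {
  int<lower=1> N;
  int<lower=1> S;
  array[N] int<lower=1, upper=S> state;
  vector[N] x_something;
  vector[S] x_inc;
  vector[N] y;
}
parameters {
  vector[S] phi;          // state effects, no intercept in their prior
  real delta;             // the single global intercept
  real gamma1;            // effect of state average income
  real beta1;             // individual-level coefficient
  real<lower=0> sigma;
  real<lower=0> tau;
}
model {
  phi ~ normal(gamma1 * x_inc, tau);    // no intercept here
  y ~ normal(delta + beta1 * x_something + phi[state], sigma);
}
```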

Regression coefficients can only be interpreted in light of other regression coefficients. In a linear regression, we have

\mathbb{E}[Y_j \mid x^\text{something}_j, \text{state}_j] = \delta + \beta_1 \cdot x^\text{something}_j + \phi_{\text{state}[j]}

Everything can be read relative to the state effect term \phi_{\text{state}[j]} in the linear predictor. Specifically, the interpretation of the remaining terms \delta + \beta_1 \cdot x^\text{something}_j is just the difference of the expected value of Y_j from the baseline prediction for state \text{state}_j,

\delta + \beta_1 \cdot x^\text{something}_j = \mathbb{E}[Y_j \mid x^\text{something}_j, \text{state}_j] - \phi_{\text{state}[j]}