rvlenth/emmeans

Inquiry about Mathematical Proof for "cells"-weighted EMMs equal to Marginal Means

jialianghua opened this issue · 8 comments

Hi Russell,

I hope you are doing well.

I have read the documentation and vignettes of emmeans, and I found that they state that applying weights = "cells" in a linear model essentially reproduces the raw marginal means of the data.

To delve deeper into this concept, I conducted some empirical tests using various datasets in R, which indeed supported this assertion: the "cells"-weighted EMMs consistently matched the ordinary marginal means in every dataset I tested. While the empirical evidence is compelling, I am keen to understand the theoretical foundation of this equivalence. I lack a clear intuition here: when we apply weights = "cells" in a linear model, we are averaging the linear model's predictions, whereas calculating a raw marginal mean simply involves averaging the data, and I do not see why the two should be equal. Could you provide a mathematical proof or an explanation of how the "cells" weighting of the EMMs in a linear model leads to this outcome? A theoretical perspective would greatly strengthen my understanding of the underlying principles.
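For concreteness, here is a minimal sketch of the kind of check I ran (the choice of mtcars and of the factors cyl and am is just for illustration; any unbalanced factorial dataset shows the same thing):

```r
library(emmeans)

# An unbalanced two-factor layout: mpg by cyl and am
mtc <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Linear model with all interactions
mod <- lm(mpg ~ cyl * am, data = mtc)

# "cells"-weighted EMMs for cyl ...
emmeans(mod, "cyl", weights = "cells")

# ... match the ordinary marginal means of the raw data
with(mtc, tapply(mpg, cyl, mean))
```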

Thank you so much for your contributions to the field and for your assistance with this query. I am a Biostatistical Data Analyst in NYC and your package really helped my work a lot. I look forward to your insightful response.

All the Best,
JH.

Please note that the vignette says

... for a model with all interactions, "cells"-weighted EMMs are the same as the ordinary marginal means of the data.

The proviso that the model include all interactions is key. Under that condition, the predicted values on the reference grid are the cell means $\bar y_{ijk\cdots}$. If we average over the subscript $i$ with weights $n_{ijk\cdots}$, we obtain
$$\frac{\sum_i n_{ijk\cdots}\cdot\bar y_{ijk\cdots}}{\sum_i n_{ijk\cdots}} = \frac{y_{+jk\cdots}}{n_{+jk\cdots}}$$
where a $+$ in a subscript indicates that we have summed over that subscript. This is the same as the raw marginal mean for subscript combination $jk\cdots$. This argument easily generalizes to averaging over other subscripts, or more than one subscript.
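A quick numerical illustration of this identity (a sketch only; mtcars is used simply as a convenient unbalanced two-factor layout):

```r
mtc <- transform(mtcars, cyl = factor(cyl), am = factor(am))

cell.means  <- with(mtc, tapply(mpg, list(cyl, am), mean))   # cell means ybar_ij
cell.counts <- with(mtc, tapply(mpg, list(cyl, am), length)) # cell sizes n_ij

# Averaging the cell means over am with weights n_ij ...
rowSums(cell.means * cell.counts) / rowSums(cell.counts)

# ... reproduces the raw marginal means of mpg for each level of cyl
with(mtc, tapply(mpg, cyl, mean))
```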

OK, thank you! I understand that we are discussing a linear model that includes all interactions now. While I am aware that this inquiry extends beyond the scope of your package, I am wondering whether you can provide a mathematical proof or derivation demonstrating that the predicted values on the reference grid from a linear model with all interactions are equal to the cell means.

The model with all interactions allows for a separate fitted value for each cell to be determined independently of any other cell, whenever there are data in that cell. Suppose that the observations in a particular cell are $y_1,y_2,\ldots,y_n$ and let the fitted value for that cell be $a$. Then the error sum of squares for that cell is $$\sum(y_i-a)^2 = \sum[(y_i-\bar y) + (\bar y - a)]^2 = \sum(y_i-\bar y)^2 + 2(\bar y -a)\sum(y_i - \bar y) + n(\bar y - a)^2$$ The second term is zero because $\sum y_i = n\bar y$, and the third is non-negative. To minimize the sum of squares (i.e., least-squares estimation), we require that the third term be zero, and hence that $a = \bar y$.
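This can also be checked numerically; a sketch (again using mtcars purely for illustration):

```r
mtc <- transform(mtcars, cyl = factor(cyl), am = factor(am))
mod <- lm(mpg ~ cyl * am, data = mtc)   # all interactions

# Average fitted value in each cell (the fitted values within a cell coincide) ...
with(mtc, tapply(fitted(mod), list(cyl, am), mean))

# ... equals the observed cell mean
with(mtc, tapply(mpg, list(cyl, am), mean))
```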

OK, before you ask, suppose there is a covariate (numeric predictor) $x$ in the model that interacts with all the factors. Here we require that the default reference grid is used, i.e., that there is one reference value of $x$, namely $\bar x$. The model fits a separate regression line in each cell. Refer to a standard regression text, where it is shown that each regression line passes through the point of means $(\bar x, \bar y)$. For more than one covariate, extend this argument one covariate at a time.
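As a sketch of that textbook fact, that a least-squares line passes through the point of means $(\bar x, \bar y)$ (one simple regression, chosen arbitrarily):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # a single regression line

# Prediction at x = xbar ...
predict(fit, newdata = data.frame(wt = mean(mtcars$wt)))

# ... equals ybar
mean(mtcars$mpg)
```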

Thank you Russell! So if I am understanding right, can I summarize it this way?:

  1. In some sense, weights = "cells" gives us the estimated marginal expected outcomes for the levels of one factor, without controlling for the other factors. It essentially reproduces the raw marginal means of the data in linear models with all interactions, and it can be used to characterize the population.
  2. weights = "proportional" gives model-based estimated marginal expected outcomes for the levels of one factor, controlling for the other factors. It is useful for quantifying effects in observational studies.
  3. weights = "outer" can also be used for characterizing the population. Since it is equivalent to making the cell weights equal to the expected frequencies in a chi-square test of independence, it is useful for characterizing the population under the assumption that the factors being averaged over are distributed independently of each other. (All three options are compared in the sketch after this list.)
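As an aside, a sketch comparing the three weighting schemes side by side (mtcars again, chosen only for illustration; which scheme is appropriate depends on the question being asked):

```r
library(emmeans)
mtc <- transform(mtcars, cyl = factor(cyl), am = factor(am))
mod <- lm(mpg ~ cyl * am, data = mtc)

emmeans(mod, "cyl", weights = "cells")         # each level's own cell frequencies; matches raw marginal means
emmeans(mod, "cyl", weights = "proportional")  # marginal frequencies of am, applied to every level of cyl
emmeans(mod, "cyl", weights = "outer")         # products of marginal frequencies ("expected" chi-square counts)
```

With a single factor being averaged over, the last two happen to coincide; they can differ when two or more factors are averaged over.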

I think this is resolved, so am closing.

It's resolved. Thank you so much Russ!