jtimonen/lgpr

interactions for zs() or categ() variables

slhogle opened this issue · 4 comments

Hei,

Thanks for the nice software. I'm using lgpr v1.0.4 with rstantools 2.0.0, rstan 2.21.1, and R 4.0.2. I guess this is a feature request.

Is your feature request related to a problem? Please describe.
I have a longitudinal experiment with two treatment factors and I expect there to be an interaction between them. This is what I get when I create the model.

Invalid term with expressions: zs(predation)*zs(pseudomonas_hist). Note that each term can contain at most one categ() or zs() expression.

What is the reason why we can't have an interaction between two zs() or categ() kernels?

Describe alternatives you've considered
My workaround is to manually encode the interaction as a dummy variable and then include it in the model as a zs(dummy_interaction)*gp(time) term. Is that the correct approach?
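In base R the encoding would look something like this (a minimal sketch with made-up data; the column names are just illustrative, not my actual variables):

```r
# Hypothetical example: encode the two-way interaction of two factors as a
# single categorical covariate, then use that one covariate in a zs() term.
df <- data.frame(
  predation        = factor(c("yes", "yes", "no", "no")),
  pseudomonas_hist = factor(c("anc", "evo", "anc", "evo"))
)

# interaction() combines the two factors into one factor whose levels are
# the observed combinations (drop = TRUE drops unused combinations).
df$dummy_interaction <- interaction(df$predation, df$pseudomonas_hist, drop = TRUE)
nlevels(df$dummy_interaction)  # 2 x 2 = 4 combined levels

# The model formula would then contain ... + zs(dummy_interaction)*gp(time)
```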

Describe the solution you'd like
Native support for categorical interactions, if that is possible. Otherwise, add a short section to the documentation or tutorials that explains why zs() interactions are not allowed and what the correct approach is when your experiment has a categorical interaction.

Thanks for your help!
-shane

Hei.

Thank you, good question. Support for interactions like that could be added, but then the relevances of the individual variables would no longer be directly interpretable. If we had zs(A)*zs(B)*gp(time) in the model, it would be difficult to say what the relevance of A alone is, whereas with zs(A)*gp(time) + zs(B)*gp(time) we can say that the first term describes the time-dependent effect of A and the second term describes the time-dependent effect of B.

Let's assume the number of levels of A is na and that of B is nb. The dummy interaction variable then has na*nb distinct values, and zs(dummy)*gp(time) will be a component where each of these na*nb groups has its own time trend, with the trends summing to zero at all times. Whether or not this is "correct" depends on whether that is what you want, but it sounds like it might be a good idea.

I am not sure whether zs(A)*zs(B)*gp(time) would be logically equivalent to zs(dummy)*gp(time), but I'm quite sure that categ(A)*categ(B)*gp(time) would be logically equivalent to categ(dummy)*gp(time). So if you want the interaction, I'd say the dummy variable is the way to go. I will need to document this somewhere.

-Juho

Thanks for the reply. This is really helpful. I have a lot of longitudinal experimental data that I was trying to model with GAMMs (mgcv::gamm), but it was kind of a nightmare. In particular, the variance-masking kernel for non-stationary effects is super nice: I found rapid-onset events (what you call "diseaseAge") basically impossible to model, but lgpr handles them really nicely!

One thing you could add to the discussion in your preprint, to give it added impact, is that your tool would probably have really broad appeal in fields like experimental ecology, not just medicine, because longitudinal datasets are very common in ecology. Lots of people rely on tools like mixed-effects models, GAMs, or generalized estimating equations for longitudinal data, and those work OK, but more tools are needed. I think lgpr could fill a big gap... at least I have had a lot of success with it after using it for just a couple of days.

Anyway, I'm hoping you can answer a couple more questions about the zero-sum versus category kernels.

In your preprint, under section "1.2.2 A zero-sum kernel for categorical covariates", you write that the zero-sum kernel essentially keeps the effects of categories isolated without biasing the shared effects. This makes sense to me, but what would then be a situation where the categ() kernel is useful? Would you use the categ() kernel when you do not necessarily expect a time dependence?

This may be relevant in my particular case:

[Screenshot: model prediction overlaid on the data]

So lgpr does a very nice job of capturing the overall picture here, and the model prediction matches the data well. But basically, for treatment conditions where pred and evo are both true, there is a positive offset in the sum signal. The shape of the summed signal is the same whether I code pred and evo as categorical or zero-sum, but the term relevances differ.

categ(pred)*gp(time) + categ(evo)*gp(time) + categ(predTRUE_evoTRUE)*gp(time)

which for feature relevances gives

feature relevance selected
gp(day) 0.324909284 TRUE
zs(id)*gp(day) 0.005740937 FALSE
categ(pred)*gp(day) 0.092646133 TRUE
categ(evo)*gp(day) 0.011014105 FALSE
categ(pred_evo)*gp(day) 0.102607446 TRUE
gp_vm(recoveryonset) 0.352578479 TRUE
noise 0.110503616 TRUE

I also ran it with the zero-sum kernel:

zs(pred)*gp(time) + zs(evo)*gp(time) + zs(predTRUE_evoTRUE)*gp(time)

which for feature relevances gives

feature relevance selected
gp(day) 0.542564175 TRUE
zs(id)*gp(day) 0.003501196 FALSE
zs(pred)*gp(day) 0.023430648 FALSE
zs(evo)*gp(day) 0.000871083 FALSE
zs(pred_evo)*gp(day) 0.028558470 TRUE
gp_vm(recoveryonset) 0.286840186 TRUE
noise 0.114234242 TRUE

So the set of selected components doesn't change much, but the calculated relevances do change a lot. In particular, more relevance seems to be assigned to the shared time component when a zero-sum kernel is used for the interaction term.

Can you explain why this might be the case?

-shane

Thank you for the comments and happy to see that you have found lgpr useful. It would definitely be nice to find a way to advertise this also to ecology people etc.

> ... but what would then be a situation where the categ() kernel is useful? Would you use the categ() kernel when you do not necessarily expect a time dependence?

> So the set of selected components doesn't change much, but the calculated relevances do change a lot. In particular, more relevance seems to be assigned to the shared time component when a zero-sum kernel is used for the interaction term. Can you explain why this might be the case?

This is very much expected behaviour. If you include categ(A)*gp(time) terms in a model, they will also absorb some of the shared time effect (or all of it, if the model doesn't have a shared time-effect term). That is why we implemented the zero-sum kernel and started using zs(A)*gp(time) terms, which cannot absorb a time trend that is common to all categories. If categ(A) is used on its own, without multiplying by gp(), then it is an offset term like a standalone zs(A), except that categ() does not force the offsets across the categories to sum to 0.
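To make the difference concrete, here is a small base-R sketch of the two kernels. It assumes the definitions from the preprint: categ() acts as a delta kernel, while zs() gives 1 when two categories match and -1/(M-1) otherwise, where M is the number of levels.

```r
# categ(): delta kernel; zs(): zero-sum kernel. With the -1/(M - 1)
# off-diagonal value, each row of the zs kernel matrix over the M levels
# sums to zero, so the component cannot absorb a trend shared by all
# categories, whereas the categ kernel matrix has no such constraint.
categ_kernel <- function(a, b) as.numeric(a == b)
zs_kernel    <- function(a, b, M) ifelse(a == b, 1, -1 / (M - 1))

lv <- 1:3  # three category levels
K_categ <- outer(lv, lv, categ_kernel)
K_zs    <- outer(lv, lv, zs_kernel, M = length(lv))

rowSums(K_zs)     # all zero: offsets across categories sum to 0
rowSums(K_categ)  # all one: no zero-sum constraint
```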

There isn't really any reason to use categ() at all if the relevances are computed the way lgpr currently computes them; I rather recommend using zs(). That's why categ() is not in the manuscript: it is included in the package only to make the software more general. Perhaps someone wants to create their own model-selection method that fits several different lgpr models, in which case the relevance of a categorical covariate A could be assessed by comparing how much better a model with gp(time) + gp(time)*categ(A) is than one with just gp(time).
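As an illustrative sketch of that comparison (not run here, since fitting compiles a Stan model; the data frame df and columns y, time, and A are hypothetical):

```r
# Fit a model with only a shared time effect, and one that additionally
# lets each level of A have its own time effect via categ(A)*gp(time).
fit0 <- lgp(y ~ gp(time), data = df)
fit1 <- lgp(y ~ gp(time) + categ(A) * gp(time), data = df)
# One would then compare how much better fit1 explains the data than fit0.
```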

-Juho

Thanks! This is super helpful. zs() with a dummy variable for interactions seems like the way to go.