JenniNiku/gllvm

Output memory consumption

Closed this issue · 7 comments

Hi!

Firstly, thanks for your work and this package :)

I have a memory consumption issue when I fit a model. In this toy example, my dataset is composed of 92 species sampled at 172 sites, with 7 environmental variables. I also include 20 latent variables.

Once the model has been fitted, the size of the output is approximately 42 GB. I've fitted it with both NB and binomial distributions, with the same results.

So I dug a bit into it, and this is what I found. The source of the problem is out$terms and out$Hess. I had a quick look into your code and saw that you assign these two components to the output at the very end. Are they mandatory, or could gllvm lose some weight?


Thanks for trying out the package! Including the Hessian in the final output is important so that calculations don't have to be repeated in other functions that require the covariances between the parameters. However, @JenniNiku, perhaps it would be possible to include an argument along the lines of hess.return = TRUE to provide the option of not returning the Hessian?
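In the meantime, a manual workaround along these lines might help with storage (a rough, untested sketch, not part of the gllvm API; without the stored Hessian, any downstream calculation that needs the parameter covariances would have to be redone):

mod_light <- mod                      # 'mod' is assumed to be the fitted gllvm object
mod_light$Hess <- NULL                # drop the stored Hessian before saving
saveRDS(mod_light, "mod_light.rds")   # considerably smaller on disk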

@clementviolet, the Hessian is quite big when including 20 latent variables, 92 species and 172 sites. Are you sure you need 20 latent variables in the first place, in addition to 7 covariates? Generally, 1-10 latent variables are considered more than sufficient in community analysis, as there is rarely enough support in a dataset for that many latent variables (and there is little theory for communities driven by more complex ecological gradients than that). Here, you have something in the vicinity of 1650 parameters on the latent variables, in addition to intercepts, slopes and overdispersion parameters (for NB), totalling around 2 + 7 + 1650/92, i.e. about 27 parameters per species on average. Have you tried fitting models with fewer latent variables and compared the fits? Naturally the size of the Hessian is big with that number of parameters (in addition to the mean, variance and covariance parameters per row of the response matrix, if fitted with method = "VA").
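To spell that bookkeeping out in R (rough numbers only, assuming an NB model with one intercept and one dispersion parameter per species):

p <- 92; d <- 20; k <- 7             # species, latent variables, covariates
lv_pars <- p * d - d * (d - 1) / 2   # free latent-variable loadings: 1650
2 + k + lv_pars / p                  # per species: intercept + dispersion + 7 slopes + loadings, about 26.9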

If you have any other questions, feel free to ask.

I know that there are too many parameters for this model; I just wanted to highlight this behaviour.

OK, I looked at the code too quickly and didn't see that the Hessian is used by other functions.

But I was more concerned about the terms object. It seems that R keeps a copy of the whole formula environment in it (I don't really understand this behaviour of R). If this is expected behaviour, I'm OK with it (and I'll close this issue soon).
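To illustrate what I mean, here is a minimal sketch in plain R (nothing gllvm-specific): the terms object keeps a reference to the environment its formula was created in, and everything in that environment gets serialized together with it.

f <- function() {
  X <- matrix(rnorm(1e6), ncol = 10)   # a large object living only in this function's environment
  y <- rnorm(1e5)
  terms(y ~ X)                         # the terms object keeps a reference to that environment
}
tt <- f()
ls(attr(tt, ".Environment"))           # "X" and "y" are still reachable through the terms object
length(serialize(tt, NULL)) / 1024^2   # roughly 8-9 MB, because X and y get serialized too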

btw, how do you determine the number of parameters to be estimated per latent variable?

Yes, I'm a bit surprised about the size of terms too. I don't have any clue why it does that, so we will have to wait for Jenni to comment.

Something like this works: p*d - d*(d-1)/2, where p is the number of species and d the number of latent variables. The quick-and-dirty way to get this in R is sum(lower.tri(matrix(NA, nrow = p, ncol = d), diag = TRUE)), i.e. counting the free entries of the p x d loading matrix (the upper triangle is constrained for identifiability, hence the subtraction of d*(d-1)/2).
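For example, with the numbers from earlier in this thread:

p <- 92; d <- 20
p * d - d * (d - 1) / 2                                       # 1650
sum(lower.tri(matrix(NA, nrow = p, ncol = d), diag = TRUE))   # 1650, same count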

I've run another model with only one latent variable, as you suggested, using a binomial distribution with the same dataset. I wanted to see if out$terms was still consuming that much memory. The model size has now decreased to 125 MB, but most of the memory is still consumed by out$terms. In both cases, out$terms is twice as big as the Hessian matrix.


Yes, I had a quick look at terms now and it seems that you are right, it copies the function environment. I.e. attributes(mod$terms)$.Environment seems to include all input to the gllvm function.
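Note that object.size() does not follow environments, so the captured environment only really shows up once the component is serialized; a quick check along these lines (sketch) makes the difference visible:

format(object.size(mod$terms), units = "MB")   # modest: the captured environment is not counted
length(serialize(mod$terms, NULL)) / 1024^2    # much larger: the environment's contents are included (MB)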

I looked more deeply at the environment and quickly saw some redundancy. For instance, the y and X matrices are stored at least four times:

ls.str(attributes(mod$terms)$.Environment)

It should be possible to trim some weight from this environment. For large models, this could increase their speed.
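For what it's worth, this is the kind of trimming I have in mind (just an idea, not something gllvm supports, and it could well break functions that later look up variables through this environment, so I'd keep the original object around):

mod_small <- mod
attr(mod_small$terms, ".Environment") <- new.env(parent = emptyenv())   # drop the captured data
saveRDS(mod_small, "mod_small.rds")   # other components may still carry the same environment, though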

I don't think it should affect the computation time (speed) in any way, as no calculations are performed with terms. The only thing it should affect when reduced is the size of the model object.