Rayhane-mamah/Efficient-VDVAE

Compact models, configs, and checkpoints

msbauer opened this issue · 4 comments

Hi!

I've been looking further into the code to understand the hyperparameter choices for the different configurations discussed in the paper. Do I understand correctly that C1 corresponds to the ..._compact configs whereas C2 corresponds to the ..._baseline configs? I've been trying to understand the C1 configs with varying width in more detail, but noticed that the _compact config files seem to specify models with fewer parameters than stated in Table 3 of the paper. For example, when I compute the rough number of parameters for imagenet32_baseline.cfg, I get the ~156M parameters specified for model C2 in the table. However, imagenet32_compact.cfg yields ~20M parameters in total, while Table 3 specifies ~52M parameters for C1.
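(For reference, the count above is just a naive sum over trainable tensors, along the lines of the sketch below; `build_model_from_cfg` is only a placeholder for however the model actually gets constructed from an hparams.cfg.)

```python
# Rough parameter count for a PyTorch module; `build_model_from_cfg` below
# is only a placeholder for however the model is instantiated from a config.
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Total number of trainable parameters in the module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# model = count target, e.g. (placeholder):
# model = build_model_from_cfg("imagenet32_compact.cfg")
# print(f"{count_trainable_params(model) / 1e6:.1f}M parameters")
```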

Therefore, I just wanted to ask how the configs you provide map to the models presented in the paper and whether you could provide the configs for C1 and/or a checkpoint file for one of the compact models?

Thank you very much, I really appreciate it!
Best wishes,
Matthias

Hello @msbauer, sorry for the late reply!

I believe this question can be answered as follows:

  • It seems we did a poor job of stating that the hparams.cfg files under egs are to be regarded only as examples to guide the user. They are intentionally written to be "simple to interpret" compared to, for example, the hparams.cfg files you can find in the pre-trained checkpoints. The "compact" hparams we provide are example configurations that can be used on low-end machines (where large amounts of memory are not available). They are only related to C1 in that they use incremental filter sizes, but not necessarily the same filter sizes as C1 (see the back-of-the-envelope sketch after this list for why thinner filters shrink the parameter count so quickly).
  • The C1 configuration files you're trying to replicate can be found here.
  • Ultimately, we want the user to be free in their choice of model configuration. We wrote a model class that can adapt to different types of HVAEs (symmetrical like NVAE, asymmetrical like VDVAE, or any combination of the two). We also ran ablations in our paper to reflect the effects of different techniques on compute (including architecture/optimization parameter design), stabilized prior VDVAE work, and made experimentation more accessible. C1 and C2 are merely some of the models we tried in our experiments that we thought would make for a good comparison with prior work. The C1 models were picked such that the negative ELBO wouldn't be too different from C2 while saving memory. Models more compact than C2 are definitely trainable (examples of such models are the compact configs in egs).
  • We envision people getting inspired by the egs we provide and designing whatever architecture fits their needs (whether from an NLL/compute trade-off perspective or from their application perspective).
  • Finally, we are well aware that there are two minor typos in the number of parameters in Table 3 (number of params for ImageNet 32x32 C2 and CIFAR-10 C2). They have been corrected and will appear in the next arXiv preprint version. The "v2" Table 3 looks something like this:

[image: updated Table 3 with corrected parameter counts]
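As a quick back-of-the-envelope illustration of the filter-size point above (this is just standard convolution arithmetic for a single 3x3 layer, not a figure from the paper): widths enter the parameter count quadratically, so halving the width of a layer roughly quarters its parameters.

```python
# Illustrative only: parameter count of a single 3x3 conv with equal
# input/output width W is 9*W**2 (weights) + W (biases).
def conv3x3_params(width: int) -> int:
    return 9 * width * width + width

print(conv3x3_params(256))  # 590,080
print(conv3x3_params(128))  # 147,584 -> roughly 4x fewer after halving the width
```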

Separate but maybe relevant notes:

  • As is common knowledge (maybe?), the number of parameters alone wasn't a good indicator of final performance in our experiments; we definitely had small models with better NLL than bigger models. (A trivial example of "useless parameters" is the added bias at every resolution, which we kept only to conform to prior work.) It might be worth keeping that in mind when experimenting with Efficient-VDVAE.
  • We mainly added the "# params" column to Table 3 to highlight that, when using the same filter sizes as prior work, our model has more trainable parameters (due to the modifications of Appendix A).
  • There isn't necessarily any "logic" to the incremental filter-size scheme aside from "keep high resolutions thin to save memory". The rest is just kept for experimentation and empirical studies; a toy sketch of such a scheme is shown below.
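To make the "incremental" idea concrete, here is a toy sketch (not the actual config logic of the repo; the resolutions and widths are made up for illustration):

```python
# Toy illustration of an incremental width scheme: high resolutions stay
# thin to save activation memory, and layers widen as resolution decreases.
def incremental_widths(resolutions, base_width=32, max_width=512):
    """Assign a filter width per spatial resolution, doubling as resolution halves."""
    widths = {}
    width = base_width
    for res in sorted(resolutions, reverse=True):
        widths[res] = min(width, max_width)
        width *= 2
    return widths

print(incremental_widths([32, 16, 8, 4, 1]))
# {32: 32, 16: 64, 8: 128, 4: 256, 1: 512}
```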

Hope this covers your question, and I hope the answer is clear :)
I will close the issue since I believe the main point was addressed; feel free to re-open it if I missed anything important.

Thank you again for your interest in our work. We appreciate that!
As usual, please feel free to ask or raise issues any time you see them!
Rayhane.

Dear @Rayhane-mamah,

Thank you (again :)) very much for the quick and very detailed response, including the updated Table 3 as well as the configurations for the C1 models. I really appreciate it!

I was mainly looking (and asking) for the configs because of what's stated in the caption of Table 3: "More detailed model hyper-parameters are available in Table 6 and the source code." As Table 6 doesn't specify the widths, I was expecting the hyper-parameters for the C1 and C2 models to be in the egs folder (to get a better intuition for the kind of model sizes/widths that would be necessary to achieve a certain performance). I think they are also good starting points for practitioners who want to explore extensions and don't want to spend ~a week exploring hyper-parameters :-).

I also really appreciate and agree with the sentiment in your notes (as well as in the paper) that the optimal settings are dataset-dependent and that "more parameters = better" definitely isn't true.

Best wishes
Matthias

Hello again @msbauer

Your note about Table 6 is an extremely good point! We somehow didn't consider this type of situation. This is definitely a consequence of not having enough space in the paper: we simply forgot that it would be useful to provide a detailed description of the C1 model widths afterwards. Now that we are aware of this issue, we will add more details about the C1 configuration to the codebase.

Thank you very much for bringing that to our attention!
Rayhane.

Hello,

We have added the C1 configs, since their width information was missing from both the paper and the codebase.
For completeness' sake, here's what the "v2" Table 6 will look like in the next paper update:
[image: updated Table 6 including the C1 width details]

Thank you again for pointing this out.
Rayhane.