Compact models, configs, and checkpoints
msbauer opened this issue · 4 comments
Hi!
I've been looking further into the code to understand the hyperparameter choices for the different configurations discussed in the paper. Do I understand correctly that `C1` corresponds to the `..._compact` configs, whereas `C2` corresponds to the `..._baseline` configs? I've been trying to understand the `C1` configs with varying width in more detail, but noticed that the `_compact` config files seem to specify models with fewer parameters than stated in Table 3 of the paper. For example, I computed the rough number of parameters for `imagenet32_baseline.cfg` and get the 156M parameters specified for model `C2` in the table. However, `imagenet32_compact.cfg` yields ~20M parameters in total, while Table 3 specifies ~52M parameters for `C1`.
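For reference, here is roughly how I tallied the counts (a sketch only — `conv_params` and the width list below are illustrative stand-ins, not the actual values built from the config):

```python
def conv_params(in_ch, out_ch, kernel=3):
    """Parameters of one conv layer: weights (out*in*k*k) plus one bias per output channel."""
    return out_ch * (in_ch * kernel * kernel + 1)

# Hypothetical width schedule, NOT the real imagenet32 config values.
widths = [3, 32, 64, 128]
total = sum(conv_params(i, o) for i, o in zip(widths, widths[1:]))
print(total)  # total parameters across the chain of conv layers
```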
Therefore, I just wanted to ask how the configs you provide map to the models presented in the paper, and whether you could provide the configs for `C1` and/or a checkpoint file for one of the compact models?
Thank you very much, I really appreciate it!
Best wishes,
Matthias
Hello @msbauer, sorry for the late reply!
I believe this question can be answered as follows:
- It seems we did a poor job of stating that the `hparams.cfg` files under `egs` are to be regarded only as examples to guide the user. They are intentionally written to be "simple to interpret" compared to the `hparams.cfg` you can find in the pre-trained checkpoints, for example. The "compact" hparams we provide are example configurations that can be used on low-end machines (where large amounts of memory are not available). They are only related to `C1` in that they use incremental filter sizes, but not necessarily the same filter sizes as `C1`.
- The `C1` configuration files you're trying to replicate can be found here.
- Ultimately, we wish the user to be free in their choice of model configuration. We wrote a model class that can adapt to different types of HVAEs (symmetrical like NVAE, asymmetrical like VDVAE, or any combination of the two). We also ran ablations in our paper to reflect the effects of different techniques on compute (including architecture/optimization parameter design). We also stabilized prior VDVAE work and made experimentation more accessible. The `C1` and `C2` models we show in our work are merely some of the models we tried in our experimentation that we thought would make for a good comparison with prior work. `C1` was picked such that its negative ELBO wouldn't be too different from `C2`'s while saving memory. Models more compact than `C2` are definitely trainable (examples of such models are the `compact` configs in `egs`).
- We envision people getting inspired by the `egs` we provide and designing whatever architecture fits their needs (from an NLL/compute trade-off perspective, or from their application perspective).
- Finally, we are well aware that there are two minor typos in the number of parameters in Table 3 (number of params for ImageNet 32x32 `C2` and CIFAR-10 `C2`). They have been corrected and will appear in the next arXiv preprint version. The "v2" Table 3 looks something like this:
Separate but maybe relevant notes:
- As is common knowledge (maybe?), the number of parameters alone wasn't a good indicator of final performance in our experiments; we definitely had small models with better NLL than bigger models. (A trivial example of "useless parameters" is the bias added at every resolution, which we kept only to conform to prior work.) It might be worth keeping that in mind when experimenting with Efficient-VDVAE.
- We mainly added the "# params" column in Table 3 to highlight that, when using the same filter size as prior work, our model has more trainable parameters (due to the modifications of Appendix A).
- There isn't necessarily any "logic" in picking the filter size incremental scheme aside from "make high resolution thin to save up on memory". The rest is just kept for experimentation and empirical studies.
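For illustration only, a toy sketch of an incremental scheme in that spirit (the base width, resolutions, and shrink factor here are hypothetical, not taken from any of our configs):

```python
def width_schedule(base_width, resolutions, shrink=2):
    """Hypothetical incremental filter sizes: divide the width by `shrink`
    each time the spatial resolution doubles, so high-res layers stay thin
    and save memory where feature maps are largest."""
    widths = {}
    w = base_width
    for res in sorted(resolutions):
        widths[res] = max(w, 8)  # floor so layers never vanish
        w //= shrink
    return widths

print(width_schedule(256, [4, 8, 16, 32]))  # thinner filters at higher resolutions
```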
Hope this covers your question and I hope the answer is clear :)
I will close the issue since I believe the main point was addressed, feel free to re-open it if I missed anything important.
Thank you again for your interest in our work. We appreciate that!
As usual, please feel free to ask or raise issues any time you see them!
Rayhane.
Dear @Rayhane-mamah,
Thank you (again :)) very much for the quick and very detailed response, including the updated Table 3 as well as the configurations for the `C1` models. I really appreciate it!
I was mainly looking (and asking) for the configs as you specified in the caption of Table 3: "More detailed model hyper-parameters are available in Table 6 and the source code". As Table 6 doesn't specify the `width`s, I was expecting the hyper-parameters for the `C1` and `C2` models to be in the `egs` folder (to get a better intuition for the kind of model sizes/widths that would be necessary to achieve a certain performance). I think they are also good starting points for practitioners who want to explore extensions and don't want to spend ~a week exploring hyper-parameters :-).
I also really appreciate and agree with the sentiment in your notes (as well as in the paper) that the optimal settings are dataset dependent and that "more parameters = better" definitely isn't true.
Best wishes,
Matthias
Hello again @msbauer
Your note about Table 6 is an extremely good point! We somehow didn't consider this type of situation. This is definitely a consequence of not having enough space in the paper; we simply forgot that it would be useful to provide a detailed description of the `C1` model widths afterwards. Now that we are aware of this issue, we will add more details about the `C1` configuration to the codebase.
Thank you very much for bringing that to our attention!
Rayhane.