Rayhane-mamah/Efficient-VDVAE

Mismatch between config and logs (for Cifar10)?

Closed this issue · 3 comments

Thank you for providing the code! I was looking at your configs for more details about the architectures used and noticed a mismatch with the logs provided for CIFAR-10 (I haven't checked any of the other experiments). In the config you specify the number of steps as 800k, but the TensorBoard logs seem to show 1.1M steps. Could you update the config to match the logs, or clarify what I'm misunderstanding? Thank you very much!

Hello @msbauer
Thank you for reaching out!

To make sure that I understand the issue:
When downloading the (JAX) pre-trained models and looking at the TensorBoard (TB) logs + saved config (hparams.cfg), you notice that the argument total_train_steps doesn't match between hparams.cfg and what's displayed on TensorBoard:

  • hparams.cfg sets total_train_steps to 800k.
  • The TensorBoard curves show the model was trained for ~1.1M updates.
  • For completeness: the CIFAR-10 config in the egs folder also shows 800k updates (a quick way to check this locally is sketched right after this list).
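For anyone who wants to reproduce the comparison locally, here is a minimal sketch (not from the repo) that reads the configured step count and the last step recorded in the TensorBoard events. The paths, and the assumption that hparams.cfg is INI-style with total_train_steps under a [train] section, are guesses about the downloaded checkpoint layout; adjust them to whatever your download actually contains.

```python
# Hypothetical consistency check: hparams.cfg vs. the TensorBoard event files.
# The paths and the "[train] / total_train_steps" location are assumptions
# about the downloaded checkpoint layout, not guaranteed by the repo.
import configparser
import glob

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

CKPT_DIR = "path/to/downloaded/cifar10_checkpoint"  # placeholder path

# 1) Read the configured number of training updates from the saved config.
cfg = configparser.ConfigParser()
cfg.read(f"{CKPT_DIR}/hparams.cfg")
configured_steps = cfg.getint("train", "total_train_steps")  # assumed section/key

# 2) Find the largest global step that appears in any scalar logged to TB.
last_logged_step = 0
for event_file in glob.glob(f"{CKPT_DIR}/**/events.out.tfevents.*", recursive=True):
    acc = EventAccumulator(event_file)
    acc.Reload()
    for tag in acc.Tags()["scalars"]:
        scalars = acc.Scalars(tag)
        if scalars:
            last_logged_step = max(last_logged_step, scalars[-1].step)

print(f"hparams.cfg total_train_steps : {configured_steps}")
print(f"last step seen in the TB logs : {last_logged_step}")
if last_logged_step > configured_steps:
    print("Mismatch: the logs go past the configured number of updates.")
```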

The short answer:

  • You are right: this is a parameter we overlooked when cleaning up the codebase and when adapting the hparams.cfg files of models trained on the old codebase (before the cleanup for the open-source release). We will correct the released pre-trained model configs with the correct values from our old checkpoints.

The long answer:

  • The CIFAR-10 checkpoints you are looking at (like the other JAX checkpoints) were trained for longer than the example configs in egs specify, for experimentation and verification purposes.
  • During the experimentation phase of the project, we consistently evaluated the models at several points during training. Once a model seemingly converged (for CIFAR-10, at 800k steps), we let it train for an extra period to make sure we weren't killing it prematurely (for CIFAR-10, until ~1.1M steps).
  • The parameters in egs are sufficient to reproduce all results presented in the paper/checkpoints (up to seed differences), and the extra training updates visible on TensorBoard are unnecessary: for CIFAR-10, training for 800k or 1.1M steps gives the same Negative ELBO, as seen from the flat Negative ELBO curve between 800k and 1.1M updates (a small sketch for reading this off the event files follows this list).

[Figure: TensorBoard Negative ELBO curve for CIFAR-10, flat between 800k and ~1.1M updates]

  • While open-sourcing the codebase, we cleaned up some irrelevant code (as is usual practice). That cleanup included removing some irrelevant parameters from hparams.cfg.
  • Any released pre-trained models whose wall time (on TensorBoard) shows dates prior to March 25th 2022 were trained during the experimentation phase of the project. To release these models, we "salvaged" their pre-trained weights and adapted their hparams.cfg to the cleaned codebase.
  • With all of that said, a few hyper-parameters may have gone unnoticed and show a mismatch between the TB logs and hparams.cfg, like the total number of training updates.
  • Any released pre-trained models whose wall time (on TB) shows dates after March 25th 2022 (for example, the PyTorch pre-trained models) were trained with the exact config files from egs, so they shouldn't have any discrepancies.
  • "Salvaged" pre-trained models are mainly released to speed up the release of checkpoints. The other pre-trained models are intentionally re-trained for the checkpoint release, to serve as final reproducibility tests.
  • We highly recommend training with the egs parameters, as they are the most computationally efficient settings that reach the same Negative ELBO.
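To read that flat tail off the logs directly (rather than only eyeballing the plot), here is a small hypothetical sketch. The log directory and the Negative ELBO tag name are guesses, which is why the script prints the available scalar tags first; substitute whatever tag your download actually uses.

```python
# Hypothetical sketch: confirm the Negative ELBO barely moves after 800k updates.
# LOG_DIR and NELBO_TAG are placeholders, not the repo's actual names.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

LOG_DIR = "path/to/downloaded/cifar10_checkpoint/tensorboard"  # placeholder
NELBO_TAG = "val/nelbo"                                        # placeholder tag

# A size_guidance of 0 keeps every logged scalar event instead of subsampling.
acc = EventAccumulator(LOG_DIR, size_guidance={"scalars": 0})
acc.Reload()
print("available scalar tags:", acc.Tags()["scalars"])

events = acc.Scalars(NELBO_TAG)
tail = [e.value for e in events if e.step >= 800_000]
print(f"NELBO at ~800k steps         : {tail[0]:.4f}")
print(f"NELBO at the last logged step: {tail[-1]:.4f}")
print(f"drift over the extra updates : {abs(tail[-1] - tail[0]):.4f}")
```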

Your concern is absolutely justified, so thank you for catching this and reporting it. We will update the pre-trained model configs with more carefully verified train sections.

Please let us know if the answer is unclear or if there are any more issues. :)
We will close the issue after updating the pre-trained models.

Hello again,

The pre-trained model configs have been updated to conform to the rest of the logs (by restoring the forgotten parameters from the original, pre-cleanup saved hparams). Re-downloaded pre-trained models (all datasets) shouldn't have the same problem. We quickly went over all config parameters for consistency, and they should now match the TB logs correctly.

Thank you again for pointing out the issue. Such inconsistencies can be red flags for incorrectly uploaded models. :)
I am closing the issue since it seems the original bug is fixed. Feel free to re-open it if there are any similar problems.

Thanks,
Rayhane.

Dear Rayhane!

Thank you so much for the extremely detailed and super fast response. I really appreciate it!

Best wishes,
Matthias