Query about the reproducibility of the Motion Capture dataset in "Scalable Gradients..." (Li et al., 2020)
nghiahhnguyen opened this issue · 11 comments
I am trying to reproduce the results on the CMU Motion Capture dataset. As references, I am using examples/latent_sde_lorenz.py, the paper, and the preprocessed dataset linked in ODE2VAE's repo.
My current runs show large discrepancies from the results in the paper, so I want to check whether there are any training details I'm missing. (I am not very familiar with Bayesian modeling, so I try to follow the hyperparameters in the paper as closely as possible.)
Here are the main issues:
- The validation and test MSE for Latent SDE are in the range of 20-30, while Latent SDE's test MSE in the paper is 4.03 ± 0.20.
- The log-likelihood term in the ELBO for Latent SDE is on the order of 10^6 to 10^9, depending on the choice of hyperparameters, while the ODE2VAE code reports its log-likelihood on the order of 10^4.
Here are the main training details I think I am most likely to wrongly interpret from the paper:
- When calculating the log-likelihood, I follow your example: take the sum over all time steps and the data dimensions, then take the mean over the batch (see the sketch after this list). Please let me know if this is correct for the CMU mocap dataset.
- In training, the prediction and the log-likelihood should only be calculated for the last 297 predictions (since the first 3 observations are used to encode the context).
- The solver used is not mentioned. I have tried `euler_heun`, `milstein`, and even `reversible_heun`.
- The initial learning rate for Adam is 0.01 in the paper, but runs with an initial lr of 0.001 seem to be more stable. I'm curious if you have any comments about this.
- The dt for the solver is said to be 1/5 of the minimum time difference between two observations. All observations are regular, so we can choose the minimum time difference to be some value `a` (e.g., 1), and then `dt` would be `0.2 * a`. I want to know if my interpretation here is correct. The paper doesn't mention this value `a` or the start/end time; it would be nice if you remembered them.
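For concreteness, this is roughly how I am computing the log-likelihood term mentioned in the first bullet above; the Normal observation model and the variable names are just my own sketch, not code from the paper or the repo.

```python
import torch
from torch.distributions import Normal

def observation_log_likelihood(xs_pred, xs_target, obs_std):
    # xs_pred, xs_target have shape (time, batch, data_dim); obs_std broadcasts.
    # Sum the per-element log-densities over time steps and data dimensions,
    # then average over the batch.
    log_pxs = Normal(loc=xs_pred, scale=obs_std).log_prob(xs_target)
    return log_pxs.sum(dim=(0, 2)).mean(dim=0)
```

Rough arithmetic: with roughly 297 predicted frames of ~50-dimensional observations, per-element log-densities of order one would give a sum on the order of 10^4, so my 10^6-10^9 values make me suspect either the observation scale or the reduction.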
Tagging @lxuechen since you know the most about the exp details. Thank you for your time!
After looking further, I see that the KL penalty at t0 (the KL divergence between p(z) and q(z)) is already included. I am wondering if we also need the KL penalty at tN (the KL divergence between p(x|z) and q(x|z))?
The initial learning rate for Adam is 0.01 in the paper, but runs with an initial lr of 0.001 seem to be more stable. I'm curious if you have any comments about this.
0.01 worked for my experiments if you apply reasonable decay.
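Something along these lines, purely as an illustration; the exponential schedule and its rate below are assumptions, not the exact decay used for the paper, and `model`, `train_loader`, `num_epochs`, and `compute_elbo_loss` are placeholders.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
# Decay the learning rate by a constant factor each epoch; 0.997 is illustrative.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.997)

for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_elbo_loss(model, batch)  # hypothetical ELBO loss helper
        loss.backward()
        optimizer.step()
    scheduler.step()
```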
After looking further, I see that the KL penalty at t0 (KL divergence between p(z) and q(z)) is already included. I am wondering if we also need the KL penalty at tN (KL divergence between p(x|z) and q(x|z))?
You would need KL at time 0, and the KL between the two stochastic processes.
A nice example of how this is computed can be found here for a toy task. `logqp_path` is what you're looking for; `logqp0` is the KL at time 0.
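Roughly, the two terms look like the following sketch. The names `qz0_mean`, `qz0_logstd`, `pz0_mean`, `pz0_logstd`, `sde`, `ts`, and `dt` are assumptions about your model, not code from the repo; the `sde` module is assumed to define the posterior drift f, the prior drift h, and the shared diffusion g, which is what `logqp=True` needs.

```python
import torch
import torchsde
from torch.distributions import Normal, kl_divergence

# KL at time 0: analytic KL between the diagonal-Gaussian posterior q(z0)
# and prior p(z0), summed over latent dimensions.
qz0 = Normal(loc=qz0_mean, scale=qz0_logstd.exp())
pz0 = Normal(loc=pz0_mean, scale=pz0_logstd.exp())
logqp0 = kl_divergence(qz0, pz0).sum(dim=1)  # shape: (batch,)

# Path KL between the posterior and prior processes: with logqp=True, sdeint
# also returns the log-ratio increments between adjacent time points.
z0 = qz0.rsample()
zs, log_ratio = torchsde.sdeint(sde, z0, ts, dt=dt, logqp=True, method="euler")
logqp_path = log_ratio.sum(dim=0)  # sum increments over time, shape: (batch,)

kl_term = (logqp0 + logqp_path).mean(dim=0)
```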
@nghiahhnguyen were you able to reproduce the results for this dataset?
@lxuechen Could you please clarify which KL reweighting coefficient from {1, 0.1, 0.01, 0.001} worked best in your experiments (for LatentODE and LatentSDE)? Also, was this with or without the linear annealing schedule?
I was able to reduce the Test MSE to the range of [8.x, 1x.x] after some changes, which is much lower than my first attempt but still short of the 4.03 ± 0.20 reported in the paper.
Thanks for your reply, @nghiahhnguyen! Can you please share insights/hyperparameters from your setup that led to the improvement?
On my end, I am getting a lower bound of around 1e7 and a test MSE of around 30.
Hi @abdulfatir, the one point I found most helpful, which I overlooked in my first attempt, is that the standard deviation of the predicted observations should be learnable (as stated in the paper), whereas the example uses a fixed hyperparameter. When I noticed this and changed it, the Test MSE dropped significantly to [8.x, 1x.x], as mentioned above. I'm not sure if you have noticed it already, but I guess it's worth mentioning!
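For concreteness, this is the kind of change I mean. The names and sizes are placeholders; note the paper outputs the scale from the network, whereas this sketch just uses a trainable per-dimension vector, so treat it as one simple variant rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Decoder(nn.Module):
    def __init__(self, latent_size, data_size):
        super().__init__()
        self.projector = nn.Linear(latent_size, data_size)
        # Unconstrained parameter; softplus keeps the actual scale positive.
        self.raw_obs_scale = nn.Parameter(torch.zeros(data_size))

    def forward(self, zs):
        xs_mean = self.projector(zs)
        obs_scale = nn.functional.softplus(self.raw_obs_scale) + 1e-4
        return Normal(loc=xs_mean, scale=obs_scale)
```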
Other than that, I don't recall any significant changes I made. In case it helps, I'm listing some significant hyperparameters in my best run:
- Use the adjoint method.
- Solvers: reversible_heun & adjoint_reversible_heun.
- Step size: solve from t0=0.3 to t1=30 with a dt_ratio of 0.2 (I have no specific reasoning for the choice of t0 and t1, just a random thing I tried).
- Beta (the coefficient of the KL divergence in the training loss): 1 with a linear schedule over 400 epochs; see the sketch after this list. (Looking back at the validation plot, the Val MSE starts dropping significantly below previous runs around epoch 500 and starts plateauing around epoch 670. Maybe this is an interesting correlation!)
- Gradient clipping: 0.5 (I'm using PyTorch Lightning, so this is done with their built-in implementation).
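For clarity, this is the linear annealing I mean for beta; the exact functional form here is just my choice, a sketch rather than anything from the paper.

```python
def kl_beta(epoch, max_beta=1.0, warmup_epochs=400):
    # Ramp the KL coefficient linearly from 0 to max_beta over the warmup
    # epochs, then keep it constant.
    return max_beta * min(1.0, epoch / warmup_epochs)

# Per-batch objective (names are placeholders):
# loss = -log_likelihood + kl_beta(epoch) * kl_term
```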
With the above setting, my log-likelihood is around x*1e4 for training, validation, and testing. Please let me know if you managed to make further progress!
Thanks a lot @nghiahhnguyen! I did use trainable scales in my experiments but only as a trainable vector (not output from a NN as in the paper).
Thanks for sharing your hyperparameters. I am giving it a go now. Will keep you posted. Are you sure about clipping the grad norm to 0.5? Given the magnitude of gradients for this model, this is quite a small value.
Something I noticed when using scales output from a NN for the observation distribution is that the model becomes very susceptible to crashing with NaNs (possibly due to numerical errors). Did you experience something similar?
I'm glad that I can be of help @abdulfatir.
I also experienced a lot of crashing with NaNs; that's why I started with such a small max norm for gradient clipping. I'm using gradient clipping only to stabilize training, so you can try larger values and see if the process stays stable. I remember that trying much larger values did not prevent the NaN-related crashes, but your setup might differ.
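For reference, without Lightning this is just the standard PyTorch utility; the 0.5 max norm is the value from my runs, and the surrounding function is only a sketch with placeholder arguments.

```python
import torch

def training_step(model, optimizer, loss, max_norm=0.5):
    # Backpropagate, clip the global gradient norm to damp the occasional
    # exploding gradients that otherwise crash training with NaNs, then step.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```

In Lightning, I believe passing gradient_clip_val=0.5 to the Trainer is the equivalent of this.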
@nghiahhnguyen @abdulfatir
I am trying to reproduce the results on the Mocap dataset as well; however:
- I'm stuck with an MSE of around 34. Did you calculate your MSE on the Mocap test data or the training data? Any insights into the MSE would be great!
- I also noticed that in the examples the data dimension is Time x Batch x Data, whereas the Mocap data initially comes as Batch x Time x Data. Did you transpose it before training (see the snippet below)? I was wondering if I missed anything here.
- Also, the latent_sde_lorenz example uses a GRU-based encoder, while the paper suggests an MLP encoder with the first 3 frames encoded into the context. Did you make these changes as well?
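For the second point, this is the transpose I am currently applying, assuming the arrays load as batch x time x data; `mocap_array` is a placeholder for however you load the preprocessed data. Please let me know if this matches what you did.

```python
import torch

xs = torch.as_tensor(mocap_array, dtype=torch.float32)  # assumed shape: (batch, time, data)
xs = xs.permute(1, 0, 2).contiguous()  # -> (time, batch, data), as in the examples
```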
It's probably been a while since you worked on this; any help is appreciated! Thank you.
Is anyone willing to share their implementation to reproduce the results of the paper? 🥺 🙏
@matteoguarrera Unfortunately, despite trying for several weeks, I wasn't able to reproduce a number anywhere close to what's reported in the paper. In the end, for these datasets, I just copied the numbers from prior work in our paper: https://arxiv.org/abs/2301.11308