Issue on computing log-likelihood (bpd)
There are three ways to compute bpd (bits per dimension) for a trained model.
- Uniform dequantization: the most popular, but gives the worst bpd (a minimal sketch of this computation follows the list).
- Variational dequantization: quite popular, great bpd.
- Lossless: the way bpd is computed for VAEs and discrete diffusion models, great bpd.
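For concreteness, here is a minimal sketch of what I mean by the unif. deq. bpd for 8-bit data. The `model_log_prob` name is a placeholder for whatever returns the model's log-density in nats, not from any particular codebase:

```python
import torch

def bpd_uniform_dequant(model_log_prob, x_uint8):
    """Uniform-dequantization bpd for 8-bit data.

    `model_log_prob` is a placeholder for a callable that returns
    log p(z) in nats (one scalar per example) for inputs z in [0, 1).
    `x_uint8` holds integer pixel values in {0, ..., 255}.
    """
    x = x_uint8.float()
    u = torch.rand_like(x)                      # u ~ Uniform[0, 1)
    z = (x + u) / 256.0                         # dequantize, rescale to [0, 1)
    dim = x[0].numel()                          # dimensions per example, D
    # log P(x) >= E_u[log p((x + u)/256)] - D*log(256), so in bits/dim:
    bpd = -model_log_prob(z) / (dim * torch.log(torch.tensor(2.0))) + 8.0
    return bpd.mean()
```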
I have spent a few months training the flow network for variational dequantization, but with almost all of the released and well-established codebases I failed to train it successfully. Here, "successful" means that the var. deq. bpd improves on the unif. deq. bpd by around 0.10~0.14, as Song reported in his paper (https://arxiv.org/pdf/2101.09258.pdf). I only got a ~0.02 gain with Song's original flow network, and a ~0.05 gain with the best-performing implementation.
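To be clear about what I am training the flow against: it is the standard variational dequantization bound (Flow++-style). A minimal sketch, where the `dequant_flow` interface is my own assumption rather than Song's code:

```python
import torch

def bpd_variational_dequant(model_log_prob, dequant_flow, x_uint8):
    """Variational-dequantization bpd (Flow++-style bound).

    `dequant_flow` is a placeholder for the flow q(u | x): given x it
    returns a reparameterized sample u in [0, 1)^D and log q(u | x) in
    nats, so the flow can be trained by minimizing this bpd directly.
    """
    x = x_uint8.float()
    u, log_q = dequant_flow(x)                  # u ~ q(u | x)
    z = (x + u) / 256.0
    dim = x[0].numel()
    # log P(x) >= E_q[log p((x + u)/256) - log q(u | x)] - D*log(256)
    elbo = model_log_prob(z) - log_q - dim * torch.log(torch.tensor(256.0))
    bpd = -elbo / (dim * torch.log(torch.tensor(2.0)))
    return bpd.mean()
```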
After this experience, I decided to stop delving into training the unstable flow network. Instead, I focused on the lossless computation following Ho's original DDPM paper. However, it turned out that the lossless bpd is significantly worse than the unif. deq. bpd, and we suspect that our code is wrong.
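For reference, the part of the lossless ELBO that differs from the continuous case is the discretized Gaussian decoder term from Ho's DDPM paper; the remaining terms are the usual Gaussian KL terms over timesteps plus the prior KL. A minimal sketch of that decoder term (not our exact code, and the tensor shapes are assumptions):

```python
import torch

def discretized_gaussian_log_likelihood(x, means, log_scales):
    """Log-likelihood of x under a Gaussian discretized to 8-bit bins,
    as in the decoder term of Ho's DDPM paper. `x` is rescaled to
    [-1, 1] from {0, ..., 255}, so adjacent values are 2/255 apart
    and each bin extends 1/255 on either side of x.
    """
    normal = torch.distributions.Normal(0.0, 1.0)
    centered = x - means
    inv_std = torch.exp(-log_scales)
    cdf_plus = normal.cdf(inv_std * (centered + 1.0 / 255.0))   # upper bin edge
    cdf_min = normal.cdf(inv_std * (centered - 1.0 / 255.0))    # lower bin edge
    # The first and last bins absorb the tails so the probabilities sum to 1.
    log_probs = torch.where(
        x < -0.999,
        torch.log(cdf_plus.clamp(min=1e-12)),
        torch.where(
            x > 0.999,
            torch.log((1.0 - cdf_min).clamp(min=1e-12)),
            torch.log((cdf_plus - cdf_min).clamp(min=1e-12)),
        ),
    )
    return log_probs.flatten(1).sum(dim=1)      # nats per example
```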
The problem is that we cannot find anything wrong in our code. If anyone has succeeded at the lossless computation, please share your know-how.
As always, reviewers do not consider this effort, but it is extremely unfair to compare a unif. deq. bpd against prior works that report var. deq. bpds. This unfair comparison is a potential cause of paper rejection, so we are left to spend our precious time training the variational dequantization flow network until the training succeeds. This is a huge waste of time for fellow researchers. If the lossless bpd computation were stable and successful, it would allow a fair comparison with prior work without any flow training. So please, let's find a way to compute the lossless bpd that is as cheap as the bpd with uniform dequantization.
We finally (!) solved the problem by adopting the architecture borrowed from DenseFlow. The variational dequantization module from DenseFlow turns out to be highly stable in training with a 2e-4 learning rate. See https://github.com/matejgrcic/DenseFlow for reference. The final performance after variational dequantization is reported in the newest version of our paper (https://arxiv.org/pdf/2106.05527.pdf).