KL divergence term in loss function

Question

Closed this issue 6 years ago · 1 comment

Hi,
I believe there is a bug in the implementation of the VAE loss. The KL term takes the mean over the latent dimensions rather than the sum. That is, the line:
kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
should be:
kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)

The same bug is present in the original character-based SMILES autoencoder:
maxhodak/keras-molecules#59
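
To make the difference between the two lines concrete, here is a minimal NumPy sketch (not the repository's Keras code). The batch size, latent size, and the helper name `kl_terms` are assumptions chosen for illustration. It shows that the mean version is just the sum version divided by the latent dimension, so using the mean amounts to weighting the KL term by 1/latent_dim.

```python
import numpy as np


def kl_terms(z_mean, z_log_var):
    """Per-dimension KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.

    Hypothetical helper for illustration; same expression as in the issue,
    but without the reduction over the latent axis.
    """
    return -0.5 * (1.0 + z_log_var - np.square(z_mean) - np.exp(z_log_var))


# Assumed sizes, not taken from the repository.
batch_size, latent_dim = 4, 56

rng = np.random.default_rng(0)
z_mean = rng.normal(size=(batch_size, latent_dim))
z_log_var = rng.normal(size=(batch_size, latent_dim))

per_dim = kl_terms(z_mean, z_log_var)

kl_mean = per_dim.mean(axis=-1)  # reduction used in the current code
kl_sum = per_dim.sum(axis=-1)    # reduction proposed in this issue

# The two differ only by a constant factor equal to the latent dimension.
assert np.allclose(kl_sum, latent_dim * kl_mean)
```

Under these assumptions, switching from mean to sum multiplies the KL term by 56 relative to the reconstruction term, which changes the balance between the two parts of the loss.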

Answer 1 · 2017-08-23T13:03:42.000Z

Thanks for pointing this out. Taking the mean instead of the sum actually improves performance: it down-weights the KL term, which helps make sure that the encoder doesn't start by pushing the KL term to 0, a problem also encountered by Bowman et al. (https://arxiv.org/pdf/1511.06349.pdf). A more principled approach would probably be to put a scalar weight in front of the KL term and anneal it from 0 to 1, as Bowman et al. did. That way you would be optimizing the true VAE loss by the end of training. But the mean seems to work as well, and replacing it with a sum actually hurts training performance a lot.
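
For reference, the annealing idea could look roughly like the sketch below. The function names `kl_weight` and `vae_loss`, the `warmup_steps` default, and the linear ramp are all assumptions for illustration; this is neither the code in this repository nor necessarily the exact schedule used by Bowman et al.

```python
def kl_weight(step, warmup_steps=10_000):
    """Hypothetical linear schedule: 0 at step 0, rising to 1 at warmup_steps,
    then held at 1 for the rest of training."""
    return min(1.0, max(0.0, step / warmup_steps))


def vae_loss(reconstruction_loss, kl_loss, step, warmup_steps=10_000):
    """Reconstruction loss plus the annealed KL term.

    kl_loss is assumed to be the summed (not averaged) KL over the latent
    dimensions, so that the weight of 1 at the end of the warm-up recovers
    the standard VAE objective.
    """
    return reconstruction_loss + kl_weight(step, warmup_steps) * kl_loss


# Early in training the KL term is ignored; after warm-up it counts fully.
early = vae_loss(reconstruction_loss=2.0, kl_loss=4.0, step=0)
halfway = vae_loss(reconstruction_loss=2.0, kl_loss=4.0, step=5_000)
late = vae_loss(reconstruction_loss=2.0, kl_loss=4.0, step=20_000)
```

In a Keras setup the weight would have to be a variable updated during training (for example from a callback) rather than a plain Python number; that wiring is left out here.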