Question about loss in validation phase
marctimjen opened this issue · 4 comments
Hello Zhihan Gao
Thank you very much for sharing the code for this cool project!
I am currently trying to train the VAE. I notice that my training loss is quite low, about -80000, and has been decreasing over the last epochs:
Which is good - it looks like the model is learning something! But my validation loss shows quite the opposite:
An increase, with a final loss around 4000.
I believe the issue is related to the d_weight that is calculated in the loss:
The final loss is calculated by:
loss = weighted_nll_loss + self.kl_weight * kl_loss + d_weight * disc_factor * g_loss
And d_weight, disc_factor > 0 while g_loss < 0, meaning the loss is minimized by making d_weight as large as possible (5000 in our case).
But in the validation step we always get d_weight = 0: nll_loss and g_loss carry no gradients during validation, so this code (link to loss) always raises a runtime error and d_weight falls back to its default of 0.
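For context, here is a minimal sketch of the adaptive-weight pattern used in taming-transformers, which this loss is adapted from (function and variable names here are illustrative, not the exact repository code). It reproduces the behavior described above: during training the weight is computed from gradient norms, while during validation `torch.autograd.grad` raises a `RuntimeError` because the losses have no computation graph, and the code falls back to 0:

```python
import torch

def calculate_adaptive_weight(nll_loss, g_loss, last_layer):
    # Balance reconstruction vs. adversarial gradients at the last decoder layer,
    # following the taming-transformers recipe (norm ratio, clamped, detached).
    nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
    g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
    d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)
    return torch.clamp(d_weight, 0.0, 1e4).detach()

# Toy stand-in for the last decoder layer's weights, with a graph attached.
w = torch.randn(4, requires_grad=True)
nll_loss = (w ** 2).sum()
g_loss = -w.sum()

# Training phase: gradients exist, so a positive adaptive weight is computed.
d_weight = calculate_adaptive_weight(nll_loss, g_loss, w)

# Validation phase: losses built under no_grad carry no graph, so the
# gradient call fails and d_weight defaults to 0, as observed in the issue.
with torch.no_grad():
    nll_val = (w.detach() ** 2).sum()
    g_val = -w.detach().sum()
try:
    d_weight_val = calculate_adaptive_weight(nll_val, g_val, w)
except RuntimeError:
    d_weight_val = torch.tensor(0.0)

print(float(d_weight), float(d_weight_val))
```

So the zero validation d_weight is a deliberate fallback rather than a bug in itself: the adaptive weight is only meaningful when gradients are available.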
Is this behavior intended? If so, I hope you can help me understand why :).
Thank you very much for your help in advance :).
Thank you for reporting this issue. Your observation is correct: d_weight is always 0 during validation. As mentioned in the code comments of contperceptual.py, the implementation is adapted from the standard practice in https://github.com/CompVis/taming-transformers, with only minimal changes. In our trials, val/total_loss decreases over time, as shown in the following charts. Based on the charts you provided, I suspect the training is still at an early stage and far from convergence.
Hello again Zhihan Gao
Thank you very much for your response. You are correct that this is early in the training :).
I was wondering if I could see the logs of your metrics for the training, if you have them available.
Thank you very much in advance :).
Thank you very much :)