gaozhihan/PreDiff

Question about loss in validation phase

marctimjen opened this issue · 4 comments

Hello Zhihan Gao

Thank you very much for sharing the code for this cool project!

I am currently trying to train the VAE. I noticed that my training loss is quite low, around -80000, and that it has decreased over the last epochs:

[chart: training loss decreasing over epochs]

Which is good: it looks like the model is learning something! But my validation loss shows quite the opposite:

[chart: validation loss increasing over epochs]

An increase, with a final loss of around 4000.

I traced the issue to the d_weight that is calculated in the loss:

Link to loss

The final loss is calculated by:
loss = weighted_nll_loss + self.kl_weight * kl_loss + d_weight * disc_factor * g_loss

Here d_weight, disc_factor > 0 but g_loss < 0, which means the loss is minimized by making d_weight as large as possible (in our case 5000).
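A tiny numeric sketch of the sign argument above (the values are hypothetical; only the combination formula comes from the loss quoted earlier): when g_loss is negative, a larger d_weight just drives the total loss further down.

```python
# Hypothetical values; only the weighting formula mirrors the loss above.
def total_loss(weighted_nll, kl_loss, g_loss,
               kl_weight=1e-6, d_weight=1.0, disc_factor=1.0):
    # Same combination as in the quoted loss:
    # loss = weighted_nll_loss + kl_weight * kl_loss + d_weight * disc_factor * g_loss
    return weighted_nll + kl_weight * kl_loss + d_weight * disc_factor * g_loss

g_loss = -2.0  # the generator (adversarial) term can go negative

print(total_loss(100.0, 10.0, g_loss, d_weight=1.0))     # modest total
print(total_loss(100.0, 10.0, g_loss, d_weight=5000.0))  # strongly negative
```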

But in the validation step we always get d_weight = 0, since nll_loss and g_loss do not carry gradients in the validation phase. This means that this code (link to loss) always raises a RuntimeError, so d_weight defaults to 0.
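For reference, here is a minimal sketch of the mechanism described above (assumed to mirror the adaptive-weight logic adapted from taming-transformers; function and variable names are illustrative, not the repo's exact ones): torch.autograd.grad needs a backward graph, and under torch.no_grad() (the validation context) none is built, so the call raises a RuntimeError that is caught and replaced with 0.

```python
import torch

def adaptive_d_weight(nll_loss, g_loss, last_layer, disc_weight=1.0):
    # Balance the GAN term against the reconstruction term via gradient norms.
    try:
        nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
        g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
        d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)
        d_weight = torch.clamp(d_weight, 0.0, 1e4).detach()
    except RuntimeError:
        # No backward graph available (e.g. validation under no_grad):
        # fall back to 0, disabling the adversarial term.
        d_weight = torch.tensor(0.0)
    return d_weight * disc_weight

layer = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)

# Training-style call: the losses are attached to a graph, gradients exist.
out = layer(x)
w_train = adaptive_d_weight(out.pow(2).mean(), out.abs().mean(), layer.weight)

# Validation-style call: under no_grad() the graph is never built,
# autograd.grad raises RuntimeError, and d_weight defaults to 0.
with torch.no_grad():
    out = layer(x)
    w_val = adaptive_d_weight(out.pow(2).mean(), out.abs().mean(), layer.weight)

print(w_train.item(), w_val.item())
```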

Is this behavior anticipated? If yes, then I hope you would help me figure out why this is the case :).

Thank you very much for your help in advance :).

Thank you for reporting this issue. Your observation is correct: d_weight is always 0 during validation. As mentioned in the code comments of contperceptual.py, the implementation is adapted from the standard practice in https://github.com/CompVis/taming-transformers, with only minimal changes. In our trials, val/total_loss decreases over time, as shown in the following charts. Based on the charts you provided, I suspect the cause is that your training is still at an early stage and far from convergence.

[chart VAE_val: val/total_loss decreasing over training]

Hello again Zhihan Gao

Thank you very much for your response. You are correct that this is still the beginning of the training :).

I was wondering if I could see the logged metrics from your training, if you have them available.

Thank you very much in advance :).

Sure. Here is an example training log at its early stage.
[charts train_log_1, train_log_2: training metrics at an early stage]

Thank you very much :)