Huge Negative Loss During Training
SwagJ opened this issue · 14 comments
Hi @stepankonev,
When I followed your instructions and trained your implementation, the training loss dropped below -200 within the early iterations of the first epoch. Did this also happen when you trained your network? I am looking forward to your reply. Thank you in advance.
Best,
Hi,
You may try training with normalize_output: False to avoid normalizing the ground truth and the predictions. This should give you the loss values you are used to.
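A quick illustration of why the two settings give differently scaled losses: normalizing the targets rescales the density the NLL is evaluated on, so the loss shifts by roughly the log of the normalization scale. This is a generic sketch with a made-up scale of 10, not the repo's actual statistics:

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
scale = 10.0                      # hypothetical normalization scale, not the repo's value
y = torch.randn(1000) * scale     # "ground truth" in original units

# NLL under a Gaussian fitted in original units...
nll_orig = -Normal(y.mean(), y.std()).log_prob(y).mean()
# ...versus the NLL after normalizing the targets the same way
y_norm = y / scale
nll_norm = -Normal(y_norm.mean(), y_norm.std()).log_prob(y_norm).mean()

print(nll_orig - nll_norm)        # ~ log(scale) ~= 2.3: a constant downward shift per dimension
```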
Hi @stepankonev,
The negative loss actually happened using your default config, which is normalize_output: True. If it is set to False, an assertion error on the finiteness of covariance_matrices is raised. Also, during training there is one batch with a mismatched batch_size; the error message is as follows:
```
Traceback (most recent call last):
  File "train.py", line 119, in <module>
    probas, coordinates, _, _ = model(data, num_steps)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/motion/multipath_pp/code/model/multipathpp.py", line 64, in forward
    final_embedding = torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 0. Got 41 and 42 (The offending index is 0)
```
Does this happen because the normalize flag is different in final_RoP_Cov_Single.yaml and prerender.yaml? I am looking forward to your reply.
Best,
The negative loss actually happened using your default config, which is normalize_output: True. If it is set to False, an assertion error on the finiteness of covariance_matrices is raised.
@SwagJ I have the same problem when setting normalize_output to False or setting normalize to False.
Hi @ares89,
So, when you set normalize_output and normalize both to True, does the negative loss problem disappear, or is the assertion error triggered?
When setting normalize_output and normalize both to True, the loss in my training is negative in the first epoch and drops to around -400 over the following 5 epochs.
I work with the scenario format of the Waymo motion dataset, not the tfExample format.
Well, I used the tfExample format. I guess the mismatched batch dimension problem might come from this. Would you kindly tell us which data format you were using? @stepankonev
I do not have the exact loss values right now, but generally speaking a negative loss is not a problem and not an error here; I guess your values should be OK. As for the batch size mismatch: I will check it later, I don't remember facing this problem. However, I guess it is not about the tfScenario format.
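To illustrate the point about negative losses being normal: the loss here is a negative log-likelihood of a continuous distribution, and a probability density can exceed 1, so its negative log can legitimately go below zero. A standalone sketch (not the repo's loss code):

```python
import torch
from torch.distributions import MultivariateNormal

mean = torch.zeros(2)
cov = 0.1 * torch.eye(2)                 # a confident (small-covariance) prediction
target = torch.tensor([0.05, -0.05])     # ground truth close to the predicted mean

nll = -MultivariateNormal(mean, cov).log_prob(target)
print(nll)                               # ~ -0.44: a perfectly valid negative loss value
```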
@SwagJ If you set trainable_cov to False in decoder_handler_config, the assertion error on the finiteness of covariance_matrices will not be raised.
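As an aside, the finiteness assertion typically trips when unconstrained covariance parameters overflow. A generic way to keep a predicted 2x2 covariance finite and positive-definite is to clamp the predicted log-stds and squash the correlation; this is only an illustrative sketch, not what this repository actually does:

```python
import torch

def build_cov(raw):                        # raw: (..., 3) = [log_sx, log_sy, rho_raw], hypothetical head output
    log_sx = raw[..., 0].clamp(-5.0, 5.0)  # keep exp() away from inf and 0
    log_sy = raw[..., 1].clamp(-5.0, 5.0)
    rho = torch.tanh(raw[..., 2]) * 0.99   # keep the correlation strictly inside (-1, 1)
    sx, sy = log_sx.exp(), log_sy.exp()
    return torch.stack([
        torch.stack([sx * sx, rho * sx * sy], dim=-1),
        torch.stack([rho * sx * sy, sy * sy], dim=-1),
    ], dim=-2)                             # finite and positive-definite by construction

print(build_cov(torch.randn(4, 3)).isfinite().all())  # tensor(True)
```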
Hi @ares89,
Did you come across the mismatched batch dimension when using Scenario format?
No, I haven't
The negative loss actually happened using your default config, which is normalize_output: True. If it is set to False, an assertion error on the finiteness of covariance_matrices is raised.
@SwagJ I have the same problem when setting normalize_output to False or setting normalize to False.
How about changing running_mean_mode from "real" to "sliding"? It might help avoid the infinite value problem.
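For anyone wondering what the two modes roughly correspond to: a cumulative mean weights every batch seen so far equally, while a sliding (exponential) mean forgets old batches, so a few extreme early batches cannot permanently skew the normalization statistics. Rough sketch only; the class names and the 0.99 decay are illustrative, not taken from the repo:

```python
class RunningMean:                       # "real": exact mean over everything seen so far
    def __init__(self):
        self.mean, self.count = 0.0, 0

    def update(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count

class SlidingMean:                       # "sliding": exponential moving average
    def __init__(self, decay=0.99):
        self.mean, self.decay = 0.0, decay

    def update(self, x):
        self.mean = self.decay * self.mean + (1 - self.decay) * x
```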
Hi @stepankonev,
The negative loss actually happened using your default config, which is normalize_output: True. If it is set to False, an assertion error on the finiteness of covariance_matrices is raised. Also, during training there is one batch with a mismatched batch_size; the error message is as follows:
```
Traceback (most recent call last):
  File "train.py", line 119, in <module>
    probas, coordinates, _, _ = model(data, num_steps)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/motion/multipath_pp/code/model/multipathpp.py", line 64, in forward
    final_embedding = torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 0. Got 41 and 42 (The offending index is 0)
```
Does this happen because the normalize flag is different in final_RoP_Cov_Single.yaml and prerender.yaml? I am looking forward to your reply. Best,
Hi @SwagJ, I met the same problem: with normalize_output and normalize both set to True, the loss drops quickly and finally settles at around -300. As for the batch_size problem, once the number of samples in both the training folder and the validation folder is a multiple of the batch_size, the mismatch error no longer shows up. But I have no idea whether this is truly the solution or just a coincidence.
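That workaround is consistent with the error coming from the final, smaller batch. If that is indeed the cause, dropping the incomplete batch in the DataLoader would be a less manual alternative to padding the dataset size; a small self-contained sketch (the dummy dataset is just for demonstration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(105, 8))        # 105 samples: not a multiple of 32
loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
print([batch[0].shape[0] for batch in loader])      # [32, 32, 32] -- the ragged last batch is gone
```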
Hi @stepankonev,
When I followed your instructions and trained your implementation, the training loss dropped below -200 within the early iterations of the first epoch. Did this also happen when you trained your network? I am looking forward to your reply. Thank you in advance.
Best,
Is it caused by the coefficient n, as in this material, page 3?
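For reference, the negative log-likelihood of an n-dimensional Gaussian (the general textbook form, not necessarily the exact loss in this repo) is

\mathrm{NLL}(x) = \tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu) + \tfrac{1}{2}\log\det\Sigma + \tfrac{n}{2}\log(2\pi)

The \tfrac{n}{2}\log(2\pi) term is a positive constant, so it alone cannot push the loss below zero; the strongly negative values come from \tfrac{1}{2}\log\det\Sigma when the predicted covariance is small, which normalized outputs make much more likely.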