stepankonev/waymo-motion-prediction-challenge-2022-multipath-plus-plus

Huge Negative Loss During Training

SwagJ opened this issue · 14 comments

SwagJ commented

Hi @stepankonev,

When I followed your instructions and trained your implementation, the training loss dropped below -200 within the early iterations of the first epoch. Did this also happen when you trained your network? I am looking forward to your reply. Thank you in advance.

Best,

Hi,
You may try training with normalize_output: False to avoid normalizing the ground truth and the predictions. This should give you the loss values you are used to.

SwagJ commented

Hi @stepankonev,

The negative loss actually happens with your default config, which has normalize_output: True. If I set it to False, an assertion error on the finiteness of covariance_matrices is raised. Also, during training there is one batch with a mismatched batch size. The error message is as follows:

Traceback (most recent call last):
  File "train.py", line 119, in <module>
    probas, coordinates, _, _ = model(data, num_steps)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/motion/multipath_pp/code/model/multipathpp.py", line 64, in forward
    final_embedding = torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 0. Got 41 and 42 (The offending index is 0)

Does this happen because the normalize flag differs between final_RoP_Cov_Single.yaml and prerender.yaml? I am looking forward to your reply.

Best,
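For context on that RuntimeError: torch.cat requires all tensors to agree in every dimension except the one being concatenated, so if the element counts in dimension 0 of the embeddings being fused diverge (here 41 vs 42), the call fails. A minimal repro, independent of this repo (the exact error wording varies across PyTorch versions):

import torch

# Two embeddings that should describe the same batch of elements,
# but one ends up with an extra entry in dimension 0 (41 vs 42).
a = torch.randn(41, 128)
b = torch.randn(42, 128)

try:
    # Concatenating along the feature dimension still requires
    # dimension 0 (the batch dimension) to match.
    torch.cat([a, b], dim=-1)
except RuntimeError as e:
    print(e)  # "Sizes of tensors must match except in dimension ..."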

The negative loss actually happens with your default config, which has normalize_output: True. If I set it to False, an assertion error on the finiteness of covariance_matrices is raised.

@SwagJ I have the same problem when setting normalize_output to False or setting normalize to False.

SwagJ commented

Hi @ares89,

So when you set both normalize_output and normalize to True, does the negative loss problem disappear, or is the assertion error still triggered?

When I set both normalize_output and normalize to True, the loss in my training is negative in the first epoch and drops to around -400 over the following 5 epochs.
I work with the Scenario format of the Waymo motion dataset, not the tfExample format.

SwagJ commented

Well, I used the tfExample format. I guess the mismatched batch dimension might come from that. Would you kindly tell us which data format you were using? @stepankonev

I do not have the exact loss values right now, but generally speaking a negative loss is not a problem and not an error here; I guess your values should be OK. As for the batch size mismatch: I will check it later, I don't remember facing this problem. However, I guess it is not related to the tfScenario format.
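For anyone puzzled by the sign: the objective is a negative log-likelihood of a continuous distribution, and the log-density of a Gaussian can easily be positive, so its negative can go far below zero, especially when the targets are normalized to a small scale and the predicted covariances shrink. A minimal sketch (not this repo's loss code) of how negative the per-point NLL already is for a modest 2D Gaussian:

import torch
from torch.distributions import MultivariateNormal

# A 2D Gaussian with small standard deviations, roughly what you get once
# the ground truth is normalized and the model becomes confident.
cov = torch.diag(torch.tensor([0.05, 0.05])) ** 2
dist = MultivariateNormal(loc=torch.zeros(2), covariance_matrix=cov)

target = torch.tensor([0.01, -0.02])  # a well-predicted (normalized) point
nll = -dist.log_prob(target)          # negative log-likelihood of that point
print(nll.item())                     # about -4.05; it gets more negative as
                                      # the covariance shrinks further

Summed over the 80 prediction timesteps, per-point values like that land in the same few-hundred-below-zero range reported in this thread.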

@SwagJ If you set trainable_cov to False in decoder_handler_config, no assertion error on the finiteness of covariance_matrices will be raised.
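A guess at why that helps (hypothetical parameterization, not necessarily what the repo does): if the covariance is built from unbounded network outputs via exp(), early-training outliers can overflow to inf and trip the finiteness assertion, whereas a non-trainable covariance is just a fixed, well-conditioned matrix:

import torch

def covariance_from_params(log_std_x, log_std_y, rho_raw):
    # Hypothetical covariance head: std = exp(log_std), rho = tanh(rho_raw).
    # Large log_std values early in training make exp() overflow to inf,
    # which is the kind of thing a finiteness assertion would catch.
    std_x, std_y = torch.exp(log_std_x), torch.exp(log_std_y)
    rho = torch.tanh(rho_raw)
    cov_xy = rho * std_x * std_y
    return torch.stack([
        torch.stack([std_x * std_x, cov_xy], dim=-1),
        torch.stack([cov_xy, std_y * std_y], dim=-1),
    ], dim=-2)

# With trainable_cov: False the decoder can instead use a constant,
# always-finite covariance such as the identity.
fixed_cov = torch.eye(2)

print(covariance_from_params(torch.tensor(100.0), torch.tensor(0.0), torch.tensor(0.0)))
# exp(100) overflows float32 -> inf/nan entries in the covariance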

SwagJ commented

Hi @ares89,

Did you come across the mismatched batch dimension when using the Scenario format?

No, I haven't

SwagJ commented

Hi @ares89 , I see. Thank you then.

The negative loss actually happens with your default config, which has normalize_output: True. If I set it to False, an assertion error on the finiteness of covariance_matrices is raised.

@SwagJ I have the same problem when setting normalize_output to False or setting normalize to False.

How about changing running_mean_mode from "real" to "sliding"? It might help avoid the infinite value problem.
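For reference, under the usual meaning of those names (the repo's implementation may differ in details), a "real" running mean accumulates statistics over everything seen so far, while a "sliding" mean is an exponential moving average of recent batches; the latter adapts faster and tends to keep the normalization variance away from degenerate values. A toy sketch of the two modes:

import torch

class RunningNorm:
    """Toy normalizer illustrating the two modes; not the repo's implementation."""
    def __init__(self, mode="sliding", momentum=0.01, eps=1e-6):
        self.mode, self.momentum, self.eps = mode, momentum, eps
        self.count = 0
        self.mean = None
        self.sq_mean = None

    def update(self, x):  # x: (batch, features)
        batch_mean = x.mean(dim=0)
        batch_sq = (x ** 2).mean(dim=0)
        if self.mean is None:
            self.mean, self.sq_mean = batch_mean, batch_sq
        elif self.mode == "real":
            # cumulative average over all samples seen so far
            n = self.count
            self.mean = (self.mean * n + batch_mean * len(x)) / (n + len(x))
            self.sq_mean = (self.sq_mean * n + batch_sq * len(x)) / (n + len(x))
        else:  # "sliding": exponential moving average of recent batches
            m = self.momentum
            self.mean = (1 - m) * self.mean + m * batch_mean
            self.sq_mean = (1 - m) * self.sq_mean + m * batch_sq
        self.count += len(x)

    def normalize(self, x):
        # clamp the variance so division never produces inf/nan
        var = (self.sq_mean - self.mean ** 2).clamp_min(self.eps)
        return (x - self.mean) / var.sqrt()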

The negative loss actually happens with your default config, which has normalize_output: True. [...] Also, during training there is one batch with a mismatched batch size.

Hi @SwagJ, I met the same problem: the loss drops quickly and finally settles at around -300 when setting both normalize_output and normalize to True. As for the batch_size problem, once the number of samples in both the training and validation folders is a multiple of the batch size, the mismatch error no longer shows up. But I have no idea whether that is truly the solution or just a coincidence.
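That observation suggests the final, incomplete batch is involved. If so, a simpler workaround than making the dataset size a multiple of the batch size may be to drop the last batch in the loader (placeholder dataset names below; adapt to however train.py builds its loaders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets; substitute the repo's actual Dataset objects.
train_dataset = TensorDataset(torch.randn(1000, 8))
val_dataset = TensorDataset(torch.randn(200, 8))

# drop_last=True discards the final incomplete batch, so every batch the model
# sees has exactly batch_size samples, and the embeddings concatenated in
# forward() agree in dimension 0.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, drop_last=True)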

When I followed your instructions and trained your implementation, the training loss dropped below -200 within the early iterations of the first epoch.

Is it caused by the coefficient n, as in this material, page 3?
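For reference (the linked material is not reproduced here), the negative log-likelihood of an n-dimensional Gaussian does contain a constant that depends only on the dimension n, alongside a log-determinant term, and neither bounds the loss below by zero:

$$
-\log p(x) = \tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) + \tfrac{1}{2}\log\lvert\Sigma\rvert + \tfrac{n}{2}\log(2\pi)
$$

With n = 2 per timestep the constant term is about +1.84, so by itself it does not make the loss negative; the large negative values come mainly from $\tfrac{1}{2}\log\lvert\Sigma\rvert$ once the (normalized) covariances become small.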