stepankonev/waymo-motion-prediction-challenge-2022-multipath-plus-plus

Huge Negative Loss During Training

SwagJ opened this issue · 14 comments

SwagJ commented

Hi @stepankonev,

When I followed your instructions and trained your implementation, the training loss dropped below -200 within the early iterations of the first epoch. Did this also happen when you trained your network? I am looking forward to your reply. Thank you in advance.

Best,

Hi,
You may try training with normalize_output: False to avoid normalizing the ground truth and the predictions. This should give you the loss values you are used to.

SwagJ commented

Hi @stepankonev,

The negative loss actually happens with your default config, which has normalize_output: True. If I set it to False, an assertion error on the finiteness of covariance_matrices is raised. Also, during training there is one batch with a mismatched batch size. The error message is as follows:

Traceback (most recent call last):
  File "train.py", line 119, in <module>
    probas, coordinates, _, _ = model(data, num_steps)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/motion/multipath_pp/code/model/multipathpp.py", line 64, in forward
    final_embedding = torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 0. Got 41 and 42 (The offending index is 0)

Does this happen because the normalize flag differs between final_RoP_Cov_Single.yaml and prerender.yaml? I am looking forward to your reply.

Best,
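For context on that RuntimeError: torch.cat requires all tensors to agree in every dimension except the one being concatenated, so if the element counts in dimension 0 of the embeddings being fused diverge (here 41 vs 42), the call fails. A minimal repro, independent of this repo (the exact error wording varies across PyTorch versions):

import torch

# Two embeddings that should describe the same batch of elements,
# but one ends up with an extra entry in dimension 0 (41 vs 42).
a = torch.randn(41, 128)
b = torch.randn(42, 128)

try:
    # Concatenating along the feature dimension still requires
    # dimension 0 (the batch dimension) to match.
    torch.cat([a, b], dim=-1)
except RuntimeError as e:
    print(e)  # "Sizes of tensors must match except in dimension ..."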

The negative loss actually happens with your default config, which has normalize_output: True. If I set it to False, an assertion error on the finiteness of covariance_matrices is raised.

@SwagJ I have the same problem when setting normalize_output to False or setting normalize to False.

SwagJ commented

Hi @ares89,

So when you set both normalize_output and normalize to True, does the negative loss problem disappear, or is the assertion error still triggered?

When I set both normalize_output and normalize to True, the loss in my training is negative in the first epoch and drops to around -400 over the following 5 epochs.
I work with the Scenario format of the Waymo motion dataset, not the tfExample format.

SwagJ commented

Well, I used the tfExample format. I guess the mismatched batch dimension might come from that. Would you kindly tell us which data format you were using? @stepankonev

I do not have the exact loss values right now, but generally speaking a negative loss is not a problem and not an error here; I guess your values should be OK. As for the batch size mismatch: I will check it later, I don't remember facing this problem. However, I guess it is not related to the tfScenario format.
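For anyone puzzled by the sign: the objective is a negative log-likelihood of a continuous distribution, and the log-density of a Gaussian can easily be positive, so its negative can go far below zero, especially when the targets are normalized to a small scale and the predicted covariances shrink. A minimal sketch (not this repo's loss code) of how negative the per-point NLL already is for a modest 2D Gaussian:

import torch
from torch.distributions import MultivariateNormal

# A 2D Gaussian with small standard deviations, roughly what you get once
# the ground truth is normalized and the model becomes confident.
cov = torch.diag(torch.tensor([0.05, 0.05])) ** 2
dist = MultivariateNormal(loc=torch.zeros(2), covariance_matrix=cov)

target = torch.tensor([0.01, -0.02])  # a well-predicted (normalized) point
nll = -dist.log_prob(target)          # negative log-likelihood of that point
print(nll.item())                     # about -4.05; it gets more negative as
                                      # the covariance shrinks further

Summed over the 80 prediction timesteps, per-point values like that land in the same few-hundred-below-zero range reported in this thread.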

@SwagJ If you set trainable_cov to False in decoder_handler_config, no assertion error on the finiteness of covariance_matrices will be raised.
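A guess at why that helps (hypothetical parameterization, not necessarily what the repo does): if the covariance is built from unbounded network outputs via exp(), early-training outliers can overflow to inf and trip the finiteness assertion, whereas a non-trainable covariance is just a fixed, well-conditioned matrix:

import torch

def covariance_from_params(log_std_x, log_std_y, rho_raw):
    # Hypothetical covariance head: std = exp(log_std), rho = tanh(rho_raw).
    # Large log_std values early in training make exp() overflow to inf,
    # which is the kind of thing a finiteness assertion would catch.
    std_x, std_y = torch.exp(log_std_x), torch.exp(log_std_y)
    rho = torch.tanh(rho_raw)
    cov_xy = rho * std_x * std_y
    return torch.stack([
        torch.stack([std_x * std_x, cov_xy], dim=-1),
        torch.stack([cov_xy, std_y * std_y], dim=-1),
    ], dim=-2)

# With trainable_cov: False the decoder can instead use a constant,
# always-finite covariance such as the identity.
fixed_cov = torch.eye(2)

print(covariance_from_params(torch.tensor(100.0), torch.tensor(0.0), torch.tensor(0.0)))
# exp(100) overflows float32 -> inf/nan entries in the covariance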

SwagJ commented

Hi @ares89,

Did you come across the mismatched batch dimension when using the Scenario format?

No, I haven't

SwagJ commented

Hi @ares89 , I see. Thank you then.

The negative loss actually happens with your default config, which has normalize_output: True. If I set it to False, an assertion error on the finiteness of covariance_matrices is raised.

@SwagJ I have the same problem when setting normalize_output to False or setting normalize to False.

How about changing running_mean_mode from "real" to "sliding"? It might help avoid the infinite value problem.
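For reference, under the usual meaning of those names (the repo's implementation may differ in details), a "real" running mean accumulates statistics over everything seen so far, while a "sliding" mean is an exponential moving average of recent batches; the latter adapts faster and tends to keep the normalization variance away from degenerate values. A toy sketch of the two modes:

import torch

class RunningNorm:
    """Toy normalizer illustrating the two modes; not the repo's implementation."""
    def __init__(self, mode="sliding", momentum=0.01, eps=1e-6):
        self.mode, self.momentum, self.eps = mode, momentum, eps
        self.count = 0
        self.mean = None
        self.sq_mean = None

    def update(self, x):  # x: (batch, features)
        batch_mean = x.mean(dim=0)
        batch_sq = (x ** 2).mean(dim=0)
        if self.mean is None:
            self.mean, self.sq_mean = batch_mean, batch_sq
        elif self.mode == "real":
            # cumulative average over all samples seen so far
            n = self.count
            self.mean = (self.mean * n + batch_mean * len(x)) / (n + len(x))
            self.sq_mean = (self.sq_mean * n + batch_sq * len(x)) / (n + len(x))
        else:  # "sliding": exponential moving average of recent batches
            m = self.momentum
            self.mean = (1 - m) * self.mean + m * batch_mean
            self.sq_mean = (1 - m) * self.sq_mean + m * batch_sq
        self.count += len(x)

    def normalize(self, x):
        # clamp the variance so division never produces inf/nan
        var = (self.sq_mean - self.mean ** 2).clamp_min(self.eps)
        return (x - self.mean) / var.sqrt()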

The negative loss actually happens with your default config, which has normalize_output: True. [...] Also, during training there is one batch with a mismatched batch size.

Hi @SwagJ, I met the same problem: the loss drops quickly and finally settles at around -300 when setting both normalize_output and normalize to True. As for the batch_size problem, once the number of samples in both the training and validation folders is a multiple of the batch size, the mismatch error no longer shows up. But I have no idea whether that is truly the solution or just a coincidence.
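That observation suggests the final, incomplete batch is involved. If so, a simpler workaround than making the dataset size a multiple of the batch size may be to drop the last batch in the loader (placeholder dataset names below; adapt to however train.py builds its loaders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets; substitute the repo's actual Dataset objects.
train_dataset = TensorDataset(torch.randn(1000, 8))
val_dataset = TensorDataset(torch.randn(200, 8))

# drop_last=True discards the final incomplete batch, so every batch the model
# sees has exactly batch_size samples, and the embeddings concatenated in
# forward() agree in dimension 0.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, drop_last=True)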

When I followed your instructions and trained your implementation, the training loss dropped below -200 within the early iterations of the first epoch.

Is it caused by the coefficient n, as in this material, page 3?
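For reference (the linked material is not reproduced here), the negative log-likelihood of an n-dimensional Gaussian does contain a constant that depends only on the dimension n, alongside a log-determinant term, and neither bounds the loss below by zero:

$$
-\log p(x) = \tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) + \tfrac{1}{2}\log\lvert\Sigma\rvert + \tfrac{n}{2}\log(2\pi)
$$

With n = 2 per timestep the constant term is about +1.84, so by itself it does not make the loss negative; the large negative values come mainly from $\tfrac{1}{2}\log\lvert\Sigma\rvert$ once the (normalized) covariances become small.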