fredzzhang/upt

The HOI loss is NaN for rank 0

OBVIOUSDAWN opened this issue · 16 comments

Dear sir,
I followed the README to build the UPT network, but when I ran the command

python main.py --world-size 1 --dataset vcoco --data-root ./v-coco --partitions trainval test --pretrained ../detr-r50-vcoco.pth --output-dir ./upt-r50-vcoco.pt

I got the following error:

```
Traceback (most recent call last):
  File "main.py", line 208, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/upload/main.py", line 125, in main
    engine(args.epochs)
  File "/root/pocket/pocket/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/root/autodl-tmp/upload/utils.py", line 138, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 0
```

I tried training without the pretrained model and got the same error. I tried to print the loss, but it showed an empty tensor. As a beginner, I have no idea what happened; I would appreciate any help you could give.
I look forward to receiving your reply. Thank you very much.

I am training on a TITAN Xp with torch 1.9.1; I installed the package and tested that it works. The dataset is V-COCO, downloaded with the provided script. Thank you very much.

Hi @OBVIOUSDAWN,

Thanks for taking an interest in our work.

The NaN loss problem was quite a pain. I ran into it a long time ago and managed to resolve it by using larger batch sizes. The problem is that the spatial encodings have bad scales, which makes training very unstable. I see that you are training on only one GPU, so the batch size is most likely insufficient.

Here are a few things you can try (a rough sketch of 1 and 2 follows the list):

  1. For the log terms in the pairwise positional encodings, use log(1+x) instead of log(x+epsilon).
  2. Add batch norm to the spatial head that computes the pairwise positional encodings.
  3. Increase the batch size (probably the easiest option).
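
For 1 and 2, the edits would look roughly like this (a sketch only; the exact layer sizes and names in ops.py may differ from your copy):

```python
from torch import nn

# (1) In compute_spatial_encodings (ops.py), replace the log(x + eps) term:
#         features.append(torch.cat([f, torch.log(f + eps)], 1))
#     becomes
#         features.append(torch.cat([f, torch.log(1 + f)], 1))

# (2) Add batch norm between the linear layers of the spatial head
#     (layer sizes below are illustrative).
representation_size = 512  # illustrative value
spatial_head = nn.Sequential(
    nn.Linear(36, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, representation_size),
    nn.BatchNorm1d(representation_size),
    nn.ReLU(),
)
```

The point of log(1+x) is that it goes to 0 rather than a large negative number as the raw feature goes to 0, which keeps the scale of the encodings bounded.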

Hope that resolves the issue.

Cheers,
Fred.

Dear sir,
I tried the model on a new server with four 3090s and a batch size of 4, and it shows the same error on rank 3. Regarding your second suggestion, do you mean the "Pairwise Box Positional Encodings" in the paper? I found a "PositionEmbeddingSine" in /detr/model/position_encoding.py; changing its eps gives the same error, and I also tried changing the eps in "binary_focal_loss_with_logits" / "compute_spatial_encodings" in /ops.py. I printed out the whole network, but I don't know which part corresponds to the pairwise box positional encodings. I look forward to receiving your reply. Thank you very much.

...do you mean the "Pairwise Box Positional Encodings" in the paper

Yes, it is implemented in ops.py. If you are running on 4 GPUs with a batch size of 4 per GPU, you should have an effective batch size of 16, which I think is sufficiently large. Are you still getting the error?

Fred.

Yes, the effective batch size is 16 and it shows the same error. I also tried changing

`features.append(torch.cat([f, torch.log(f + eps)], 1))`

to use log(1+x) instead of log(x+epsilon) in "compute_spatial_encodings", and I got the same error. I look forward to receiving your reply. Thank you very much.

That's odd. If the batch size is 16, it should work now. Can you try some different seeds?

Fred.

Hi, @fredzzhang .
Thank you for your contribution. I am very interested in your work and want to deepen my understanding of the paper by running the code, but I can't get it to run.

I encountered the same error using the same command on a 3090:
python main.py --world-size 1 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco2
I haven't changed any code; I just downloaded the code and checkpoint as described in the README.
Then I tried to run the training command, but it failed with this error.
Could you give me some help to solve it?

Hi @leijue222,

That should be an issue related to the batch size. I trained the model on 8 GPUs with a batch size of 2 per GPU, i.e. an effective batch size of 16. Since you are training with one GPU, you need to set the batch size to 16.
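
For example, something along these lines (your command from above, with the batch size set explicitly):

python main.py --world-size 1 --batch-size 16 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco2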

Let me know if that works.

Fred.

Wow, thanks Fred! It worked!
It was indeed a batch size problem.

With bs=16, GPU memory usage went from 12 GB to 23 GB, so I'm not sure whether a single 3090 will run out of memory later in training.
By the way, how long did it take you to train on V-COCO?

Towards the end of the Model Zoo section I added some stats for 8 TITAN X GPUs; for V-COCO that comes to about 40 minutes. I don't know how long it will take a single 3090 to train, but it shouldn't be too long.

Fred.

Thanks again, I love this work.

I am getting the same error using this command on a 3090:

python main.py --world-size 1 --batch-size 16 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco

Could you help me solve the problem? Thanks.

Hi @yuchen2199,

Sometimes the training can be unstable even with a batch size of 16. If possible, increasing the batch size further should make this happen less often.

Fred.

Thanks for the prompt reply. I solved the problem by increasing the batch size. This is really interesting work.

Hi,

I am getting the "HOI loss is NaN" error when training on a different dataset. The code used to work fine earlier, but when I tried to train on images that contain only one human box and one object box, I started running into this issue.

I have tried:

  1. Setting the batch size to 16 and 32
  2. Replacing the log term:
     features.append(torch.cat([f, torch.log(f + 1)], 1))
  3. Adding batch norm:
     self.spatial_head = nn.Sequential(
         nn.Linear(36, 128),
         nn.BatchNorm1d(128),   # batch normalization after the first linear layer
         nn.ReLU(),
         nn.Linear(128, 256),
         nn.BatchNorm1d(256),   # batch normalization after the second linear layer
         nn.ReLU(),
         nn.Linear(256, representation_size),
         nn.BatchNorm1d(representation_size),   # batch normalization after the third linear layer
         nn.ReLU(),
     )

But I am still getting the error.

Do you have any suggestions on how I can solve this issue?

To give an example, I modified the output of the compute_spatial_encodings() function like this (adding 5000 so that it's easy to spot):

[screenshot: modified compute_spatial_encodings() output]

So the input to the spatial head is

[screenshot: spatial head input tensor]

the output is

[screenshot: spatial head output tensor]

and the spatial head is

[screenshot: spatial head module definition]
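
Roughly, the probe looks like this (simplified sketch; `spatial_head` here is just a stand-in for the module inside the model, and the input is random):

```python
import torch
from torch import nn

# Stand-in for the spatial head inside the model (replace with the real module).
spatial_head = nn.Sequential(
    nn.Linear(36, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
)

# Forward hook that reports the range of the head's input/output and flags NaNs.
def report(module, inputs, output):
    x = inputs[0]
    print(f"input : min={x.min().item():.4g} max={x.max().item():.4g} "
          f"nan={torch.isnan(x).any().item()}")
    print(f"output: min={output.min().item():.4g} max={output.max().item():.4g} "
          f"nan={torch.isnan(output).any().item()}")

spatial_head.register_forward_hook(report)

# One box pair per image, 36-d pairwise spatial features, as in my data.
spatial_head(torch.rand(1, 36))
```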

What would be a meaningful fix for this?
Scaling or replacing problematic values in the input, etc.?