JialianW/TraDeS

When training on the nuScenes-train dataset reaches epoch 12, the cost volume becomes NaN. What causes this?

Closed this issue · 2 comments

ZZXin commented

Hi, thanks for your awesome work! When training on the nuScenes-train dataset reaches epoch 12, the cost volume becomes NaN. What causes this? The initial learning rate is 1.25e-4, and the backbone is resdcn18.

How many GPUs are you using? BTW, we haven't tried other backbones except for the dla-34. Not sure if this is a problem.
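
One reason the GPU count can matter: the total batch size usually changes with the number of GPUs, and a learning rate tuned for a large batch can be too high for a smaller one. Below is a minimal sketch of the common linear scaling heuristic. The reference batch size of 32 for the 1.25e-4 learning rate is an assumption, not something confirmed in this thread, and `scale_lr` is a hypothetical helper, not part of TraDeS.

```python
def scale_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling heuristic: keep lr / batch_size constant."""
    if base_batch <= 0 or batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch / base_batch


# Assumed reference setting: lr 1.25e-4 at batch size 32 (an assumption).
# Scaling down to batch size 12 gives roughly 4.69e-5.
lr_for_12 = scale_lr(1.25e-4, 32, 12)
print(f"{lr_for_12:.3e}")
```

This is only a heuristic; whether it removes the NaN in this case is untested.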

ZZXin commented

> How many GPUs are you using? BTW, we haven't tried other backbones except for the dla-34. Not sure if this is a problem.

I only have 2 GPUs, so I set the batch size to 12. I don't think this is a problem with the backbone.
I am now trying a smaller training image size with the batch size set to 32, and so far there is no NaN loss.
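
To narrow down where a NaN first appears, it can help to check logged values for non-finite entries as training runs. The following is a minimal, framework-free sketch; `first_non_finite` and the sample loss values are hypothetical and not taken from TraDeS. In a PyTorch training loop, `torch.isfinite` on the loss tensor or `torch.autograd.set_detect_anomaly(True)` serves a similar purpose.

```python
import math


def first_non_finite(values):
    """Return (index, value) of the first NaN/inf entry, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i, v
    return None


# Hypothetical per-iteration loss values, for illustration only.
losses = [0.92, 0.51, float("nan"), 0.48]
hit = first_non_finite(losses)
if hit is not None:
    print(f"non-finite loss at iteration {hit[0]}")
```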