aharley/simple_bev

Training on a single GPU works normally, but the loss becomes NaN with multiple GPUs


Hi Adam,
Did you ever run into this: training on a single GPU works normally, but the loss becomes NaN with multiple GPUs?

I have not encountered this! I in fact used multi-GPU training for the models I released. Can you share more about your setup, and your train.sh?

Thank you for your reply. Here is my training command:
```
python train_nuscenes.py \
    --exp_name=${EXP_NAME} \
    --max_iters=25000 \
    --log_freq=1000 \
    --dset='trainval' \
    --batch_size=4 \
    --grad_acc=5 \
    --use_scheduler=True \
    --data_dir=$DATA_DIR \
    --log_dir='logs_nuscenes' \
    --ckpt_dir='checkpoints' \
    --res_scale=2 \
    --ncams=6 \
    --encoder_type='res50' \
    --do_rgbcompress=True \
    --device_ids=[0,1,2,3]
```
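One general-purpose way to narrow this down, independent of simple_bev's actual training loop: PyTorch's anomaly detection names the backward op that first produces a NaN gradient, and a cheap finiteness check on the loss catches the first bad step. A minimal sketch, in which `training_step` and the batch keys are hypothetical placeholders rather than simple_bev code:

```python
import torch

# Generic PyTorch aids for localizing a NaN (not simple_bev-specific code;
# `training_step` and the batch keys below are hypothetical placeholders).
torch.autograd.set_detect_anomaly(True)  # slow; enable only while debugging

def training_step(model, batch, loss_fn, optimizer):
    optimizer.zero_grad()
    pred = model(batch["images"])
    loss = loss_fn(pred, batch["targets"])
    # Catch the first bad step cheaply, before backward().
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    # With anomaly detection on, backward() raises with a traceback
    # pointing at the op that first produced a NaN gradient.
    loss.backward()
    optimizer.step()
    return loss
```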

OK. So basically I don't know the answer here, but my strategy would be to try to simplify until it works.

For a start, how about commenting out the loss terms here: https://github.com/aharley/simple_bev/blob/main/train_nuscenes.py#L203-L208

and just say `total_loss = loss_fn(seg_bev_e, seg_bev_g, valid_bev_g)`.
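Concretely, that simplification might look something like the sketch below; the two commented-out lines are placeholders with hypothetical names (`center_weight`, `center_loss`, etc.) standing in for whatever the linked lines actually compute, not the script's real code:

```python
# Keep only the main BEV segmentation loss while debugging the multi-GPU NaN.
total_loss = loss_fn(seg_bev_e, seg_bev_g, valid_bev_g)
# total_loss = total_loss + center_weight * center_loss  # placeholder name; disabled
# total_loss = total_loss + offset_weight * offset_loss  # placeholder name; disabled
```

If the NaN disappears under this simplified loss, add the remaining terms back one at a time to find the culprit.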