NVlabs/few-shot-vid2vid

fsVid2Vid--ZeroDivisionError: float division by zero

Gopalsvs opened this issue · 0 comments

Hello all,

We ran into an issue while training the fs-Vid2Vid model on a dataset similar to the YouTube Dancing dataset. We created the three additional folders (poses-openpose, pose_maps-densepose, human_instance_maps) for all 3000 sequences. During training we got a ZeroDivisionError after the model completed 5 epochs. We confirmed that the dataset does not contain any None images in the images, pose_maps-densepose, or human_instance_maps folders, and that there are no empty JSON files in poses-openpose. We used a batch size of 2 and trained on a single GPU. We also reduced the dataset to 500 sequences and trained again; the same error occurred after the 7th epoch.
Is there a fix for this error?
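
The empty-JSON check mentioned above can be sketched roughly as follows. This is only an illustration using the Python standard library, not necessarily the script that was actually run; the helper name `find_bad_json` is hypothetical, and it assumes the keypoint files end in `.json`.

```python
import json
from pathlib import Path


def find_bad_json(folder):
    """Return paths of .json files under `folder` that are empty or unparseable.

    Hypothetical helper for illustration; `folder` would be something like
    the poses-openpose directory of the dataset.
    """
    bad = []
    for path in sorted(Path(folder).rglob("*.json")):
        try:
            content = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Zero-byte files and corrupted files both end up here.
            bad.append(path)
            continue
        if not content:
            # Parsed fine, but holds an empty object or list.
            bad.append(path)
    return bad
```

Calling `find_bad_json("poses-openpose")` from the dataset root should return an empty list if every file is non-empty and parseable.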

This is the exact error we got:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 1.265e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 6.3e-322
Traceback (most recent call last):
  File "train.py", line 93, in <module>
    main()
  File "train.py", line 78, in main
    trainer.gen_update(data)
  File "/mnt/fs/imaginaire/imaginaire/trainers/vid2vid.py", line 283, in gen_update
    self.get_gen_losses(data_t, net_G_output, net_D_output)
  File "/mnt/fs/imaginaire/imaginaire/trainers/vid2vid.py", line 537, in get_gen_losses
    scaled_loss.backward()
  File "/home/ubuntu/anaconda3/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/_process_optimizer.py", line 123, in post_backward_models_are_masters
    scaler.unscale(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/scaler.py", line 117, in unscale
    1./scale)
ZeroDivisionError: float division by zero
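
Reading the log above, each loss scale appears to be halved on every overflow (for example 1e-323, then 5e-324, then 0.0), so the scale eventually underflows to exactly 0.0 and `1./scale` raises the exception. The sketch below reproduces only that arithmetic in plain Python. It does not use apex; the helper name `halvings_until_zero` is hypothetical, and the starting scale of 2**16 is an assumption rather than something shown in the log.

```python
def halvings_until_zero(scale, max_steps=5000):
    """Halve `scale` until it underflows to 0.0; return (steps, final scale).

    Imitates a dynamic loss scaler that halves its scale on every overflow
    and has no lower bound. Hypothetical helper, not apex code.
    """
    steps = 0
    while scale != 0.0 and steps < max_steps:
        scale /= 2.0
        steps += 1
    return steps, scale


# Assumed initial scale; the real value depends on the amp configuration.
steps, final_scale = halvings_until_zero(2.0 ** 16)
print("halvings before the scale hits zero:", steps)

try:
    1.0 / final_scale  # same operation as `1./scale` in the traceback
except ZeroDivisionError as err:
    print("ZeroDivisionError:", err)
```

If this reading is right, the gradients overflowed on every step for a long stretch before the crash, which would suggest the loss had already become Inf or NaN much earlier in training.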

Thanks for your time.