quanpn90/NMTGMinor

uncorrectable NVLink error

Hi Quan,
I am training my model with train.py (not train_distribute.py) on a single GPU (A100).
My environment is PyTorch 1.8 with CUDA 11.1.
I ran into the following error. Could you take a look?
(I also tested with PyTorch 1.9, and the crash did not occur there.)

Epoch 1, 1/15756; ; ppl: 32479.69 ; lr: 0.0883883 ; updates: 0; 890 src tok/s; 1200 tgt tok/s; 0:00:08 elapsed
Epoch 1, 100/15756; ; ppl: 29296.45 ; lr: 0.0000084 ; updates: 25; 19821 src tok/s; 36381 tgt tok/s; 0:00:37 elapsed
Epoch 1, 200/15756; ; ppl: 14829.10 ; lr: 0.0000169 ; updates: 50; 40044 src tok/s; 67334 tgt tok/s; 0:00:52 elapsed
Epoch 1, 300/15756; ; ppl: 8825.84 ; lr: 0.0000253 ; updates: 75; 42289 src tok/s; 74924 tgt tok/s; 0:01:06 elapsed
Epoch 1, 400/15756; ; ppl: 5181.33 ; lr: 0.0000337 ; updates: 100; 43419 src tok/s; 76316 tgt tok/s; 0:01:20 elapsed
Epoch 1, 500/15756; ; ppl: 2660.08 ; lr: 0.0000421 ; updates: 125; 42570 src tok/s; 74262 tgt tok/s; 0:01:34 elapsed
Epoch 1, 600/15756; ; ppl: 1374.20 ; lr: 0.0000506 ; updates: 150; 44735 src tok/s; 72865 tgt tok/s; 0:01:48 elapsed
Epoch 1, 700/15756; ; ppl: 800.48 ; lr: 0.0000590 ; updates: 175; 41216 src tok/s; 68691 tgt tok/s; 0:02:03 elapsed
Epoch 1, 800/15756; ; ppl: 562.11 ; lr: 0.0000674 ; updates: 200; 26377 src tok/s; 45662 tgt tok/s; 0:02:26 elapsed
Epoch 1, 900/15756; ; ppl: 433.20 ; lr: 0.0000759 ; updates: 225; 44214 src tok/s; 73822 tgt tok/s; 0:02:40 elapsed
Epoch 1, 1000/15756; ; ppl: 372.43 ; lr: 0.0000843 ; updates: 250; 44327 src tok/s; 71238 tgt tok/s; 0:02:54 elapsed
Epoch 1, 1100/15756; ; ppl: 304.53 ; lr: 0.0000927 ; updates: 275; 43036 src tok/s; 73235 tgt tok/s; 0:03:09 elapsed
Epoch 1, 1200/15756; ; ppl: 269.44 ; lr: 0.0001012 ; updates: 300; 42891 src tok/s; 74721 tgt tok/s; 0:03:22 elapsed
Epoch 1, 1300/15756; ; ppl: 228.26 ; lr: 0.0001096 ; updates: 325; 42406 src tok/s; 74165 tgt tok/s; 0:03:37 elapsed
Epoch 1, 1400/15756; ; ppl: 210.37 ; lr: 0.0001180 ; updates: 350; 43133 src tok/s; 73203 tgt tok/s; 0:03:51 elapsed
Epoch 1, 1500/15756; ; ppl: 186.64 ; lr: 0.0001264 ; updates: 375; 43091 src tok/s; 73222 tgt tok/s; 0:04:05 elapsed
Epoch 1, 1600/15756; ; ppl: 171.28 ; lr: 0.0001349 ; updates: 400; 42041 src tok/s; 74044 tgt tok/s; 0:04:19 elapsed
Epoch 1, 1700/15756; ; ppl: 156.90 ; lr: 0.0001433 ; updates: 425; 41715 src tok/s; 74564 tgt tok/s; 0:04:33 elapsed
Epoch 1, 1800/15756; ; ppl: 155.14 ; lr: 0.0001517 ; updates: 450; 43612 src tok/s; 73172 tgt tok/s; 0:04:47 elapsed
Epoch 1, 1900/15756; ; ppl: 139.19 ; lr: 0.0001602 ; updates: 475; 40948 src tok/s; 75693 tgt tok/s; 0:05:02 elapsed
Epoch 1, 2000/15756; ; ppl: 142.13 ; lr: 0.0001686 ; updates: 500; 45241 src tok/s; 72747 tgt tok/s; 0:05:16 elapsed
Epoch 1, 2100/15756; ; ppl: 128.59 ; lr: 0.0001770 ; updates: 525; 43643 src tok/s; 73672 tgt tok/s; 0:05:31 elapsed
Epoch 1, 2200/15756; ; ppl: 122.95 ; lr: 0.0001854 ; updates: 550; 42637 src tok/s; 73098 tgt tok/s; 0:05:45 elapsed
Traceback (most recent call last):
  File "/mypath/nmtgminor/train.py", line 495, in <module>
    main()
  File "/mypath/nmtgminor/train.py", line 491, in main
    run_process(0, train_data, valid_data, dicts, opt, checkpoint)
  File "/mypath/nmtgminor/train.py", line 64, in run_process
    trainer.run(checkpoint=checkpoint)
  File "/mypath/nmtgminor/onmt/train_utils/mp_trainer.py", line 992, in run
    train_loss = self.train_epoch(epoch, resume=resume, itr_progress=itr_progress)
  File "/mypath/nmtgminor/onmt/train_utils/mp_trainer.py", line 681, in train_epoch
    batch = prepare_sample(samples, device=self.device)
  File "/mypath/nmtgminor/onmt/train_utils/mp_trainer.py", line 53, in prepare_sample
    batch.cuda(fp16=False, device=device)
  File "/mypath/nmtgminor/onmt/data/dataset.py", line 238, in cuda
    self.tensors[key] = self.tensors[key].cuda(device=device)
RuntimeError: CUDA error: uncorrectable NVLink error detected during the execution
terminate called without an active exception
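
Since CUDA errors are reported asynchronously, the `.cuda()` call in the traceback may not be where the failure actually happened. Below is a minimal sketch of how the environment could be captured and the failing call narrowed down when reproducing this. The helper name `collect_env_report` is hypothetical and not part of NMTGMinor. The sketch assumes `CUDA_LAUNCH_BLOCKING` is set before CUDA is initialised, and it has not been verified that this changes the behaviour of this particular crash.

```python
import os
import platform

# Assumption: this must run before CUDA is initialised. With synchronous
# kernel launches, the Python traceback is more likely to point at the
# call that actually failed rather than a later one.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def collect_env_report():
    """Hypothetical helper: describe the Python / PyTorch / CUDA setup."""
    report = {
        "python": platform.python_version(),
        "cuda_launch_blocking": os.environ.get("CUDA_LAUNCH_BLOCKING"),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report that and stop.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running this in both the PyTorch 1.8 and PyTorch 1.9 environments would make the two setups easier to compare side by side.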