chrischoy/FCGF

merge_sort: failed to synchronize during training

gitouni opened this issue · 0 comments

Thank you for sharing this great work.
I encountered a CUDA merge_sort error while training on 3DMatch.

Environment:

Ubuntu 20.04
CUDA 11.1
Python 3.8
MinkowskiEngine v0.5.3

Command:

python train.py --voxel_size 0.05 --threed_match_dir threedmatch

Output:

INFO - 2022-09-06 10:57:42,406 - data_loaders - Resetting the data loader seed to 0
INFO - 2022-09-06 10:57:54,158 - trainer - Validation iter 101 / 400 : Data Loading Time: 0.054, Feature Extraction Time: 0.023, Matching Time: 0.035, Loss: 0.578, RTE: 1.279, RRE: 0.498, Hit Ratio: 0.071, Feat Match Ratio: 0.495
INFO - 2022-09-06 10:58:05,445 - trainer - Validation iter 201 / 400 : Data Loading Time: 0.052, Feature Extraction Time: 0.023, Matching Time: 0.034, Loss: 0.576, RTE: 1.186, RRE: 0.488, Hit Ratio: 0.067, Feat Match Ratio: 0.478
INFO - 2022-09-06 10:58:17,300 - trainer - Validation iter 301 / 400 : Data Loading Time: 0.056, Feature Extraction Time: 0.022, Matching Time: 0.035, Loss: 0.556, RTE: 1.133, RRE: 0.471, Hit Ratio: 0.073, Feat Match Ratio: 0.502
INFO - 2022-09-06 10:58:28,882 - trainer - Final Loss: 0.554, RTE: 1.140, RRE: 0.458, Hit Ratio: 0.072, Feat Match Ratio: 0.490
Traceback (most recent call last):
  File "train.py", line 81, in <module>
    main(config)
  File "train.py", line 57, in main
    trainer.train()
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 130, in train 
    self._train_epoch(epoch)
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 495, in _train_epoch
    loss.backward()
  File "/home/bit/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/bit/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward( 
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
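
Note: CUDA reports errors asynchronously, so the line in the traceback above may not be the real fault site. Re-running with blocking kernel launches (the standard CUDA_LAUNCH_BLOCKING environment variable) should localize the failure more precisely:

CUDA_LAUNCH_BLOCKING=1 python train.py --voxel_size 0.05 --threed_match_dir threedmatch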

With torch.autograd.detect_anomaly() enabled, the forward trace is:

[W python_anomaly_mode.cpp:104] Warning: Error detected in IndexBackward. Traceback of forward call that caused the error:
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/train.py", line 78, in <module>
    main(config)
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/train.py", line 55, in main
    trainer.train()
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 130, in train
    self._train_epoch(epoch)
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 485, in _train_epoch
    pos_loss, neg_loss = self.contrastive_hardest_negative_loss(
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 447, in contrastive_hardest_negative_loss
    neg_loss1 = F.relu(self.neg_thresh - D10min[mask1]).pow(2)
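
For reference, the anomaly trace above comes from wrapping the training call in anomaly detection, roughly as in this minimal sketch (the exact placement inside train.py is my own):

import torch

# Anomaly detection makes a backward-pass error report the forward op
# that produced it; it slows training noticeably, so use for debugging only.
with torch.autograd.detect_anomaly():
    trainer.train()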

I have already switched to the v0.5 branch (for MinkowskiEngine 0.5 compatibility), and this error really confuses me.
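
In case it helps with debugging, here is a quick sanity check that could be placed just before the failing line in contrastive_hardest_negative_loss. D10min and mask1 are the tensors named in the trace; the checks themselves are hypothetical additions and assume mask1 is a boolean mask over a 1-D D10min:

import torch

# Hypothetical checks before:
#   neg_loss1 = F.relu(self.neg_thresh - D10min[mask1]).pow(2)
# An invalid mask (wrong dtype or shape) can surface later as an
# illegal memory access inside IndexBackward.
assert mask1.dtype == torch.bool, f"unexpected mask dtype: {mask1.dtype}"
assert mask1.shape == D10min.shape, f"shape mismatch: {mask1.shape} vs {D10min.shape}"
assert torch.isfinite(D10min).all(), "non-finite values in D10min before indexing"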