microsoft/VideoX

[X-CLIP] CUDA device-side assert triggered at runtime

zhengzehong331 opened this issue · 1 comment

Thank you for your great work! I ran into this problem when training the model on the HMDB51 dataset:

[2023-09-17 14:18:47 ViT-B/16](main.py 181): INFO Train: [0/50][0/3383]	eta 0:49:54 lr 0.000000000	time 0.8851 (0.8851)	tot_loss 2.6029 (2.6029)	mem 8942MB
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 278, in <module>
    main(config)
  File "main.py", line 104, in main
    train_one_epoch(epoch, model, criterion, optimizer, lr_scheduler, train_loader, text_labels, config, mixup_fn)
  File "main.py", line 144, in train_one_epoch
    images, label_id = mixup_fn(images, label_id)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 57, in __call__
    **kwargs)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 214, in do_blending
    return self.do_mixup(imgs, label)
  File "/root/autodl-tmp/VideoX/X-CLIP/datasets/blending.py", line 202, in do_mixup
    mixed_imgs = lam * imgs + (1 - lam) * imgs[rand_index, :]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdb60c737d2 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x267df7a (0x7fdbb3c92f7a in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0x301898 (0x7fdc1608c898 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: c10::TensorImpl::release_resources() + 0x175 (0x7fdb60c5c005 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1edf69 (0x7fdc15f78f69 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4e5818 (0x7fdc16270818 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x299 (0x7fdc16270b19 in /root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: /root/miniconda3/envs/xclip/bin/python() [0x4a0a87]
frame #8: /root/miniconda3/envs/xclip/bin/python() [0x4b0858]
frame #9: /root/miniconda3/envs/xclip/bin/python() [0x4c5b50]
frame #10: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #11: /root/miniconda3/envs/xclip/bin/python() [0x4c5b66]
frame #12: /root/miniconda3/envs/xclip/bin/python() [0x4946f7]
frame #13: PyDict_SetItemString + 0x61 (0x499261 in /root/miniconda3/envs/xclip/bin/python)
frame #14: PyImport_Cleanup + 0x89 (0x56f719 in /root/miniconda3/envs/xclip/bin/python)
frame #15: Py_FinalizeEx + 0x67 (0x56b1a7 in /root/miniconda3/envs/xclip/bin/python)
frame #16: /root/miniconda3/envs/xclip/bin/python() [0x53fc79]
frame #17: _Py_UnixMain + 0x3c (0x53fb3c in /root/miniconda3/envs/xclip/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fdc1897d083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: /root/miniconda3/envs/xclip/bin/python() [0x53f9ee]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 15424) of binary: /root/miniconda3/envs/xclip/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/xclip/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
main.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-17_14:18:51
  host      : autodl-container-7850119152-163467d4
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 15424)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 15424
======================================================
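
From the stack trace, my guess is that the real failure is the one-hot conversion of the labels inside datasets/blending.py, and that the reported line in do_mixup may be off, since CUDA errors are reported asynchronously. The assert fires when a label id is >= num_classes. A minimal standalone snippet (my own sketch, not code from the repo) that reproduces the same ScatterGatherKernel assert with an out-of-range label id:

```python
import torch

# My own sketch, not code from the repo: HMDB51 has 51 classes, so valid
# label ids are 0..50. An id of 51 (e.g. from 1-indexed labels, or from a
# config still set to another dataset's class count) indexes out of bounds
# in the one-hot scatter and fires the same device-side assert.
num_classes = 51
label = torch.tensor([51], device="cuda")        # out of range on purpose

one_hot = torch.zeros(1, num_classes, device="cuda")
one_hot.scatter_(1, label.unsqueeze(1), 1.0)     # ScatterGatherKernel assert
torch.cuda.synchronize()                         # async error surfaces here
```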

I am very confused; I have tried many approaches but couldn't solve it.
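
In case it helps anyone debugging the same thing: a small sanity check run just before the mixup call would confirm whether any label id exceeds the configured class count (check_label_range is a hypothetical helper of mine, and config.DATA.NUM_CLASSES is my assumption about the config key):

```python
import torch

def check_label_range(label_id: torch.Tensor, num_classes: int) -> None:
    # Hypothetical helper: fail with a readable Python-side error instead
    # of a device-side assert if any label id falls outside [0, num_classes).
    lo, hi = int(label_id.min()), int(label_id.max())
    if lo < 0 or hi >= num_classes:
        raise ValueError(
            f"label ids span [{lo}, {hi}] but num_classes is {num_classes}"
        )

# e.g. in train_one_epoch, right before mixup_fn(images, label_id):
#   check_label_range(label_id, config.DATA.NUM_CLASSES)  # config key assumed
```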

My GPU is a single NVIDIA GeForce RTX 2080 Ti.