BlinkDL/RWKV-LM

NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered

ZetangForward opened this issue · 1 comment

Hi, I just want to train a small version of the RWKV-V5 169M model from scratch.
I implemented it with Hugging Face Transformers:

import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RWKV/rwkv-4-169m-pile")
config = AutoConfig.from_pretrained("RWKV/rwkv-4-169m-pile")

tiny_rwkv_configs = {
    "num_hidden_layers": 4,
    "hidden_size": 256,
    "intermediate_size": 1024,
    "attention_hidden_size": 256,
    "vocab_size": 20480,
}

# apply the tiny settings to the config,
# e.g. config.num_hidden_layers = tiny_rwkv_configs["num_hidden_layers"]
for key, value in tiny_rwkv_configs.items():
    setattr(config, key, value)

model = AutoModelForCausalLM.from_config(config)

# initialize dataloader, optimizer, etc.
for sample in dataloader:
    outputs = model(**sample)  # sample holds input_ids, attention_mask, labels
    loss = outputs.loss
    loss.backward()
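(For outputs.loss to actually be populated, each sample has to include labels; for causal LM training I pass labels identical to input_ids and let the model shift them internally.)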

But when I call loss.backward(), I hit this error:

You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

Sanity Checking: |                                                   | 0/? [00:00<?, ?it/s]/nvme1/zecheng/modelzipper/projects/state-space-model/custom_dataset/AR_ywj.py:116: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  attention_mask = torch.tensor(attention_mask, dtype=torch.long)
Sanity Checking DataLoader 0:   0%|                                  | 0/1 [00:00<?, ?it/s]
[the same UserWarning repeats several more times]
Epoch 0:   0%| | 3/1398 [00:00<02:55,  7.95it/s, v_num=tzc, train_lm_loss=nan.0, train_ppl=[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7faa88159617 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7faa8811498d in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7faa88215128 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7faa8914b250 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7faa8914f078 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7faa89165910 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7faa89165c18 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xc819d (0x7faacd94619d in /home/amax/anaconda3/envs/zecheng/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fab09939609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fab0985e353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[stack trace identical to the one above]

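The watchdog trace above only shows where the failure was noticed, not the op that faulted. To localize it, a standard PyTorch debugging step is to force synchronous kernel launches before any CUDA work happens, so the Python traceback points at the actual failing op:

import os

# must be set before torch initializes a CUDA context
# (i.e. before the model is built or moved to the GPU)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"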
Worth noting that I train the model from scratch, with only 4 RWKV layers and the custom settings above, and the loss becomes nan.
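One guess on my side (not verified): the rwkv-4-169m-pile tokenizer has a vocabulary of 50,277 tokens, while the custom config shrinks vocab_size to 20480, so any token id >= 20480 indexes out of bounds in the embedding table; on CUDA an out-of-range embedding lookup surfaces exactly as an illegal memory access (or a device-side assert). A quick sanity check, assuming each sample is a dict with an input_ids tensor as in the loop above:

# check that no token id exceeds the shrunken vocab
for sample in dataloader:
    n_bad = (sample["input_ids"] >= config.vocab_size).sum().item()
    if n_bad:
        raise ValueError(f"{n_bad} token ids >= vocab_size={config.vocab_size}")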

Has anyone else encountered this issue?

Hi, it seems you still need to use the RWKV-LM repo to train it.