RuntimeError: CUDA error: device-side assert triggered
PotatoThanh opened this issue · 8 comments
Hi Yoshitomo,
My machine has 2 TitanV + Torch 1.7.1 + Cuda11.0 + TorchVision 0.8.2
I ran python examples/image_classification.py --config configs/sample/ilsvrc2012/single_stage/kd/alexnet_from_resnet152.yaml --log log/ilsvrc2012/kd/alexnet_from_resnet152.txt
Then I got the error RuntimeError: CUDA error: device-side assert triggered:
2021/02/24 13:53:53 INFO torchdistill.common.main_util Not using distributed mode
2021/02/24 13:53:53 INFO main Namespace(adjust_lr=False, config='configs/sample/ilsvrc2012/single_stage/kd/alexnet_from_resnet152.yaml', device='cuda', dist_url='env://', log='log/ilsvrc2012/kd/alexnet_from_resnet152.txt', start_epoch=0, student_only=False, sync_bn=False, test_only=False, world_size=1)
2021/02/24 13:53:53 INFO torchdistill.datasets.util Loading train data
2021/02/24 13:53:58 INFO torchdistill.datasets.util dataset_id ilsvrc2012/train: 4.215242624282837 sec
2021/02/24 13:53:58 INFO torchdistill.datasets.util Loading val data
2021/02/24 13:53:58 INFO torchdistill.datasets.util dataset_id ilsvrc2012/val: 0.18817710876464844 sec
2021/02/24 13:53:59 INFO torchdistill.common.main_util ckpt file is not found at ./resource/ckpt/ilsvrc2012/teacher/ilsvrc2012-resnet152.pt
2021/02/24 13:54:02 INFO torchdistill.common.main_util ckpt file is not found at ./resource/ckpt/ilsvrc2012/single_stage/kd/ilsvrc2012-alexnet_from_resnet152.pt
2021/02/24 13:54:02 INFO main Start training
2021/02/24 13:54:02 INFO torchdistill.models.util [teacher model]
2021/02/24 13:54:02 INFO torchdistill.models.util Using the original teacher model
2021/02/24 13:54:02 INFO torchdistill.models.util [student model]
2021/02/24 13:54:02 INFO torchdistill.models.util Using the original student model
2021/02/24 13:54:02 INFO torchdistill.core.distillation Loss = 1.0 * OrgLoss
2021/02/24 13:54:02 INFO torchdistill.core.distillation Freezing the whole teacher model
2021/02/24 13:54:06 INFO torchdistill.misc.log Epoch: [0] [ 0/40037] eta: 1 day, 21:05:22 lr: 0.0001 img/s: 11.305412092278724 loss: 7.0715 (7.0715) time: 4.0543 data: 1.2238 max mem: 2885
/opt/conda/conda-bld/pytorch_1607370172916/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "examples/image_classification.py", line 181, in <module>
    main(argparser.parse_args())
  File "examples/image_classification.py", line 163, in main
    train(teacher_model, student_model, dataset_dict, ckpt_file_path, device, device_ids, distributed, config, args)
  File "examples/image_classification.py", line 123, in train
    train_one_epoch(training_box, device, epoch, log_freq)
  File "examples/image_classification.py", line 66, in train_one_epoch
    metric_logger.update(loss=loss.item(), lr=training_box.optimizer.param_groups[0]['lr'])
RuntimeError: CUDA error: device-side assert triggered
Hi @PotatoThanh
I just tried to reproduce the error, but examples/image_classification.py is running well on multiple GPUs with configs/sample/ilsvrc2012/single_stage/kd/alexnet_from_resnet152.yaml so far.
Could you provide 1) OS info, 2) Python ver., and 3) torchdistill ver. as well?
Also, if you have made any changes to the code and/or the YAML config file, please share them here too.
Thank you
I am using Ubuntu=20.04, TorchDistill=0.1.4, NvidiaDriver=450.102.04, Torch=1.7.1, Cuda=11.0, TorchVision=0.8.2.
I did not modify your code or the YAML files. I am trying to reproduce your results on ImageNet.
Thank you!
@PotatoThanh
And which Python version are you using? Your environment is more or less the same as mine, so it should be fine as long as you're using Python 3.6 - 3.8 and following these instructions for the ImageNet dataset.
Besides, if you'd like to reproduce the results reported in my paper, please follow the instructions under configs/official/. As noted here, the config files under configs/sample/ are not tuned and are used mostly for debugging purposes.
Thank you @yoshitomo-matsubara,
Yes, I am using Python 3.8.5, and I followed your instructions for ImageNet. I ran the YAML file in configs/sample/. Let me try configs/official/ and see.
Thank you for providing the info @PotatoThanh
I'm assuming you're using the latest version in this repo (currently 627abd5) for image_classification.py.
If you still face the same error, please make sure that your ImageNet folder contains only 1,000 subfolders, as the following error message implies that targets sometimes contains at least one class index outside the range 0 - 999:
/opt/conda/conda-bld/pytorch_1607370172916/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
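A quick way to check this without the opaque device-side assert is to count the class folders directly. A rough sketch (not part of torchdistill; the path assumes the sample config above):

import os

# torchvision's ImageFolder derives class indices from the sorted subfolder
# names, so any extra folder yields a target index that is out of range
# for a 1000-class model.
train_dir = './resource/dataset/ilsvrc2012/train'  # path from the sample config
class_dirs = sorted(d for d in os.listdir(train_dir)
                    if os.path.isdir(os.path.join(train_dir, d)))
print(f'{len(class_dirs)} class folders found (expected 1000)')
# ImageNet synset folders all start with 'n' (e.g. n01440764)
print('non-synset folders:', [d for d in class_dirs if not d.startswith('n')])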
Thank you @yoshitomo-matsubara,
I found the problem. I preprocessed ImageNet using:
mkdir ./resource/dataset/ilsvrc2012/{train,val} -p
mv ILSVRC2012_img_train.tar ./resource/dataset/ilsvrc2012/train/
cd ./resource/dataset/ilsvrc2012/train/
tar -xvf ILSVRC2012_img_train.tar
for f in *.tar; do
d=$(basename "$f" .tar)
mkdir $d
(cd $d && tar xf ../$f)
done
rm -r *.tar
There was a folder named ILSVRC2012_img_train under ./resource/dataset/ilsvrc2012/train/. Therefore, when the code loads the data, it raises an error.
@PotatoThanh
That makes sense. I probably forgot to add a couple of commands to rename and delete files or something.
Thank you for pointing it out! I'll update the instructions.
Actually, I double-checked on Ubuntu 18 that tar -xvf ILSVRC2012_img_train.tar does not produce a folder ILSVRC2012_img_train/, but I found some typos in the commands for processing the validation dataset instead. So the initial commands should be fine for the training dataset.