Sample for 'Getting started with PyTorch' doesn't work (GPU variant)
lars-at-styx opened this issue · 3 comments
Describe the bug
I have been following this tutorial for getting started with PyTorch on the AI Platform: https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu_1
I've tried to follow the instructions to the letter, and I've used the provided sample code. Yet I still run into errors. I can create a job without problems, but the job ends up failing with the following error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
I'm relatively new to PyTorch and completely new to GCP, so I have no idea how I would go about fixing this and any help would be much appreciated.
What sample is this bug related to?
The sample in this repository at training/pytorch/structured/python_package
Source code / logs
The source code is unchanged from the sample in this repository. The relevant logs are these:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 123, in <module>
main()
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 119, in main
experiment.run(args)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 132, in run
train(sequential_model, train_loader, criterion, optimizer, epoch)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 37, in train
for batch_index, data in enumerate(train_loader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 347, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 387, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: CUDA error: no kernel image is available for execution on the device
To Reproduce
Follow the steps at:
https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu
Then go to the 'AI Platform' section of the Google Cloud Console and click 'Jobs'. You should find a job there named getting_started_pytorch_gpu. It is initially pending, and after a few minutes its status changes to 'Failed'.
Expected behavior
That the job would succeed and the result would be saved to a bucket.
System Information
- OS Platform and Distribution: Ubuntu 20.04 LTS locally, but the error occurs inside the gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 container.
- Framework and version: PyTorch, as provided by the gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 container.
- Python version: Unknown; I'm using whatever the gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 container provides.
- Exact command to reproduce: See https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu_1
Hi, sorry for the inconvenience. We were also able to reproduce the error with the PyTorch GPU 1.6 image, and are looking into it on our end.
In the meantime, a workaround for the issue would be to use a different GPU. The sample submits a job using the BASIC_GPU scale tier, which uses a K80 GPU. We found that the issue didn't occur with other GPU options. You can use a command like the following to submit a job with a P100 GPU, which should work:
gcloud ai-platform jobs submit training ${JOB_NAME} \
--region=us-central1 \
--master-image-uri=gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 \
--scale-tier=CUSTOM \
--master-machine-type=n1-standard-8 \
--master-accelerator=type=nvidia-tesla-p100,count=1 \
--job-dir=${JOB_DIR} \
--package-path=./trainer \
--module-name=trainer.task \
-- \
--train-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_train.csv \
--eval-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_eval.csv \
--num-epochs=10 \
--batch-size=100 \
--learning-rate=0.001
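After resubmitting, you can check the job's state and follow its logs from the command line. These are standard gcloud commands; nothing here is specific to this sample:
# Show the job's current state (e.g. PREPARING, RUNNING, SUCCEEDED, FAILED).
gcloud ai-platform jobs describe ${JOB_NAME}
# Stream the job's logs to your terminal while it runs.
gcloud ai-platform jobs stream-logs ${JOB_NAME}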
Please let us know if you are still seeing this issue on the latest PyTorch images. PyTorch 1.10 is the latest version currently available (see the release notes).
We have verified that our more recent PyTorch images support K80 GPUs, so this issue can be resolved.
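If you would like to re-verify on a K80 yourself, a sketch of one way to do it is to resubmit the same job as the workaround command above, keeping the tutorial's BASIC_GPU scale tier and changing only the container image. ${NEW_IMAGE_URI} is a placeholder; the exact tag of the newer pre-built PyTorch GPU containers isn't confirmed in this thread, so take it from the AI Platform containers documentation.
# Identical arguments to the workaround command above, except for the scale
# tier (back to BASIC_GPU, i.e. a K80) and the container image.
# ${NEW_IMAGE_URI} is a placeholder for a newer pre-built PyTorch GPU image.
gcloud ai-platform jobs submit training ${JOB_NAME} \
--region=us-central1 \
--master-image-uri=${NEW_IMAGE_URI} \
--scale-tier=BASIC_GPU \
--job-dir=${JOB_DIR} \
--package-path=./trainer \
--module-name=trainer.task \
-- \
--train-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_train.csv \
--eval-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_eval.csv \
--num-epochs=10 \
--batch-size=100 \
--learning-rate=0.001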