GoogleCloudPlatform/ai-platform-samples

Sample for 'Getting started with PyTorch' doesn't work (GPU variant)

lars-at-styx opened this issue · 3 comments

Describe the bug

I have been following this tutorial for getting started with PyTorch on AI Platform: https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu_1

I've tried to follow the instructions to the letter, and I've used the provided sample code. Yet I still run into errors. I can create a job without problems, but the job ends up failing with the following error:

RuntimeError: CUDA error: no kernel image is available for execution on the device

I'm relatively new to PyTorch and completely new to GCP, so I have no idea how I would go about fixing this and any help would be much appreciated.

What sample is this bug related to?

The sample in this repository at training/pytorch/structured/python_package

Source code / logs

The source code is unchanged from the sample in this repository. The relevant logs are these:

The replica master 0 exited with a non-zero status of 1. 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 123, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 119, in main
    experiment.run(args)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 132, in run
    train(sequential_model, train_loader, criterion, optimizer, epoch)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 37, in train
    for batch_index, data in enumerate(train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 347, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 387, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: CUDA error: no kernel image is available for execution on the device
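
In case it helps with debugging, here is a quick check that could be run inside the same container (a minimal sketch using standard torch.cuda introspection, not part of the sample). It prints the GPU's compute capability next to the CUDA architectures the installed PyTorch build ships kernels for, which is usually what a "no kernel image" error comes down to:

import torch

print("PyTorch:", torch.__version__, "| built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
    # get_arch_list() only exists on newer PyTorch releases; older builds may lack it.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is not None:
        print("Kernels compiled for:", get_arch_list())  # e.g. ['sm_60', 'sm_70', ...]
    else:
        print("torch.cuda.get_arch_list() is not available in this build")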

To Reproduce

Follow the steps at:

https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu

Then go to the 'AI Platform' section of the Google Cloud Console and click 'Jobs'; you should find a job named getting_started_pytorch_gpu. It is initially pending, and after a few minutes its status changes to 'Failed'.

Expected behavior
That the job would succeed and the result would be saved to a bucket.

System Information

  • OS Platform and Distribution: Ubuntu 20.04 LTS locally, but the error occurs inside the gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 container.
  • Framework and version: PyTorch 1.6, as provided by the gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 container.
  • Python version: Unknown; whatever the gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 container provides (the traceback shows /opt/conda/lib/python3.7).
  • Exact command to reproduce: See https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu_1

Hi, sorry for the inconvenience. We were also able to reproduce the error with the PyTorch GPU 1.6 image, and are looking into it on our end.

In the meantime, a workaround is to use a different GPU. The sample submits the job with the BASIC_GPU scale tier, which uses a K80 GPU; we found that the issue does not occur with other GPU options. You can use a command like the following to submit the job with a P100 GPU instead, which should work:

gcloud ai-platform jobs submit training ${JOB_NAME} \
  --region=us-central1 \
  --master-image-uri=gcr.io/cloud-ml-public/training/pytorch-gpu.1-6 \
  --scale-tier=CUSTOM \
  --master-machine-type=n1-standard-8 \
  --master-accelerator=type=nvidia-tesla-p100,count=1 \
  --job-dir=${JOB_DIR} \
  --package-path=./trainer \
  --module-name=trainer.task \
  -- \
  --train-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_train.csv \
  --eval-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_eval.csv \
  --num-epochs=10 \
  --batch-size=100 \
  --learning-rate=0.001
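
Note that this command assumes ${JOB_NAME} and ${JOB_DIR} have already been set as described in the tutorial (JOB_DIR should point to a gs:// path in your Cloud Storage bucket), and that it is run from the directory containing the trainer/ package referenced by --package-path.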

Please let us know if you are still seeing this issue on the latest PyTorch images. PyTorch 1.10 is currently the latest available version (see its release notes).

We have verified that our more recent PyTorch images support K80 GPUs, so we are closing this issue as resolved.