Unit 5 [HANDS-ON BUG]: Training won’t run on A100 GPU

Question

rlanday opened this issue a year ago · 3 comments

I have Colab Pro, so I’m trying to do the hands-on for Unit 5 on an A100 GPU, but I’m running into this error when I try to run mlagents-train:

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:145: UserWarning: NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
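For anyone who wants to check a runtime before launching training, here is a minimal sketch of the kind of comparison the warning describes. It assumes, as a simplification, that a GPU is usable when the installed build includes at least one architecture with the same major version. The helper names `is_architecture_supported` and `check_current_gpu` are hypothetical and not part of PyTorch or ML-Agents; the architecture list is copied from the warning above.

```python
def is_architecture_supported(capability, arch_list):
    """Simplified check: is any compiled 'sm_XY' arch in the same major family?

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as 'sm_70', as listed in the warning.
    """
    major, _minor = capability
    compiled = [int(arch.split("_")[1]) for arch in arch_list if arch.startswith("sm_")]
    return any(sm // 10 == major for sm in compiled)


def check_current_gpu():
    """Run the same check against the live runtime (requires torch and a GPU)."""
    import torch  # imported lazily so the helper above works without torch

    if not torch.cuda.is_available():
        return None
    capability = torch.cuda.get_device_capability(0)
    return is_architecture_supported(capability, torch.cuda.get_arch_list())


# Architectures reported by the torch-1.11.0 build in the warning.
BUILD_ARCHS = ["sm_37", "sm_50", "sm_60", "sm_70"]

print(is_architecture_supported((8, 0), BUILD_ARCHS))  # A100 (sm_80): False
print(is_architecture_supported((7, 0), BUILD_ARCHS))  # V100 (sm_70): True
```

Calling `check_current_gpu()` in a Colab cell before `mlagents-train` should indicate whether the selected GPU is likely to hit this warning.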

It seems the version of PyTorch used by the lab (torch-1.11.0) is out of date. I tried upgrading to the latest PyTorch and ran into the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. mlagents 0.31.0.dev0 requires torch<=1.11.0,>=1.8.0, but you have torch 2.0.1+cu118 which is incompatible.
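To illustrate why pip flags the conflict, here is a rough sketch of the range check implied by `torch<=1.11.0,>=1.8.0`. It only handles plain numeric release versions and simply drops the local `+cu118` tag; it is not pip's actual resolver logic, and the helper names `parse_release` and `satisfies_range` are hypothetical.

```python
def parse_release(version):
    """Turn '2.0.1+cu118' into (2, 0, 1), dropping the local '+cu118' tag."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split("."))


def satisfies_range(version, lower, upper):
    """True if lower <= version <= upper, comparing numeric release tuples."""
    return parse_release(lower) <= parse_release(version) <= parse_release(upper)


# Bounds taken from the mlagents 0.31.0.dev0 requirement in the error message.
print(satisfies_range("2.0.1+cu118", "1.8.0", "1.11.0"))  # False
print(satisfies_range("1.11.0", "1.8.0", "1.11.0"))       # True
```

The upgraded torch 2.0.1 falls above the upper bound, which is why the two packages cannot both be satisfied in this environment.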

I’m not sure if you can do anything about this at the moment or if ML-Agents needs to be updated first. I will pick one of the other GPUs and try again. If the bug can’t be fixed at the moment, I suggest adding a note to the top of the Colab about which GPUs are supported.

Link to my colab:
https://colab.research.google.com/drive/1tBgTYLTtswy2XQJycNg4HgW92eE2eD0L?usp=sharing

Material

Did you use Google Colab?
Yes

Answer 1 · 2023-06-05T05:27:29.000Z

I have verified that training does work on V100 runtimes.

Answer 2 · 2023-06-06T06:07:52.000Z

Thanks for the info. We are currently updating the ML-Agents version to the official one, since we're merging our Hugging Face integration into it.

I'll keep you updated.

Answer 3 · 2023-06-26T08:20:43.000Z

We updated all the units with the official integration of ML-Agents 🤗.

I'm closing the issue.