huggingface/deep-rl-class

Unit 5 [HANDS-ON BUG]: Training won’t run on A100 GPU

rlanday opened this issue · 3 comments

I have Colab Pro, so I’m trying to do the hands-on for Unit 5 on an A100 GPU, but I’m running into this error when I try to run mlagents-learn:

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:145: UserWarning: NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
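For anyone who wants to check a runtime before starting training, here is a minimal sketch of the kind of comparison behind this warning. It assumes a simplified rule: a GPU counts as supported when the installed build lists at least one `sm_XY` entry with the same major version as the device. The helper names `parse_sm_list` and `is_capability_supported` are hypothetical and written for illustration; only `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are PyTorch calls, and that part is skipped when PyTorch or a GPU is not available.

```python
def parse_sm_list(arch_list):
    """Extract integer SM versions from entries such as 'sm_70'.

    Entries that do not start with 'sm_' (for example PTX-only
    'compute_XY' entries) are ignored in this simplified sketch.
    """
    return [int(arch.split("_")[1]) for arch in arch_list if arch.startswith("sm_")]


def is_capability_supported(capability, arch_list):
    """Return True if some compiled arch shares the device's major version.

    Assumption: matching on the major version is enough, which is a
    simplification of what the installed PyTorch actually checks.
    """
    major, _minor = capability
    return any(sm // 10 == major for sm in parse_sm_list(arch_list))


# Architectures listed in the warning above.
LAB_ARCHS = ["sm_37", "sm_50", "sm_60", "sm_70"]

print(is_capability_supported((8, 0), LAB_ARCHS))  # A100 (sm_80) -> False
print(is_capability_supported((7, 0), LAB_ARCHS))  # V100 (sm_70) -> True

# Optional: run the same check against the live runtime.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    live_capability = torch.cuda.get_device_capability(0)
    live_archs = torch.cuda.get_arch_list()
    print(live_capability, live_archs,
          is_capability_supported(live_capability, live_archs))
```

With the architecture list from the warning, the sketch reports the A100 as unsupported and the V100 as supported.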

It seems the PyTorch build used by the lab (torch-1.11.0) is out of date and does not include sm_80 support. I tried upgrading to the latest PyTorch and ran into the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. mlagents 0.31.0.dev0 requires torch<=1.11.0,>=1.8.0, but you have torch 2.0.1+cu118 which is incompatible.
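The conflict comes down to a version range check: mlagents pins `torch<=1.11.0,>=1.8.0`, and 2.0.1 falls outside it. Below is a minimal sketch of that comparison. It is a deliberate simplification and not pip's resolver: the hypothetical helpers `release_tuple` and `in_range` only handle plain numeric releases with an optional local suffix such as `+cu118`, and ignore pre-releases and the other PEP 440 rules that real tools (for example the `packaging` library) implement.

```python
def release_tuple(version):
    """Turn '2.0.1+cu118' into (2, 0, 1), dropping the local '+cu118' suffix."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split("."))


def in_range(version, lower, upper):
    """Check lower <= version <= upper on the numeric release parts only."""
    return release_tuple(lower) <= release_tuple(version) <= release_tuple(upper)


# Bounds taken from the pip error message above.
LOWER, UPPER = "1.8.0", "1.11.0"

print(in_range("2.0.1+cu118", LOWER, UPPER))  # False: newer than the upper bound
print(in_range("1.11.0", LOWER, UPPER))       # True: the version the lab installs
```

Comparing integer tuples rather than strings matters here, since a plain string comparison would order "1.8.0" after "1.11.0".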

I’m not sure if you can do anything about this at the moment or if ML-Agents needs to be updated first. I will pick one of the other GPUs and try again. If the bug can’t be fixed at the moment, I suggest adding a note to the top of the Colab about which GPUs are supported.

Link to my colab:
https://colab.research.google.com/drive/1tBgTYLTtswy2XQJycNg4HgW92eE2eD0L?usp=sharing

Material

  • Did you use Google Colab?
    Yes

I have verified that training does work on V100 runtimes.

Thanks for the info. We are currently replacing our ML-Agents version with the official one, since we're merging our Hugging Face integration into it.

I'll keep you updated.

We updated all the units to use the official ML-Agents integration 🤗.

I'm closing the issue.