NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.

Question

andysingal opened this issue a year ago · 2 comments

Describe the bug

Notebook link to reproduce the error:
https://colab.research.google.com/drive/1Mw1K4QuCmnSp6YGFmqpmWtdECWBNVX5X?usp=sharing
ERROR:

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:145: UserWarning: 
NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[WARNING] Trainer has no policies, not saving anything.
Traceback (most recent call last):
  File "/usr/local/bin/mlagents-learn", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "/content/ml-agents/ml-agents/mlagents/trainers/learn.py", line 264, in main
    run_cli(parse_command_line())
  File "/content/ml-agents/ml-agents/mlagents/trainers/learn.py", line 260, in run_cli
    run_training(run_seed, options, num_areas)
  File "/content/ml-agents/ml-agents/mlagents/trainers/learn.py", line 136, in run_training
    tc.start_learning(env_manager)
  File "/content/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 172, in start_learning
    self._reset_env(env_manager)
  File "/content/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 107, in _reset_env
    self._register_new_behaviors(env_manager, env_manager.first_step_infos)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 267, in _register_new_behaviors
    self._create_trainers_and_managers(env_manager, new_behavior_ids)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 165, in _create_trainers_and_managers
    self._create_trainer_and_manager(env_manager, behavior_id)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 137, in _create_trainer_and_manager
    policy = trainer.create_policy(
  File "/content/ml-agents/ml-agents/mlagents/trainers/ppo/trainer.py", line 194, in create_policy
    policy = TorchPolicy(
  File "/content/ml-agents/ml-agents/mlagents/trainers/policy/torch_policy.py", line 41, in __init__
    GlobalSteps()
  File "/content/ml-agents/ml-agents/mlagents/trainers/torch_entities/networks.py", line 748, in __init__
    torch.Tensor([0]).to(torch.int64), requires_grad=False
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
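
The warning above says the installed PyTorch build ships kernels for sm_37, sm_50, sm_60 and sm_70 only, while the A100 reports sm_80, which is why the first CUDA operation fails with "no kernel image is available". A minimal sketch of how to check this from the notebook follows. The helper names `has_compatible_kernels` and `check_current_device` are made up for illustration, and the matching rule (same major version, equal or lower minor version) is a simplification of what PyTorch does internally. It also ignores PTX entries such as `compute_80`, so treat a `False` result as a hint rather than a verdict.

```python
import re


def has_compatible_kernels(capability, arch_list):
    """Return True if arch_list holds a binary kernel usable on the device.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100 (sm_80).
    arch_list:  strings such as ["sm_37", "sm_50", "sm_60", "sm_70"].
    Simplified rule (an assumption of this sketch): an sm_XY binary runs on
    a device with the same major version X and a minor version >= Y.
    """
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)", arch)
        if match is None:
            continue  # skip PTX-only entries such as "compute_80"
        arch_major, arch_minor = divmod(int(match.group(1)), 10)
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def check_current_device(index=0):
    """Compare the GPU at `index` with the architectures PyTorch was built for."""
    import torch  # imported lazily so the helper above works without torch

    if not torch.cuda.is_available():
        print("CUDA is not available in this runtime")
        return None
    capability = torch.cuda.get_device_capability(index)
    arch_list = torch.cuda.get_arch_list()
    ok = has_compatible_kernels(capability, arch_list)
    print(f"device capability: sm_{capability[0]}{capability[1]}")
    print(f"build arch list:   {' '.join(arch_list)}")
    print(f"compatible binary kernels found: {ok}")
    return ok
```

Calling `check_current_device()` in the Colab runtime prints the device capability next to the build's architecture list. If the list has no sm_8x entry, the likely fix is to install a PyTorch build that targets sm_80, following the instructions linked in the warning.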

Material

Did you use Google Colab?
yes

Answer 1 · 2023-08-21T11:10:02.000Z

Hey there 👋 I just checked the notebook, and it seems you don't have this error anymore? 🤔

Your model: https://huggingface.co/Andyrasika/ppo-Huggy

Answer 2 · 2023-09-13T09:16:59.000Z

Closing the issue for now 🤗