aredden/flux-fp8-api

Issue: torch._scaled_mm RuntimeError on RTX 6000 (with runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04)

veyorokon opened this issue · 2 comments

Description
When using the flux-fp8-api with configuration .configs/config-dev-1-RTX6000ADA.json on an RTX 6000, I receive a RuntimeError regarding unsupported torch._scaled_mm due to compute capability requirements. My environment uses the Docker image runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04.

Docker Image:
runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04

Error Details

RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

Relevant Configuration Path

config_path = ".configs/config-dev-1-RTX6000ADA.json"

Has anyone encountered this before?

Oh, it could be that you're using an RTX 6000 (non-Ada), which is different from the RTX 6000 Ada. They have similar names, but one is Ada generation and the other is from the previous generation, which has compute capability 8.6 (below the 8.9 / 9.0+ that `torch._scaled_mm` requires).
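A quick way to confirm which card you actually have is to read the compute capability from PyTorch and compare it against the requirement quoted in the error message (8.9 for Ada, or 9.0+ for Hopper and newer). This is a minimal sketch; `supports_scaled_mm` is a hypothetical helper name, not part of the flux-fp8-api or PyTorch:

```python
def supports_scaled_mm(major: int, minor: int) -> bool:
    """Mirror the requirement from the RuntimeError: FP8 matmul via
    torch._scaled_mm needs compute capability 8.9 (Ada) or >= 9.0."""
    return (major, minor) == (8, 9) or major >= 9


if __name__ == "__main__":
    try:
        import torch

        if torch.cuda.is_available():
            # get_device_capability returns (major, minor), e.g. (8, 9) for Ada
            major, minor = torch.cuda.get_device_capability()
            print(f"sm_{major}{minor}: scaled_mm supported = "
                  f"{supports_scaled_mm(major, minor)}")
        else:
            print("No CUDA device visible")
    except ImportError:
        print("torch not installed")
```

An RTX 6000 Ada reports `(8, 9)` and passes the check; an RTX A6000 (previous generation) reports `(8, 6)` and fails it, which matches the error above.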

gotcha - was wondering about that possibility - thank you