GridTools/gt4py

Detection of ROCM vs CUDA device on Clariden

edopao opened this issue · 4 comments

I encountered an issue when running GT4Py with the gtfn_gpu backend on Clariden. I am running on a GPU node with an Nvidia A100, but this code selects the ROCM cupy device:

CUPY_DEVICE: Final[Literal[None, core_defs.DeviceType.CUDA, core_defs.DeviceType.ROCM]] = (
    None
    if not cp
    else (core_defs.DeviceType.ROCM if cp.cuda.get_hipcc_path() else core_defs.DeviceType.CUDA)
)

I suspect that CUDA was installed on Clariden with support for both Nvidia and AMD GPUs, depending on the type of node allocated by Slurm, so the ROCm compiler (hipcc) is present even on this Nvidia node (see the quick check below).
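
A quick way to see this from Python is to query the same function the detection code uses (a sketch, assuming only that cupy is importable; cp.cuda.get_hipcc_path is the call already used above):

import cupy as cp

# Returns a path to hipcc if the ROCm toolchain is installed on the
# system, regardless of which backend this cupy wheel was built for.
# On this A100 node it returns a path, so GT4Py concludes the device
# is ROCM even though cupy is the CUDA build.
print(cp.cuda.get_hipcc_path())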

You can run this test:

pytest -s -v -k gtfn_gpu tests/next_tests/integration_tests/multi_feature_tests/ffront_tests/test_icon_like_scan.py::test_solve_nonhydro_stencil_52_like

It will produce this output:

        if self.device_type == core_defs.DeviceType.ROCM:
            # until we can rely on dlpack
>           ndarray.__hip_array_interface__ = {  # type: ignore[attr-defined]
                "shape": ndarray.shape,  # type: ignore[union-attr]
                "typestr": ndarray.dtype.descr[0][1],  # type: ignore[union-attr]
                "descr": ndarray.dtype.descr,  # type: ignore[union-attr]
                "stream": 1,
                "version": 3,
                "strides": ndarray.strides,  # type: ignore[union-attr, attr-defined]
                "data": (ndarray.data.ptr, False),  # type: ignore[union-attr, attr-defined]
            }
E           AttributeError: 'ndarray' object has no attribute '__hip_array_interface__'

src/gt4py/storage/allocators.py:270: AttributeError
================================================================= short test summary info ==================================================================
ERROR tests/next_tests/integration_tests/multi_feature_tests/ffront_tests/test_icon_like_scan.py::test_solve_nonhydro_stencil_52_like[gtfn.run_gtfn_gpu] - AttributeError: 'ndarray' object has no attribute '__hip_array_interface__'

Does the environment have an installation of cupy-rocm or just cupy-cuda? When we wrote that code, there was no clean/documented way to have cupy installed for both GPU types. Not sure if that has changed.

cupy-cuda11x 13.0.0

>>> import cupy as cp
>>> cp.cuda.get_hipcc_path()
'/usr/bin/hipcc'

Maybe we should use this variable instead: cp.cuda.runtime.is_hip
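
Something along these lines, as a sketch of the detection with is_hip swapped in (same structure as the current code; is_hip is a flag in cupy.cuda.runtime that is True only for ROCm/HIP builds of cupy):

CUPY_DEVICE: Final[Literal[None, core_defs.DeviceType.CUDA, core_defs.DeviceType.ROCM]] = (
    None
    if not cp
    # Check how the installed cupy was built, not whether hipcc exists on the system.
    else (core_defs.DeviceType.ROCM if cp.cuda.runtime.is_hip else core_defs.DeviceType.CUDA)
)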

Yes, that seems to work!