ray-project/ray

[Core] Incorrectly detected TPU on a HPU-only node.


What happened + What you expected to happen

image

The author of this PR runs a distributed training workload on an 8-HPU node; however, Ray detects an additional TPU in the cluster. This could be a bug in Ray Core's device detection.

Versions / Dependencies

nightly

Reproduction script

Issue Severity

Low: It annoys or frustrates me.

@allenwang28 would you mind taking a look?

Thanks for the tag! Does the HPU node have something listed at /dev/vfio or /dev/accel*?
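
For anyone checking this on their own node, here is a minimal Python sketch of that lookup. The function name `list_device_entries` and the `dev_root` parameter are made up for illustration; the script only lists what exists at those two paths and does not reproduce Ray's detection logic.

```python
import glob
import os


def list_device_entries(dev_root="/dev"):
    """List entries inside <dev_root>/vfio and paths matching <dev_root>/accel*.

    `dev_root` is a hypothetical parameter so the check can be pointed at a
    test directory; on a real node the default "/dev" is what matters.
    """
    vfio_dir = os.path.join(dev_root, "vfio")
    try:
        # Names of the entries inside the vfio directory, if it exists.
        vfio_entries = sorted(os.listdir(vfio_dir))
    except OSError:
        # Directory missing or unreadable: treat as "nothing listed".
        vfio_entries = []

    # Full paths of anything named accel*, e.g. /dev/accel0.
    accel_entries = sorted(glob.glob(os.path.join(dev_root, "accel*")))

    return {"vfio": vfio_entries, "accel": accel_entries}


if __name__ == "__main__":
    for name, entries in list_device_entries().items():
        print(f"{name}: {entries if entries else 'nothing found'}")
```

Running it on the HPU node and pasting the output here should answer the question: any non-empty list means something is present at that path.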