JuliaGPU/AMDGPU.jl

Support for multi-GPU nodes broken in 0.7

Keluaa opened this issue · 3 comments

Keluaa commented

I am on a node with 2 GPUs, and loading the package fails during initialization:

julia> using AMDGPU
ERROR: InitError: BoundsError: attempt to access 1-element Vector{AMDGPU.HIP.HIPDevice} at index [2]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:13 [inlined]
  [2] AMDGPU.HIP.HIPDevice(device_id::Int64)
    @ AMDGPU.HIP /briandl/.julia/packages/AMDGPU/aIM2W/src/hip/device.jl:12
  [3] devices()
    @ AMDGPU.HIP /briandl/.julia/packages/AMDGPU/aIM2W/src/hip/device.jl:87
  [4] __init__()
    @ AMDGPU /briandl/.julia/packages/AMDGPU/aIM2W/src/AMDGPU.jl:208
  [5] register_restored_modules(sv::Core.SimpleVector, pkg::Base.PkgId, path::String)
    @ Base ./loading.jl:1115
  [6] _include_from_serialized(pkg::Base.PkgId, path::String, ocachepath::String, depmods::Vector{Any})
    @ Base ./loading.jl:1061
  [7] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt128)
    @ Base ./loading.jl:1506
  [8] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1783
  [9] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1660
 [10] macro expansion
    @ ./loading.jl:1648 [inlined]
 [11] macro expansion
    @ ./lock.jl:267 [inlined]
 [12] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1611
during initialization of module AMDGPU

I believe this happens because ALL_DEVICES is filled by HIP.devices(), which relies on the HIPDevice constructor, and that constructor may be missing a bounds check.
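To illustrate the suspected failure mode, here is a minimal standalone sketch. It does not use AMDGPU.jl at all: `FakeDevice`, `DEVICE_CACHE`, `device_count`, `unchecked_device`, `checked_device` and `all_devices` are hypothetical names, and the lazily grown cache is an assumption about the design, not the package's actual code. It only shows how indexing a partially filled cache raises the same kind of BoundsError, and how a bounds check avoids it.

```julia
# Hypothetical stand-in for a device handle; NOT the real AMDGPU.HIP.HIPDevice.
struct FakeDevice
    id::Int
end

# Cache of devices created so far (assumed design, for illustration only).
const DEVICE_CACHE = FakeDevice[]

# Pretend the runtime reports two devices, as on the 2 GPU node above.
device_count() = 2

# Unchecked lookup: indexes the cache directly, so requesting device 2
# while only one entry is cached throws a BoundsError.
unchecked_device(id::Int) = DEVICE_CACHE[id]

# Checked lookup: validate the id, then grow the cache up to `id`
# before indexing into it.
function checked_device(id::Int)
    1 <= id <= device_count() || throw(ArgumentError("invalid device id $id"))
    for i in (length(DEVICE_CACHE) + 1):id
        push!(DEVICE_CACHE, FakeDevice(i))
    end
    return DEVICE_CACHE[id]
end

# Enumerate every device through the checked path.
all_devices() = [checked_device(i) for i in 1:device_count()]
```

With a single cached entry, `unchecked_device(2)` fails with a BoundsError, while `all_devices()` fills the cache on demand and returns both entries.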

pxl-th commented

My bad, this should be fixed by #528.
It would be great if you could confirm.

Keluaa commented

> should be fixed by #528.

Yep, the issue is gone.

pxl-th commented

Thanks!