JuliaGPU/AMDGPU.jl

Switching to device ≠ 1 hangs on multi-GPU node

luraess opened this issue · 2 comments

Short description

Switching to a device ≠ 1 on a multi-GPU node (LUMI MI250x) results in Julia hanging. Changing the default device with `default_device_id!` lets me switch devices at Julia startup, while switching with `device!` fails even at startup.

This behaviour occurs on AMDGPU v0.4.13 and on AMDGPU#master.
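For reference, the failure can be condensed into a short script. This is only a sketch of the sequence shown in the transcripts below (it requires a multi-GPU node and is not a tested standalone file):

```julia
using AMDGPU

# Allocation on the default device (device 1) succeeds:
AMDGPU.ones(4, 4)

# Switch the current task to the second GPU:
AMDGPU.device!(AMDGPU.devices()[2])

# This allocation then triggers a memory fault and hangs the session:
AMDGPU.ones(4, 4)
```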

julia> AMDGPU.versioninfo()
Using ROCm provided by: System
HSA Runtime (ready)
- Path: /opt/rocm/lib/libhsa-runtime64.so.1
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
HIP Runtime (ready)
- Path: /opt/rocm/lib/libamdhip64.so.5
rocBLAS (ready)
- Path: /opt/rocm/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm/lib/librocsolver.so
rocALUTION (ready)
- Path: /opt/rocm/lib/librocalution.so
rocSPARSE (ready)
- Path: /opt/rocm/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm/lib/libMIOpen.so
HSA Agents (12):
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- GPU-c96383daa806fd67 [gfx90a]
- GPU-9a4356d61a3e2421 [gfx90a]
- GPU-5b303e9096b8783a [gfx90a]
- GPU-0b64175228ee7027 [gfx90a]
- GPU-929a3274a9566ca5 [gfx90a]
- GPU-2b0e86a577d2d025 [gfx90a]
- GPU-1b0f4c6aeb8256b1 [gfx90a]
- GPU-9f26db390c83e0eb [gfx90a]

Details

Allocating memory on the default device 1 works as expected:

julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
 GPU-c96383daa806fd67 [gfx90a]
 GPU-9a4356d61a3e2421 [gfx90a]
 GPU-5b303e9096b8783a [gfx90a]
 GPU-0b64175228ee7027 [gfx90a]
 GPU-929a3274a9566ca5 [gfx90a]
 GPU-2b0e86a577d2d025 [gfx90a]
 GPU-1b0f4c6aeb8256b1 [gfx90a]
 GPU-9f26db390c83e0eb [gfx90a]

 julia> AMDGPU.default_device_id()
1

julia> AMDGPU.ones(4,4)
4×4 ROCMatrix{Float32}:
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0

Switching the default device to 2 and repeating the allocation fails:

julia> AMDGPU.default_device_id!(2)
GPU-9a4356d61a3e2421 [gfx90a]

julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-c96383daa806fd67 [gfx90a] at 0x000015266ffe4000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
  [1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
  [2] wait
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
  [3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
  [4] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
  [5] #gpu_call#1
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
  [6] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
  [7] fill!
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
  [8] ones
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
  [9] ones(::Int64, ::Int64)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
 [10] top-level scope
    @ REPL[11]:1

julia> 

Using `device!` instead of `default_device_id!` to switch from the default device fails right away:

julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
 GPU-c96383daa806fd67 [gfx90a]
 GPU-9a4356d61a3e2421 [gfx90a]
 GPU-5b303e9096b8783a [gfx90a]
 GPU-0b64175228ee7027 [gfx90a]
 GPU-929a3274a9566ca5 [gfx90a]
 GPU-2b0e86a577d2d025 [gfx90a]
 GPU-1b0f4c6aeb8256b1 [gfx90a]
 GPU-9f26db390c83e0eb [gfx90a]

julia> AMDGPU.device!(AMDGPU.devices()[2])
GPU-9a4356d61a3e2421 [gfx90a]

julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-9a4356d61a3e2421 [gfx90a] at 0x00001501ec20c000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
  [1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
  [2] wait
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
  [3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
  [4] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
  [5] #gpu_call#1
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
  [6] gpu_call
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
  [7] fill!
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
  [8] ones
    @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
  [9] ones(::Int64, ::Int64)
    @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
 [10] top-level scope
    @ REPL[5]:1

julia> 

After the failure, exiting Julia produces the following stack trace:

julia> 
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
__sched_yield at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x150418209464)
unknown function (ip: 0x15041820cace)
unknown function (ip: 0x1504182190f8)
unknown function (ip: 0x15041816f1ed)
hipStreamDestroy at /opt/rocm/lib/libamdhip64.so.5 (unknown line)
hipStreamDestroy at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/libhip.jl:59
#7 at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/HIP.jl:111
unknown function (ip: 0x1501f3a97332)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:553
ijl_atexit_hook at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/init.c:299
jl_repl_entrypoint at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/jlapi.c:718
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)

When switching devices with `device!`, I get a queue on the correct device, but the allocation then faults:

julia> AMDGPU.device!(AMDGPU.devices()[2])
GPU-9a4356d61a3e2421 [gfx90a]

julia> AMDGPU.queue()
ROCQueue(device=GPU-9a4356d61a3e2421 [gfx90a], ptr=0x000014ee43244000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)

julia> AMDGPU.ones(4,1)
┌ Error: Memory Fault on GPU-9a4356d61a3e2421 [gfx90a] at 0x000014ee4320e000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/BOrCS/src/runtime/fault.jl:25

When using `default_device_id!` to switch devices, selecting a device ≠ 1 works at the start of a Julia session, but fails when switching devices after memory has been allocated on the previous device. Printing the queue suggests that, in this case, the active queue is not changed to the correct device:

julia> AMDGPU.default_device_id!(3)
GPU-5b303e9096b8783a [gfx90a]

julia> AMDGPU.queue()
ROCQueue(device=GPU-5b303e9096b8783a [gfx90a], ptr=0x00001493b8670000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)

julia> AMDGPU.ones(4,1)
4×1 ROCMatrix{Float32}:
 1.0
 1.0
 1.0
 1.0

julia> AMDGPU.queue()
ROCQueue(device=GPU-5b303e9096b8783a [gfx90a], ptr=0x00001493b8670000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)

julia> AMDGPU.default_device_id!(1)
GPU-c96383daa806fd67 [gfx90a]

julia> AMDGPU.queue()
ROCQueue(device=GPU-5b303e9096b8783a [gfx90a], ptr=0x00001493b8670000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)

julia> AMDGPU.ones(4,1)
┌ Error: Memory Fault on GPU-5b303e9096b8783a [gfx90a] at 0x00001493b8618000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/BOrCS/src/runtime/fault.jl:25
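Based on the observations above, the only sequence that works appears to be selecting the device once, before any allocation. A workaround sketch (assumption on my side: this holds only if `default_device_id!` is called before the first GPU allocation in the session):

```julia
using AMDGPU

# Select the target device FIRST, before touching GPU memory:
AMDGPU.default_device_id!(3)

# Subsequent allocations and kernels then run on device 3 without faulting:
AMDGPU.ones(4, 1)
```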

@jpsamaroo @pxl-th Switching devices in the current task using `device_id!` throws a kernel signal exception and seems to hang in `wait` in the array constructor:

AMDGPU.jl/src/array.jl

Lines 11 to 14 in e380ffa

function GPUArrays.gpu_call(::ROCArrayBackend, f, args, threads::Int, blocks::Int; name::Union{String,Nothing})
    groupsize, gridsize = threads, blocks * threads
    wait(@roc groupsize=groupsize gridsize=gridsize f(ROCKernelContext(), args...))
end
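One hypothesis (unverified, an assumption on my side): the `wait` here may block on a kernel signal tied to a queue on the previously active device. A diagnostic sketch to check whether the active queue follows the device switch; field access `q.device` and `AMDGPU.default_device()` are assumed from the `ROCQueue` printout above, not verified against the API:

```julia
using AMDGPU

# Switch devices, then check which device the active queue belongs to:
AMDGPU.device!(AMDGPU.devices()[2])
q = AMDGPU.queue()

# If the queue did not follow the switch, these will disagree:
@show q.device
@show AMDGPU.default_device()
```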

It actually seems `fill!` itself has an issue, as the following reproduces the hang:

julia> b = ROCArray{Float64}(undef, 4)
4-element ROCVector{Float64}:
 -1.8325506472120096e-6
 -1.8325506472120096e-6
 -1.8325506472120096e-6
 -1.8325506472120096e-6

julia> fill!(b, 1.0)
┌ Error: Memory Fault on GPU-b87e3967a84c0da7 [gfx90a] at 0x0000154eacbfc000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
 [1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/runtime/kernel-signal.jl:63
 [2] wait
   @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/runtime/kernel-signal.jl:32 [inlined]
 [3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCVector{Float64}, Float64}, threads::Int64, blocks::Int64; name::Nothing)
   @ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/array.jl:13
 [4] gpu_call
   @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/array.jl:11 [inlined]
 [5] #gpu_call#1
   @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
 [6] gpu_call
   @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
 [7] fill!(A::ROCVector{Float64}, x::Float64)
   @ GPUArrays /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14
 [8] top-level scope
   @ REPL[10]:1

julia>