Switching to device ≠ 1 hangs on multi-GPU node
luraess opened this issue · 2 comments
Short description
Switching to a device ≠ 1 on a multi-GPU node (LUMI, MI250x) results in Julia hanging. Changing the default_device allows switching device at Julia startup, while changing device fails even at startup.
This behaviour occurs on AMDGPU v0.4.13 and on AMDGPU#master.
julia> AMDGPU.versioninfo()
Using ROCm provided by: System
HSA Runtime (ready)
- Path: /opt/rocm/lib/libhsa-runtime64.so.1
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
HIP Runtime (ready)
- Path: /opt/rocm/lib/libamdhip64.so.5
rocBLAS (ready)
- Path: /opt/rocm/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm/lib/librocsolver.so
rocALUTION (ready)
- Path: /opt/rocm/lib/librocalution.so
rocSPARSE (ready)
- Path: /opt/rocm/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm/lib/libMIOpen.so
HSA Agents (12):
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- CPU-XX [AMD EPYC 7A53 64-Core Processor]
- GPU-c96383daa806fd67 [gfx90a]
- GPU-9a4356d61a3e2421 [gfx90a]
- GPU-5b303e9096b8783a [gfx90a]
- GPU-0b64175228ee7027 [gfx90a]
- GPU-929a3274a9566ca5 [gfx90a]
- GPU-2b0e86a577d2d025 [gfx90a]
- GPU-1b0f4c6aeb8256b1 [gfx90a]
- GPU-9f26db390c83e0eb [gfx90a]
Details
Allocating memory on the default device (1) works as expected:
julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
GPU-c96383daa806fd67 [gfx90a]
GPU-9a4356d61a3e2421 [gfx90a]
GPU-5b303e9096b8783a [gfx90a]
GPU-0b64175228ee7027 [gfx90a]
GPU-929a3274a9566ca5 [gfx90a]
GPU-2b0e86a577d2d025 [gfx90a]
GPU-1b0f4c6aeb8256b1 [gfx90a]
GPU-9f26db390c83e0eb [gfx90a]
julia> AMDGPU.default_device_id()
1
julia> AMDGPU.ones(4,4)
4×4 ROCMatrix{Float32}:
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
Switching the default device to device 2 and repeating the allocation fails:
julia> AMDGPU.default_device_id!(2)
GPU-9a4356d61a3e2421 [gfx90a]
julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-c96383daa806fd67 [gfx90a] at 0x000015266ffe4000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
[1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
[2] wait
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
[3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
[4] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
[5] #gpu_call#1
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
[6] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
[7] fill!
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
[8] ones
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
[9] ones(::Int64, ::Int64)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
[10] top-level scope
@ REPL[11]:1
julia>
Using device instead of default_device to switch away from the default device fails right away:
julia> AMDGPU.devices()
8-element Vector{ROCDevice}:
GPU-c96383daa806fd67 [gfx90a]
GPU-9a4356d61a3e2421 [gfx90a]
GPU-5b303e9096b8783a [gfx90a]
GPU-0b64175228ee7027 [gfx90a]
GPU-929a3274a9566ca5 [gfx90a]
GPU-2b0e86a577d2d025 [gfx90a]
GPU-1b0f4c6aeb8256b1 [gfx90a]
GPU-9f26db390c83e0eb [gfx90a]
julia> AMDGPU.device!(AMDGPU.devices()[2]))
ERROR: syntax: extra token ")" after end of expression
Stacktrace:
[1] top-level scope
@ none:1
julia> AMDGPU.device!(AMDGPU.devices()[2])
GPU-9a4356d61a3e2421 [gfx90a]
julia> AMDGPU.ones(4,4)
┌ Error: Memory Fault on GPU-9a4356d61a3e2421 [gfx90a] at 0x00001501ec20c000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
[1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:63
[2] wait
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/runtime/kernel-signal.jl:32 [inlined]
[3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCMatrix{Float32}, Float32}, threads::Int64, blocks::Int64; name::Nothing)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:13
[4] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:11 [inlined]
[5] #gpu_call#1
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
[6] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
[7] fill!
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14 [inlined]
[8] ones
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:383 [inlined]
[9] ones(::Int64, ::Int64)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/array.jl:382
[10] top-level scope
@ REPL[5]:1
julia>
After the failure, exiting Julia results in the following stack trace:
julia>
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
__sched_yield at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x150418209464)
unknown function (ip: 0x15041820cace)
unknown function (ip: 0x1504182190f8)
unknown function (ip: 0x15041816f1ed)
hipStreamDestroy at /opt/rocm/lib/libamdhip64.so.5 (unknown line)
hipStreamDestroy at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/libhip.jl:59
#7 at /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/9VGcR/src/hip/HIP.jl:111
unknown function (ip: 0x1501f3a97332)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gc.c:553
ijl_atexit_hook at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/init.c:299
jl_repl_entrypoint at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/jlapi.c:718
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
When switching device with device!, I get the queue on the correct device, but the allocation then faults:
julia> AMDGPU.device!(AMDGPU.devices()[2])
GPU-9a4356d61a3e2421 [gfx90a]
julia> AMDGPU.queue()
ROCQueue(device=GPU-9a4356d61a3e2421 [gfx90a], ptr=0x000014ee43244000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)
julia> AMDGPU.ones(4,1)
┌ Error: Memory Fault on GPU-9a4356d61a3e2421 [gfx90a] at 0x000014ee4320e000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/BOrCS/src/runtime/fault.jl:25
When using default_device_id! to switch device, selecting a device ≠ 1 at the start of the Julia session works, but switching device again after having allocated memory on the previous device fails. Printing the queue suggests that in this case the active queue is not changed to the new device:
julia> AMDGPU.default_device_id!(3)
GPU-5b303e9096b8783a [gfx90a]
julia> AMDGPU.queue()
ROCQueue(device=GPU-5b303e9096b8783a [gfx90a], ptr=0x00001493b8670000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)
julia> AMDGPU.ones(4,1)
4×1 ROCMatrix{Float32}:
1.0
1.0
1.0
1.0
julia> AMDGPU.queue()
ROCQueue(device=GPU-5b303e9096b8783a [gfx90a], ptr=0x00001493b8670000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)
julia> AMDGPU.default_device_id!(1)
GPU-c96383daa806fd67 [gfx90a]
julia> AMDGPU.queue()
ROCQueue(device=GPU-5b303e9096b8783a [gfx90a], ptr=0x00001493b8670000, priority=normal, status=HSA_STATUS_SUCCESS, active=true, running=false)
julia> AMDGPU.ones(4,1)
┌ Error: Memory Fault on GPU-5b303e9096b8783a [gfx90a] at 0x00001493b8618000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/BOrCS/src/runtime/fault.jl:25
@jpsamaroo @pxl-th switching device in the current task using device_id! throws a kernel signal exception and seems to hang in wait in the array constructor (lines 11 to 14 in e380ffa).
It actually seems that fill! is the issue, as the following reproduces the hang:
julia> b = ROCArray{Float64}(undef, 4)
4-element ROCVector{Float64}:
-1.8325506472120096e-6
-1.8325506472120096e-6
-1.8325506472120096e-6
-1.8325506472120096e-6
julia> fill!(b, 1.0)
┌ Error: Memory Fault on GPU-b87e3967a84c0da7 [gfx90a] at 0x0000154eacbfc000:
│ PAGE_NOT_PRESENT, READ_ONLY, NX, HOST_ONLY, DRAMECC, IMPRECISE, SRAMECC, HANG
│ GPU kernels are now hung, please restart Julia
└ @ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/runtime/fault.jl:25
^CERROR: InterruptException:
Stacktrace:
[1] wait(kersig::AMDGPU.Runtime.ROCKernelSignal; check_exceptions::Bool, cleanup::Bool, signal_kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ AMDGPU.Runtime /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/runtime/kernel-signal.jl:63
[2] wait
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/runtime/kernel-signal.jl:32 [inlined]
[3] gpu_call(::AMDGPU.ROCArrayBackend, f::Function, args::Tuple{ROCVector{Float64}, Float64}, threads::Int64, blocks::Int64; name::Nothing)
@ AMDGPU /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/array.jl:13
[4] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/5wwaC/src/array.jl:11 [inlined]
[5] #gpu_call#1
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:65 [inlined]
[6] gpu_call
@ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/device/execution.jl:34 [inlined]
[7] fill!(A::ROCVector{Float64}, x::Float64)
@ GPUArrays /scratch/project_465000557/lurass/julia_local/julia_depot/packages/GPUArrays/TnEpb/src/host/construction.jl:14
[8] top-level scope
@ REPL[10]:1
julia>
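For convenience, the REPL sessions above can be condensed into a single script. This is only a sketch that combines the device! switch with the fill! reproducer, using calls that already appear in this thread. It assumes AMDGPU v0.4.13 (or a similar version), a fresh Julia session, and a node with at least two visible GPUs. It has not been verified beyond what the logs above show.

```julia
using AMDGPU

# Assumption: the node exposes at least two GPUs, as on the LUMI node above.
devs = AMDGPU.devices()
@assert length(devs) >= 2 "this sketch needs at least two visible GPUs"

# Switch the current task to the second GPU, as in the device! session above.
AMDGPU.device!(devs[2])

# The queue is reported on the selected device in that session.
println("active queue: ", AMDGPU.queue())

# The uninitialized allocation returned in the report; the memory fault
# appeared once fill! launched a kernel.
b = ROCArray{Float64}(undef, 4)
fill!(b, 1.0)  # reported above to trigger the memory fault and hang
```

If the problem is present, the script is expected to stall at the fill! call and Julia has to be restarted, as the error message in the logs indicates.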