GPUOpen-LibrariesAndSDKs/HIPRTSDK

Segfaults in hiprtCreateGeometry when using distribution packaged hip

littlewu2508 opened this issue · 9 comments

I'm using a 6700 XT on Gentoo with dev-util/hip-5.6.0, upstream clang-16.0.6, and hiprt (buildID_linux.txt: 453).

00_context_creation passed.

When executing 01_geom_intersection64D, it segfaults. The stack trace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/opt/gentoo/lib64/libthread_db.so.1".
[New Thread 0x7fffe8dff6c0 (LWP 3748500)]
[New Thread 0x7fffe3fff6c0 (LWP 3748501)]
[Thread 0x7fffe3fff6c0 (LWP 3748501) exited]
hiprt ver.02000
Executing on 'AMD Radeon RX 6700 XT'

Thread 1 "01_geom_interse" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 2 (Thread 0x7fffe8dff6c0 (LWP 3748500) "01_geom_interse"):
#0  0x00007ffff7a47c1b in ioctl () from /opt/gentoo/lib64/libc.so.6
#1  0x00007ffff78f1df0 in ?? () from /opt/gentoo/usr/lib64/libhsakmt.so.1
#2  0x00007ffff78eb295 in hsaKmtWaitOnMultipleEvents () from /opt/gentoo/usr/lib64/libhsakmt.so.1
#3  0x00007ffff52dc285 in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#4  0x00007ffff52b859e in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#5  0x00007ffff52d1f6a in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#6  0x00007ffff527e537 in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#7  0x00007ffff79d0299 in ?? () from /opt/gentoo/lib64/libc.so.6
#8  0x00007ffff7a5332c in ?? () from /opt/gentoo/lib64/libc.so.6

Thread 1 (Thread 0x7ffff7ea5740 (LWP 3748497) "01_geom_interse"):
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7f3804a in ?? () from ../../hiprt/linux64/libhiprt0200064.so
#2  0x00007ffff7f621a4 in hiprtCreateGeometry () from ../../hiprt/linux64/libhiprt0200064.so
#3  0x000055555556e2eb in Tutorial::run (this=0x7fffffffc3e0) at ../01_geom_intersection/main.cpp:69
#4  0x000055555556deb6 in main (argc=1, argv=0x7fffffffc548) at ../01_geom_intersection/main.cpp:96

If I use AMD's ROCm distribution (at /opt/rocm) instead, I hit the same issue as in #15 (comment).

Hi @littlewu2508, we have released a new version (https://gpuopen.com/hiprt/). Could you please try it and let us know whether the issue still persists?

I can confirm that the issue persists with the newest hiprtsdk (2.1.c202dac).

Also, I used the Orochi bundled with hiprtsdk-2.0.0, because the newest one causes errors even in 00_context_creation:

Starting program: /data/wuyy/hiprt-2.1/tutorials/dist/bin/Debug/00_context_creation64D 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/opt/gentoo/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 1 (Thread 0x7ffff7e9e740 (LWP 3483337) "00_context_crea"):
#0  0x0000000000000000 in ?? ()
#1  0x0000555555559cf2 in oroGetErrorString (error=4294967295, pStr=0x7fffffffbf50) at ../../contrib/Orochi/Orochi/Orochi.cpp:242
#2  0x0000555555572ed6 in checkOro (res=4294967295, file=0x55555557b4e8 "../00_context_creation/main.cpp", line=29) at ../common/TutorialBase.cpp:33
#3  0x000055555556dbb5 in main (argc=1, argv=0x7fffffffc418) at ../00_context_creation/main.cpp:29

Which version of ROCm do you use? The provided binaries are available with 5.7 (https://repo.radeon.com/amdgpu-install/23.20/ubuntu/focal/).

I am using ROCm 5.7.1

Just confirming: neither 2.1-alt1.gc202dac nor v2.2.0e68f54 works with ROCm 5.7.1. I'm getting segfaults in all tutorials.

Example segfault backtrace with 2.1:

(gdb) run
Starting program: /opt/git/upstream/HIPRTSDK/tutorials/dist/bin/DebugGpu/01_geom_intersection64D
Downloading separate debug info for system-supplied DSO at 0x7ffff7fc8000
Downloading separate debug info for /usr/lib64/libhiprt0200164.so
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Downloading separate debug info for /usr/lib64/libamdhip64.so
Missing separate debuginfo for /usr/lib64/libamdhip64.so.
Try to install the hash file /usr/lib/debug/.build-id/10/78f70f65ce207875e9f834533bc0763834fdf2.debug
Downloading separate debug info for /usr/lib64/libhiprtc.so
Missing separate debuginfo for /usr/lib64/libhiprtc.so.
Try to install the hash file /usr/lib/debug/.build-id/6a/dc9289b47bd759efbddd543d6361c8089e52d3.debug
[New Thread 0x7fffeda196c0 (LWP 150839)]
[New Thread 0x7ffeed1ff6c0 (LWP 150840)]
[Thread 0x7ffeed1ff6c0 (LWP 150840) exited]
hiprt ver.02001
Executing on 'AMD Radeon RX 6700 XT'

Thread 1 "01_geom_interse" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 2 (Thread 0x7fffeda196c0 (LWP 150839) "01_geom_interse"):
#0  __GI___ioctl (fd=fd@entry=3, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00007ffff7d25f48 in kmtIoctl (fd=3, request=request@entry=3222817548, arg=arg@entry=0x7fffeda18bc0) at /usr/src/debug/roct-thunk-interface-5.7.1/src/libhsakmt.c:13
#2  0x00007ffff7d27150 in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x7fffeda18c70, Milliseconds=4294967294, WaitOnAll=<optimized out>, NumEvents=3, Events=0x7fffeda18d00) at /usr/src/debug/roct-thunk-interface-5.7.1/src/events.c:407
#3  hsaKmtWaitOnMultipleEvents_Ext (Events=0x7fffeda18d00, NumEvents=3, WaitOnAll=<optimized out>, Milliseconds=4294967294, event_age=0x7fffeda18c70) at /usr/src/debug/roct-thunk-interface-5.7.1/src/events.c:378
#4  0x00007fffedc7d2be in rocr::core::Signal::WaitAny (signal_count=signal_count@entry=6, hsa_signals=hsa_signals@entry=0x7ffee8000de0, conds=conds@entry=0x7ffee8000be0, values=values@entry=0x7ffee8000e30, timeout=timeout@entry=18446744073709551615, wait_hint=<optimized out>, wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=<optimized out>) at /usr/src/debug/rocr-runtime-5.7.1/src/core/runtime/signal.cpp:321
#5  0x00007fffedc5b21e in rocr::AMD::hsa_amd_signal_wait_any (signal_count=6, hsa_signals=0x7ffee8000de0, conds=0x7ffee8000be0, values=0x7ffee8000e30, timeout_hint=timeout_hint@entry=18446744073709551615, wait_hint=wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=0x7fffeda18e38) at /usr/src/debug/rocr-runtime-5.7.1/src/core/runtime/hsa_ext_amd.cpp:572
#6  0x00007fffedc75bda in rocr::core::Runtime::AsyncEventsLoop () at /usr/src/debug/rocr-runtime-5.7.1/src/core/runtime/runtime.cpp:1125
#7  0x00007fffedc277b7 in rocr::os::ThreadTrampoline (arg=<optimized out>) at /usr/src/debug/rocr-runtime-5.7.1/src/core/util/lnx/os_linux.cpp:80
#8  0x00007ffff78a392b in start_thread (arg=<optimized out>) at pthread_create.c:444
#9  0x00007ffff7925cb8 in clone3 () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff7dab740 (LWP 150801) "01_geom_interse"):
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7f0306f in ?? () from /usr/lib64/libhiprt0200164.so
#2  0x00007ffff7f2afe7 in hiprtCreateGeometries () from /usr/lib64/libhiprt0200164.so
#3  0x00007ffff7f2b0af in hiprtCreateGeometry () from /usr/lib64/libhiprt0200164.so
#4  0x0000555555559ec2 in Tutorial::run (this=0x7fffffffdb00) at ../01_geom_intersection/main.cpp:69
#5  0x0000555555559a01 in main (argc=1, argv=0x7fffffffdc68) at ../01_geom_intersection/main.cpp:96

Sorry for the late reply. Could you try this particular version of 5.7, please? https://repo.radeon.com/amdgpu-install/23.20/ubuntu/focal/

It seems that Orochi did not load the function. Could you please check these paths on your system? https://github.com/amdadvtech/Orochi/blob/cdf5c7624dd826335c2d2022ddfb770178cad46a/contrib/hipew/src/hipew.cpp#L295-L298
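
For context, a frame #0 at 0x0000000000000000 is what a call through an unresolved (null) function pointer typically looks like. The following is a minimal sketch of guarding such a call, not Orochi's actual code: `ApiTable`, `GetErrorStringFn`, `safeErrorString`, and `demoGetErrorString` are hypothetical names invented for illustration.

```cpp
#include <string>

// Hypothetical signature for a dynamically resolved "error to string" entry point.
using GetErrorStringFn = int (*)(int error, const char** out);

// Hypothetical table of entry points. A loader would fill these in after
// opening the runtime library; a failed lookup leaves the pointer null.
struct ApiTable {
    GetErrorStringFn getErrorString = nullptr;
};

// Calls through the table only if the symbol was resolved, so a missing
// library yields a readable message instead of a jump to address 0.
std::string safeErrorString(const ApiTable& api, int error) {
    if (api.getErrorString == nullptr) {
        return "runtime library not loaded (symbol unresolved)";
    }
    const char* text = nullptr;
    if (api.getErrorString(error, &text) != 0 || text == nullptr) {
        return "unknown error";
    }
    return text;
}

// Stand-in implementation used only to exercise the resolved path.
int demoGetErrorString(int error, const char** out) {
    *out = (error == 0) ? "success" : "failure";
    return 0;
}
```

With a default-constructed `ApiTable`, `safeErrorString(api, -1)` returns the "not loaded" message; after assigning `demoGetErrorString` to the table, it returns the stand-in's text.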

Oh, these locations do not exist on my system. My HIP libraries are installed in /opt/gentoo/usr/lib64.
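
One way a loader could cope with non-standard install prefixes like this is to probe several candidate locations before giving up. The sketch below uses the POSIX `dlopen` call. It is an illustration under assumptions, not the hipew/Orochi implementation: the environment variable `EXAMPLE_HIP_LIB_DIR` and the helpers `candidatePaths` and `openFirst` are hypothetical, and `/opt/rocm/lib` is assumed to be the vendor default directory.

```cpp
#include <dlfcn.h>
#include <cstdlib>
#include <string>
#include <vector>

// Builds the list of locations to try, most specific first.
// EXAMPLE_HIP_LIB_DIR is a hypothetical override, not a real ROCm variable.
std::vector<std::string> candidatePaths(const std::string& libName) {
    std::vector<std::string> paths;
    if (const char* dir = std::getenv("EXAMPLE_HIP_LIB_DIR")) {
        paths.push_back(std::string(dir) + "/" + libName);
    }
    paths.push_back("/opt/rocm/lib/" + libName);  // assumed vendor default
    paths.push_back(libName);  // bare name: let the dynamic linker search its own paths
    return paths;
}

// Returns the first handle that loads, or nullptr if none does.
// On success, *loadedFrom (if provided) receives the path that worked.
void* openFirst(const std::vector<std::string>& paths, std::string* loadedFrom) {
    for (const std::string& path : paths) {
        if (void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL)) {
            if (loadedFrom != nullptr) {
                *loadedFrom = path;
            }
            return handle;
        }
    }
    return nullptr;
}
```

For the setup described here, setting `EXAMPLE_HIP_LIB_DIR=/opt/gentoo/usr/lib64` would make `candidatePaths("libamdhip64.so")` try `/opt/gentoo/usr/lib64/libamdhip64.so` first. On older glibc versions the program may need to be linked with `-ldl`.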

After fixing this, I get a segfault similar to the one @LAKostis reported:

Thread 2 (Thread 0x7fffe89ff6c0 (LWP 62143) "01_geom_interse"):
#0  0x00007ffff7a5627b in ioctl () from /opt/gentoo/lib64/libc.so.6
#1  0x00007ffff7909e80 in ?? () from /opt/gentoo/usr/lib64/libhsakmt.so.1
#2  0x00007ffff7902ce6 in hsaKmtWaitOnMultipleEvents_Ext () from /opt/gentoo/usr/lib64/libhsakmt.so.1
#3  0x00007ffff52e52ca in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#4  0x00007ffff52bd30e in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#5  0x00007ffff52dafea in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#6  0x00007ffff52825a7 in ?? () from /opt/gentoo/usr/lib64/libhsa-runtime64.so.1
#7  0x00007ffff79e7069 in ?? () from /opt/gentoo/lib64/libc.so.6
#8  0x00007ffff7a5a708 in ?? () from /opt/gentoo/lib64/libc.so.6

Thread 1 (Thread 0x7ffff7e99740 (LWP 62117) "01_geom_interse"):
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7f3706f in ?? () from ../../hiprt/linux64/libhiprt0200164.so
#2  0x00007ffff7f5efe7 in hiprtCreateGeometries () from ../../hiprt/linux64/libhiprt0200164.so
#3  0x00007ffff7f5f0af in hiprtCreateGeometry () from ../../hiprt/linux64/libhiprt0200164.so
#4  0x000055555556e05c in Tutorial::run (this=0x7fffffffc240) at ../01_geom_intersection/main.cpp:69
#5  0x000055555556dbd8 in main (argc=1, argv=0x7fffffffc3a8) at ../01_geom_intersection/main.cpp:96

Hello, in the meantime we have released the source code of HIPRT. I know it's not a perfect solution, but you can try compiling HIPRT for your system. The compilation should be straightforward: https://github.com/GPUOpen-LibrariesAndSDKs/HIPRT