segfault in clGetPlatformIDs() on CUDA 12 when OpenCL built as plugin
What version of hwloc are you using?
- 2.9.3
- lstopo 2.9.3
- ldd /opt/apps/hwloc/2.9.3/bin/lstopo
linux-vdso.so.1 => (0x00007fffae325000)
libhwloc.so.15 => /opt/apps/hwloc/2.9.3/lib/libhwloc.so.15 (0x00007f402b51d000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f402b319000)
libm.so.6 => /lib64/libm.so.6 (0x00007f402b017000)
libncursesw.so.5 => /lib64/libncursesw.so.5 (0x00007f402addf000)
libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00007f402abb5000)
libcairo.so.2 => /lib64/libcairo.so.2 (0x00007f402a87e000)
libSM.so.6 => /lib64/libSM.so.6 (0x00007f402a676000)
libICE.so.6 => /lib64/libICE.so.6 (0x00007f402a45a000)
libX11.so.6 => /lib64/libX11.so.6 (0x00007f402a11c000)
libc.so.6 => /lib64/libc.so.6 (0x00007f4029d4e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f402b78f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4029b32000)
libpixman-1.so.0 => /lib64/libpixman-1.so.0 (0x00007f4029889000)
libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f4029647000)
libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f4029388000)
libEGL.so.1 => /lib64/libEGL.so.1 (0x00007f4029174000)
libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f4028f49000)
libxcb-shm.so.0 => /lib64/libxcb-shm.so.0 (0x00007f4028d45000)
libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f4028b1d000)
libxcb-render.so.0 => /lib64/libxcb-render.so.0 (0x00007f402890f000)
libXrender.so.1 => /lib64/libXrender.so.1 (0x00007f4028704000)
libXext.so.6 => /lib64/libXext.so.6 (0x00007f40284f2000)
libz.so.1 => /lib64/libz.so.1 (0x00007f40282dc000)
libGL.so.1 => /lib64/libGL.so.1 (0x00007f4028050000)
librt.so.1 => /lib64/librt.so.1 (0x00007f4027e48000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4027c43000)
libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f4027a18000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f4027808000)
libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00007f4027552000)
libXau.so.6 => /lib64/libXau.so.6 (0x00007f402734e000)
libGLX.so.0 => /lib64/libGLX.so.0 (0x00007f402711c000)
Which operating system and hardware are you running on?
uname -a
Linux 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- The operating system is CentOS 7.9, 64-bit, with 4 Intel Xeon Platinum CPUs (24 cores each, 4.40GHz) and 385 GB of RAM.
- nvidia-smi
Tue Dec 5 17:48:29 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro M5000 Off| 00000000:25:00.0 Off | Off |
| 41% 35C P0 48W / 150W| 0MiB / 8192MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Details of the problem
- Configuration command:
./configure --prefix=${WHERE_TO_INSTALL} --enable-debug --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --enable-opencl --with-cuda=${WHEREIS_CUDA}
Afterwards, I tried with gcc 4.8.5, 7.5.0, and 13.2, with CFLAGS='-g -O2 -fno-tree-vectorize':
./configure --prefix=${WHERE_TO_INSTALL} CFLAGS='-g -O2 -fno-tree-vectorize' --enable-debug --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --enable-opencl --with-cuda=${WHEREIS_CUDA}
- What happened?
module load hwloc/2.9.3
lstopo
and lstopo-no-graphics
return the following error: Segmentation fault (core dumped)
- How did you start your process?
using lstopo
or lstopo-no-graphics
- How did it fail? Crash? Unexpected result?
Segmentation fault (core dumped)
- What happened?
IO phase discovery in component opencl...
Missing separate debuginfo for /lib64/libnvidia-opencl.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/89/f9263438b794b32b423ca59aeaddf5d661ed51.debug
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7de6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 expat-2.1.0-15.el7_9.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 freetype-2.8-14.el7_9.1.x86_64 glibc-2.17-326.el7_9.x86_64 libICE-1.0.9-9.el7.x86_64 libSM-1.2.2-2.el7.x86_64 libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 libglvnd-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-egl-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-glx-1.0.1-0.8.git5baa1e5.el7.x86_64 libpng-1.5.13-8.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 pixman-0.34.0-1.el7.x86_64 xz-libs-5.2.2-2.el7_9.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) p *root
Cannot access memory at address 0x0
(gdb) up
#1 0x00007ffff7def66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
(gdb) p *root
Cannot access memory at address 0x0
(gdb) bt
#0 0x00007ffff7de6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
#1 0x00007ffff7def66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff7deeb8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4 0x00007ffff7965fab in dlopen_doit () from /lib64/libdl.so.2
#5 0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x00007ffff79665ad in _dlerror_run () from /lib64/libdl.so.2
#7 0x00007ffff7966041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8 0x00007fffe9b94c37 in ?? () from /lib64/libnvidia-opencl.so.1
#9 0x00007fffe9b46393 in ?? () from /lib64/libnvidia-opencl.so.1
#10 0x00007fffe9b47e58 in ?? () from /lib64/libnvidia-opencl.so.1
#11 0x00007fffe99caeaa in ?? () from /lib64/libnvidia-opencl.so.1
#12 0x00007fffec686fd5 in ?? () from /usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1
#13 0x00007ffff618420b in __pthread_once_slow () from /lib64/libpthread.so.0
#14 0x00007fffec6888df in clGetPlatformIDs () from /usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1
#15 0x00007fffec88d377 in hwloc_opencl_discover (backend=0x62c470, dstatus=0x7fffffffcd20) at topology-opencl.c:62
#16 0x00007ffff7b776d7 in hwloc_discover_by_phase (topology=0x62b930, dstatus=0x7fffffffcd20, phasename=0x7ffff7bc3569 "IO") at topology.c:3363
#17 0x00007ffff7b77ed6 in hwloc_discover (topology=0x62b930, dstatus=0x7fffffffcd20) at topology.c:3568
#18 0x00007ffff7b78fbc in hwloc_topology_load (topology=0x62b930) at topology.c:4114
#19 0x000000000040b111 in main (argc=0, argv=0x7fffffffd700) at lstopo.c:1687
(gdb) p *root
Cannot access memory at address 0x0
- Lighter configuration command with --disable-opencl:
./configure --prefix=${WHERE_TO_INSTALL} --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --with-cuda=${WHEREIS_CUDA} --disable-opencl
lstopo behaves the same as lstopo-no-graphics and returns without errors:
Machine (376GB total)
Package L#0
NUMANode L#0 (P#0 93GB)
L3 L#0 (36MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#96)
.
.
.
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#92)
PU L#47 (P#188)
HostBridge
PCI 00:11.5 (SATA)
PCI 00:17.0 (SATA)
PCIBridge
PCIBridge
PCI 03:00.0 (VGA)
HostBridge
PCIBridge
PCI 18:00.0 (Ethernet)
Net "em3"
PCI 18:00.1 (Ethernet)
Net "em4"
PCIBridge
PCI 17:00.0 (Ethernet)
Net "em1"
PCI 17:00.1 (Ethernet)
Net "em2"
HostBridge
PCIBridge
PCI 25:00.0 (VGA)
CoProc(CUDA) "cuda0"
GPU(NVML) "nvml0"
HostBridge
PCIBridge
PCI 33:00.0 (SATA)
Block(Disk) "sdb"
.
.
.
Package L#3
NUMANode L#3 (P#3 94GB)
L3 L#3 (36MB)
L2 L#72 (1024KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72
PU L#144 (P#3)
PU L#145 (P#99)
.
.
.
L2 L#95 (1024KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95
PU L#190 (P#95)
PU L#191 (P#191)
HostBridge
PCIBridge
PCI dc:00.0 (NVMExp)
Block(Disk) "nvme1n1"
Misc(MemoryModule)
Misc(MemoryModule)
.
.
.
Misc(MemoryModule)
Misc(MemoryModule)
But I need lstopo's graphical output.
Hello. Do you know if this worked in the past on this machine? With same CUDA release? Does "clinfo" or any other OpenCL outside of hwloc work fine? The crash is very deeply inside NVIDIA's OpenCL libraries.
I cannot reproduce with our CUDA <= 11.7 on different NVIDIA GPUs on CentOS 7.6.
Also please try configuring hwloc without --enable-plugins.
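For checking OpenCL outside of hwloc without clinfo, a minimal standalone probe along these lines can help (a sketch; the file name check.c is illustrative, built with gcc check.c -lOpenCL). It exercises the same clGetPlatformIDs() call that hwloc makes first:

/* check.c - standalone probe of the first OpenCL call hwloc makes
 * (see hwloc_opencl_discover in topology-opencl.c in the backtrace above) */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
  cl_uint nr = 0;
  cl_int err = clGetPlatformIDs(0, NULL, &nr); /* only count the platforms */
  if (err != CL_SUCCESS) {
    fprintf(stderr, "clGetPlatformIDs failed (%d)\n", (int) err);
    return 1;
  }
  printf("found %u platforms\n", (unsigned) nr);
  return 0;
}

If this works when run directly but the same call crashes when made from inside a dlopen'ed plugin, the problem lies in the nested dynamic loading rather than in the OpenCL runtime itself.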
Hi @bgoglin, I'm seeing a similar issue building hwloc 2.9.3, 2.10.0, and allowing OpenMPI 5.0.0 to build its internal hwloc. However, --disable-opencl doesn't avoid the segfault like in the above post. My system has CUDA 12 installed, but no NVIDIA drivers. I've tried disabling nvml, opencl, and cuda while keeping --enable-plugins (as below).
System info:
OS: CentOS Linux release 7.9.2009 (Core)
gcc/g++ version: 8.2.0
Configure command:
./configure --prefix=/pathToBuild/openMPI_5/dependencies/hwloc-2.10.0/install --disable-cuda --disable-nvml --disable-opencl CC=/compilerPath/bin/gcc CXX=/compilerPath/bin/gxx CFLAGS='-g -O2 -fno-tree-vectorize' --enable-debug --enable-plugins
Running gdb ./lstopo:
IO phase discovery in component opencl...
warning: File "[redacted]/gcc/8.2.0.1/lib64/libstdc++.so.6.0.25-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
add-auto-load-safe-path [redacted]/gcc/8.2.0.1/lib64/libstdc++.so.6.0.25-gdb.py
line to your configuration file "[redacted]/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "[redacted]/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
Missing separate debuginfo for /lib64/libnvidia-opencl.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/c2/5558e5242f8bed14af228255432409b5a35cf6.debug
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaaab6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 expat-2.1.0-15.el7_9.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 freetype-2.8-14.el7_9.1.x86_64 glibc-2.17-326.el7_9.x86_64 libICE-1.0.9-9.el7.x86_64 libSM-1.2.2-2.el7.x86_64 libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 libpciaccess-0.14-1.el7.x86_64 libpng-1.5.13-8.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libxcb-1.13-1.el7.x86_64 pixman-0.34.0-1.el7.x86_64 zlib-1.2.7-20.el7_9.x86_64
(gdb) p *root
Cannot access memory at address 0x0
(gdb) up
#1 0x00002aaaaaabf66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0 0x00002aaaaaab6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
#1 0x00002aaaaaabf66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#2 0x00002aaaaaaba7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#3 0x00002aaaaaabeb8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4 0x00002aaaaaf35fab in dlopen_doit () from /lib64/libdl.so.2
#5 0x00002aaaaaaba7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x00002aaaaaf365ad in _dlerror_run () from /lib64/libdl.so.2
#7 0x00002aaaaaf36041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8 0x00002aaab4932b37 in ?? () from /lib64/libnvidia-opencl.so.1
#9 0x00002aaab48e32c7 in ?? () from /lib64/libnvidia-opencl.so.1
#10 0x00002aaab48e73c8 in ?? () from /lib64/libnvidia-opencl.so.1
#11 0x00002aaab476acda in ?? () from /lib64/libnvidia-opencl.so.1
#12 0x00002aaaaaad878d in khrIcdVendorAdd () from [redacted]/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
#13 0x00002aaaaaadccaa in khrIcdOsVendorsEnumerate () from [redacted]/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
#14 0x00002aaaac2a820b in __pthread_once_slow () from /lib64/libpthread.so.0
#15 0x00002aaaaaad9391 in clGetPlatformIDs () from [redacted]/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
#16 0x00002aaaaef23f3b in hwloc_opencl_discover () from [redacted]/hwloc-2.10.0/install/lib/hwloc/hwloc_opencl.so
#17 0x00002aaaaacd7ac0 in hwloc_discover_by_phase (dstatus=dstatus@entry=0x7fffffffbee0, phasename=phasename@entry=0x2aaaaad1b873 "IO", topology=<optimized out>, topology=<optimized out>) at topology.c:3385
#18 0x00002aaaaace03ce in hwloc_discover (dstatus=0x7fffffffbee0, topology=0x628980) at topology.c:3590
#19 hwloc_topology_load (topology=0x628980) at topology.c:4163
#20 0x0000000000405af1 in main () at lstopo.c:1823
#21 0x00002aaaabef6555 in __libc_start_main () from /lib64/libc.so.6
#22 0x000000000040a517 in _start ()
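For context on frames #12-#15 of this backtrace: on the first clGetPlatformIDs() call, the Khronos ICD loader runs a one-time vendor enumeration (hence the pthread_once frame), reading /etc/OpenCL/vendors/*.icd and dlopen'ing each vendor driver named there (here libnvidia-opencl.so.1), which in turn dlopens further libraries. A rough sketch of that enumeration, much simplified from the real loader (build with gcc icd_sketch.c -ldl; file name illustrative):

/* icd_sketch.c - sketch of the Khronos ICD loader's vendor enumeration
 * (simplified; the real khrIcdOsVendorsEnumerate/khrIcdVendorAdd do much more) */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <dlfcn.h>

static void icd_enumerate_vendors(void)
{
  DIR *dir = opendir("/etc/OpenCL/vendors");
  struct dirent *ent;
  char line[4096], path[4352];
  if (!dir)
    return;
  while ((ent = readdir(dir)) != NULL) {
    if (!strstr(ent->d_name, ".icd"))
      continue;
    snprintf(path, sizeof(path), "/etc/OpenCL/vendors/%s", ent->d_name);
    FILE *f = fopen(path, "r");
    if (!f)
      continue;
    /* each .icd file contains the name of a vendor driver,
     * e.g. libnvidia-opencl.so.1 */
    if (fgets(line, sizeof(line), f)) {
      line[strcspn(line, "\n")] = '\0';
      void *h = dlopen(line, RTLD_NOW); /* the dlopen seen in frames #3-#7 */
      if (h)
        printf("loaded vendor %s\n", line);
    }
    fclose(f);
  }
  closedir(dir);
}

int main(void)
{
  icd_enumerate_vendors();
  return 0;
}

The segfault happens in _dl_relocate_object while one of those nested dlopens is being resolved, deep below the vendor driver, which is why there is little hwloc itself can do about it.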
I can confirm that removing --enable-plugins mitigates the segfault. However, I'd like to build a CUDA-aware OpenMPI but cannot guarantee that the system running OpenMPI will have CUDA installed, hence the desire to build CUDA support as a hwloc plugin.
@bgoglin That is the backtrace with --disable-opencl specified in configure CLI options. I can try rebuilding hwloc with the above configure options and capture the output if that would be helpful.
That's strange, hwloc_opencl_discover() cannot be called when OpenCL is disabled. But if the OpenCL plugin was built earlier and you didn't remove the install directory, it will still be loaded. Try rm blabla/hwloc-2.10.0/install/lib/hwloc/hwloc_opencl.so
Thanks - I didn't realize it would stick around, but that's what was happening. I removed the whole hwloc-2.10.0 directory, unpacked from the tgz, and rebuilt with:
1. opencl, cuda, nvml disabled, plugins enabled (lstopo works)
2. opencl disabled, cuda+nvml as plugins (lstopo works)
3. cuda+nvml+opencl as plugins (lstopo doesn't work - segfault)
3a. deleted the hwloc_opencl.so library and lstopo works again
4. cuda+nvml+opencl enabled but not as plugins (lstopo works)
Thanks a lot, at least you have a workaround now. I'll try to find a machine with CUDA12 to debug this OpenCL issue.
I cannot reproduce on RHEL 8.6 with CUDA 12.[012]. I am trying to find a machine with RHEL7 like yours.
Cannot reproduce on RHEL 7.4 with CUDA 12.2 either :(
Ah okay, I appreciate the effort. I can also share the configure/build logs or info about my system if that would be useful? I'm also happy to rebuild for additional debugging efforts on my machine if needed.
I am trying to prepare a small reproducer outside of hwloc. clGetPlatformIDs() is basically the first call we make in hwloc, so there's not much we can debug inside hwloc itself. But it could be an ugly plugin-related issue (I've seen fears of plugin/namespace issues, for instance). It shouldn't crash, but that could explain a failure that isn't properly caught in the OpenCL runtime.
Here is a very simple testcase: opencl.tar.gz
$ tar xf opencl.tar.gz
$ cd opencl
$ make
gcc -Wall -DONLYMAIN common.c -ldl -o main
common.c:2:2: warning: #warning building main [-Wcpp]
2 | #warning building main
| ^~~~~~~
gcc -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC common.c -lOpenCL -o plugin.so
common.c:31:2: warning: #warning building plugin [-Wcpp]
31 | #warning building plugin
| ^~~~~~~
$ ./main
calling plugin_init()
found 1 platforms
Let's see if this crashes on CUDA12/RHEL7 too.
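The tarball's common.c isn't reproduced in the thread; judging from the Makefile rules and output above, it plausibly looks something like the following sketch (a hypothetical reconstruction; the real file may differ in details):

/* common.c - hypothetical reconstruction of the reproducer:
 * one source builds both a main (which dlopens the plugin, mimicking
 * hwloc's plugin loader) and a plugin (which calls clGetPlatformIDs,
 * mimicking hwloc's OpenCL backend). */
#ifdef ONLYMAIN
#warning building main
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
  void *handle = dlopen("./plugin.so", RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  int (*init)(void) = (int (*)(void)) dlsym(handle, "plugin_init");
  if (!init) {
    fprintf(stderr, "dlsym failed: %s\n", dlerror());
    return 1;
  }
  printf("calling plugin_init()\n");
  init();
  dlclose(handle);
  return 0;
}
#endif

#ifdef ONLYPLUGIN
#warning building plugin
#include <stdio.h>
#include <CL/cl.h>

int plugin_init(void)
{
  cl_uint nr = 0;
  /* the same first call hwloc makes in topology-opencl.c */
  clGetPlatformIDs(0, NULL, &nr);
  printf("found %u platforms\n", (unsigned) nr);
  return 0;
}
#endif

On a healthy system this prints "calling plugin_init()" then "found N platforms", as above; on the affected CentOS 7 + CUDA 12 machines, the clGetPlatformIDs() call segfaults inside the nested dlopen.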
Thanks for providing a testcase. I edited the Makefile to add -I/pathToCuda/include to the gcc invocations so the include files were properly found on my system. Once that was fixed, we do get a segfault.
$ make
gcc -I/pathToCuda/include -Wall -DONLYMAIN common.c -ldl -o main
common.c:2:2: warning: #warning building main [-Wcpp]
#warning building main
^~~~~~~
gcc -I/pathToCuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC common.c -lOpenCL -o plugin.so
common.c:31:2: warning: #warning building plugin [-Wcpp]
#warning building plugin
^~~~~~~
$ ./main
calling plugin_init()
Segmentation fault
Adding the path to cuda/include as follows:
all: main plugin.so

main: common.c
	gcc -Wall -DONLYMAIN $< -ldl -o $@

plugin.so: common.c
	gcc -I/usr/local/cuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC $< -lOpenCL -o $@

clean:
	rm -f main plugin.so
make outputs the following:
gcc -Wall -DONLYMAIN common.c -ldl -o main
common.c:2:2: warning: #warning building main [-Wcpp]
#warning building main
^
gcc -I/usr/local/cuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC common.c -lOpenCL -o plugin.so
common.c:31:2: warning: #warning building plugin [-Wcpp]
#warning building plugin
And then ./main ends with the error below:
calling plugin_init()
Segmentation fault (core dumped)
Thank you.
Thanks for testing, I am reporting this to NVIDIA.
The NVIDIA bug report didn't notify me of this reply:
We've checked in house on an exactly matching configuration (CentOS 7.9 + CUDA 12.1 + gcc 4.8.5), but we had no luck reproducing it in house.
However, the stack looks like some GLIBC mismatch to me. Can you please check the following with the 2 reporters?
- Check where and how they installed their local gcc; is it built from source, which could contain headers mismatching the system ones?
- Check the highest GLIBC version the systems support via 'strings /lib64/ld-linux-x86-64.so.2 | grep GLIBC'.
- See if we can catch the ld log before the crash via 'LD_DEBUG=all LD_DEBUG_OUTPUT=./x.log ./main' and upload the log file if it's not empty.
Here are a few answers to the questions.
1. There are 2 gcc versions in my path on the system. Stripping my path down to just one didn't eliminate the segfault. One gcc is maintained by another group at my organization (gcc version 8.2.0) and is the first one in the path. The second is in /usr/bin/gcc (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)) and was presumably installed from a repo.
2. I'm not quite sure what to do with this request, but I checked the RH documentation and it shows 8.x is supported.
3. I ran the command and attached the log file here. We get the segfault printed to the terminal as before.
x.log
Hi @bgoglin, I added a reply to the NVBUG ticket. Please kindly check it. It looks like your registered email is not reachable. Our bug report system will auto-sync a notification when a new comment is added. I also tried informing you via email; it was rejected as below:
Delivery has failed to these recipients or groups:
Brice Goglin (bgoglin@free.fr)
Your message couldn't be delivered. Despite repeated attempts to contact the recipient's email system it didn't respond.
Contact the recipient by some other means (by phone, for example) and ask them to tell their email admin that it appears that their email system isn't accepting connection requests from your email system. Give them the error details shown below. It's likely that the recipient's email admin is the only one who can fix this problem.
For more information and tips to fix this issue see this article: https://go.microsoft.com/fwlink/?LinkId=389361.
Diagnostic information for administrators:
Generating server: CH3PR12MB9395.namprd12.prod.outlook.com
Total retry attempts: 9
bgoglin@free.fr
Remote server returned '550 5.4.300 Message expired -> 451 too many errors detected from your IP (40.107.223.89), please visit http://postmaster.free.fr/'
Reply from the NVIDIA ticket:
I suspect he is calling the Intel OpenCL ICD, whose stack requires libstdc++ 'GLIBCXX_3.4.20'. See his calling stack:
Line 575: 122540: calling init: /lib64/libc.so.6
Line 582: 122540: calling init: /lib64/libdl.so.2
Line 106382: 122540: calling init: /lib64/libpthread.so.0
Line 106385: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libintlc.so.5
Line 106388: 122540: calling init: /lib64/libm.so.6
Line 106391: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
Line 106394: 122540: calling init: ./plugin.so
Line 136285: 122540: calling init: /anotherPathToInstall/gcc/8.2.0.1/lib64/libgcc_s.so.1
Line 136288: 122540: calling init: /anotherPathToInstall/gcc/8.2.0.1/lib64/libstdc++.so.6
Line 136291: 122540: calling init: /lib64/libz.so.1
Line 136294: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/oclfpga/host/linux64/lib/libelf.so.0
Line 136297: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/oclfpga/host/linux64/lib/libalteracl.so
Line 137739: 122540: calling init: /lib64/librt.so.1
122540: checking for version 'GLIBCXX_3.4.20' in file /anotherPathToInstall/gcc/8.2.0.1/lib64/libstdc++.so.6 [0] required by file /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/oclfpga/host/linux64/lib/libalteracl.so [0]
Does this reproduce for the user when using the Intel OpenCL ICD? I suspect it would fail the same way.
@bgoglin Hi, I moved my build processes to a RH8 system and off of the CentOS 7.9 machine. Since then, I haven't been able to reproduce the issue, even though everything should still be using the same OneAPI version/same organization gcc compiler build. The system-wide libraries on the RH8 machine are much newer than the CentOS machine, which might be part of the reason the issue went away. Since I haven't been able to reproduce, I am happy to consider this issue resolved. Thanks for the help!