not-an-aardvark/lucky-commit

Assertion error when linked with intel-compute-runtime

Closed this issue · 5 comments

On Arch Linux, lucky_commit started failing today. I was using the Arch package, then installed it with cargo instead, and it fails all the same:

~/.cargo/bin/lucky_commit 0000000
/usr/include/c++/12.2.0/bits/stl_vector.h:1142: std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](size_type) const [with _Tp = NEO::ArgTypeMetadataExtended; _Alloc = std::allocator<NEO::ArgTypeMetadataExtended>; const_reference = const NEO::ArgTypeMetadataExtended&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
[1]    92025 IOT instruction (core dumped)  ~/.cargo/bin/lucky_commit 0000000

This is on a machine with an Intel graphics card and intel-compute-runtime installed.

If I remove intel-compute-runtime, it works, but, well, it is slooowww 😞

This error seems like it's coming from an out-of-bounds read inside intel-compute-runtime. I'm not sure I'll be able to help much with this given the info provided, since I don't have access to that GPU/OS setup and this seems to work fine on other systems.
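For anyone unfamiliar with the assertion text: `__n < this->size()` is the bounds check in libstdc++'s `std::vector::operator[]`, which aborts the process when the index is past the end and assertions are enabled. A rough Rust sketch of the same situation is below. It is not the driver's code: `ArgTypeMetadata`, `arg_type_name`, and the `"uint*"` value are hypothetical names made up for illustration.

```rust
// Hypothetical stand-in for the driver's per-argument metadata. The real
// NEO::ArgTypeMetadataExtended layout is not shown in this thread.
#[derive(Debug, Clone, PartialEq)]
struct ArgTypeMetadata {
    type_name: String,
}

/// Checked lookup: returns None instead of aborting when `arg_index`
/// is past the end of the table.
fn arg_type_name(table: &[ArgTypeMetadata], arg_index: usize) -> Option<&str> {
    table.get(arg_index).map(|meta| meta.type_name.as_str())
}

fn main() {
    // An empty table models a kernel whose argument metadata is missing.
    let empty: Vec<ArgTypeMetadata> = Vec::new();
    // Writing `empty[0]` here would panic, much like the libstdc++
    // assertion aborts the process in the report above.
    println!("{:?}", arg_type_name(&empty, 0));

    // "uint*" is an illustrative value, not taken from the real kernel.
    let table = vec![ArgTypeMetadata { type_name: "uint*".to_string() }];
    println!("{:?}", arg_type_name(&table, 0));
}
```

The unchecked form (`table[arg_index]`) corresponds to the call that trips the assertion, while the `get`-based form shows how a lookup can report a missing entry to the caller instead of aborting.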

If you're interested in debugging it yourself, it could be useful to get a stack trace by running it under gdb, or to do some analysis of the core dump.

I ran it through gdb, and here is the backtrace. This is where my knowledge stops, though; unfortunately, I have no idea how to take it further.

(gdb) run 0000
Starting program: /usr/bin/lucky_commit 0000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[Detaching after vfork from child process 6637]
[New Thread 0x7ffff6a606c0 (LWP 6708)]
/usr/include/c++/12.2.0/bits/stl_vector.h:1142: std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](size_type) const [with _Tp = NEO::ArgTypeMetadataExtended; _Alloc = std::allocator<NEO::ArgTypeMetadataExtended>; const_reference = const NEO::ArgTypeMetadataExtended&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.

Thread 1 "lucky_commit" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44	     return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7cdd6b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7c8d938 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7c7753d in __GI_abort () at abort.c:79
#4  0x00007ffff6b340a2 in std::__glibcxx_assert_fail (file=<optimized out>, line=<optimized out>, function=<optimized out>, condition=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/debug.cc:60
#5  0x00007ffff6e86d2b in ?? () from /usr/lib/intel-opencl/libigdrcl.so
#6  0x00007ffff6e2bcd8 in ?? () from /usr/lib/intel-opencl/libigdrcl.so
#7  0x00007ffff7f5a2fe in clGetKernelArgInfo () from /usr/lib/libOpenCL.so.1
#8  0x00005555555af71b in ?? ()
#9  0x00005555555af40a in ?? ()
#10 0x00005555555b0cd4 in ?? ()
#11 0x000055555556c5b4 in ?? ()
#12 0x0000555555595c96 in ?? ()
#13 0x0000555555585a63 in ?? ()
#14 0x0000555555594e47 in ?? ()
#15 0x00007ffff7c78290 in __libc_start_call_main (main=main@entry=0x555555594af0, argc=argc@entry=2, argv=argv@entry=0x7fffffffc408) at ../sysdeps/nptl/libc_start_call_main.h:58
#16 0x00007ffff7c7834a in __libc_start_main_impl (main=0x555555594af0, argc=2, argv=0x7fffffffc408, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffc3f8)
    at ../csu/libc-start.c:381
#17 0x0000555555562335 in ?? ()
(gdb)

Thanks, that does help a bit. If you reinstall lucky_commit and compile it in debug mode (cargo install lucky_commit --debug --locked) and run gdb again, does the backtrace change at all? (I'm wondering whether some of those "??" entries in the stack trace would be resolved if the Rust binary were compiled with debug symbols, but I'm not sure exactly how that works.)

It does:

Thread 1 "lucky_commit" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44	     return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7cdd6b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7c8d938 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7c7753d in __GI_abort () at abort.c:79
#4  0x00007ffff6b340a2 in std::__glibcxx_assert_fail (file=<optimized out>, line=<optimized out>, function=<optimized out>, condition=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/debug.cc:60
#5  0x00007ffff6e86d2b in ?? () from /usr/lib/intel-opencl/libigdrcl.so
#6  0x00007ffff6e2bcd8 in ?? () from /usr/lib/intel-opencl/libigdrcl.so
#7  0x00007ffff7f5a2fe in clGetKernelArgInfo () from /usr/lib/libOpenCL.so.1
#8  0x00005555556b1a6c in ocl_core::functions::get_kernel_arg_info (obj=0x7fffffff69a8, arg_index=0, request=ocl_core::KernelArgInfo::TypeName, device_versions=...)
    at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/ocl-core-0.11.2/src/functions.rs:2197
#9  0x0000555555625154 in ocl::standard::kernel::arg_info (core=0x7fffffff69a8, arg_idx=0, info_kind=ocl_core::KernelArgInfo::TypeName)
    at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/ocl-0.19.3/src/standard/kernel.rs:1499
#10 0x000055555562520c in ocl::standard::kernel::arg_type_name (core=0x7fffffff69a8, arg_idx=0) at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/ocl-0.19.3/src/standard/kernel.rs:1505
#11 0x00005555556284bb in ocl::standard::kernel::arg_type::ArgType::from_kern_and_idx (core=0x7fffffff69a8, arg_idx=0)
    at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/ocl-0.19.3/src/standard/kernel.rs:1665
#12 0x0000555555624046 in ocl::standard::kernel::KernelBuilder::build (self=0x7fffffff90f8) at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/ocl-0.19.3/src/standard/kernel.rs:1432
#13 0x00005555555c7317 in lucky_commit::HashSearchWorker<lucky_commit::Sha1>::search_with_gpu<lucky_commit::Sha1> (self=...)
    at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/lucky_commit-2.2.1/src/lib.rs:403
#14 0x00005555555cded5 in lucky_commit::HashSearchWorker<lucky_commit::Sha1>::search<lucky_commit::Sha1> (self=...) at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/lucky_commit-2.2.1/src/lib.rs:283
#15 0x000055555557d373 in lucky_commit::run_lucky_commit<lucky_commit::Sha1> (existing_commit=..., maybe_prefix=...) at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/lucky_commit-2.2.1/src/bin.rs:49
#16 0x000055555557d184 in lucky_commit::main () at /home/mat/.cargo/registry/src/github.com-1ecc6299db9ec823/lucky_commit-2.2.1/src/bin.rs:36

An update of the intel-compute-runtime package fixed this.