Crash with SYCL
reguly opened this issue · 11 comments
Observed Behavior
When I try to intercept OpenCL calls coming from the intel/llvm project (specifically the "simple SYCL application" described in its getting-started instructions), I get an assertion failure with the following stack trace:
#0 __GI_raise (sig=5) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff7b987eb in dummyGetPlatformIDs (num_entries=2, platforms=0x7fffffffc240,
num_platforms=0x0) at /rr-home/istvan/opencl-intercept-layer/intercept/src/stubs.cpp:34
#2 0x00007ffff7aafe9a in clGetPlatformIDs (num_entries=2, platforms=0x7fffffffc240,
num_platforms=0x0) at /rr-home/istvan/opencl-intercept-layer/intercept/src/dispatch.cpp:43
#3 0x00007ffff6f76d7c in cl::sycl::detail::pi::OclpiPlatformsGet(unsigned int, _pi_platform**, unsigned int*) () from /rr-home/shared/istvan/intel_llvm2/lib/libsycl.so
#4 0x00007ffff6f8e3d8 in cl::sycl::detail::platform_impl_pi::get_platforms() ()
from /rr-home/shared/istvan/intel_llvm2/lib/libsycl.so
#5 0x00007ffff6fb87ae in cl::sycl::platform::get_platforms() ()
from /rr-home/shared/istvan/intel_llvm2/lib/libsycl.so
#6 0x00007ffff6fb6418 in cl::sycl::device::get_devices(cl::sycl::info::device_type) ()
from /rr-home/shared/istvan/intel_llvm2/lib/libsycl.so
#7 0x00007ffff6fb6838 in cl::sycl::device_selector::select_device() const ()
from /rr-home/shared/istvan/intel_llvm2/lib/libsycl.so
#8 0x0000000000406120 in cl::sycl::queue::queue(cl::sycl::device_selector const&, cl::sycl::property_list const&) ()
Desired Behavior
I'd like to see OpenCL kernels and API calls logged
Steps to Reproduce
Tested on several platforms, including Ubuntu 18 and Debian 9.9, with CPU and Intel GPU targets.
cliloader -d ./a.out
Running the code without cliloader works fine.
Hi @reguly, would it be possible to attach your intercept layer log? You can set LogToFile to collect this, or just capture the log from stdout/stderr.
I'm most interested in the first few lines, since this behavior is usually caused when the intercept layer can't find the "real" libOpenCL.so. If the intercept layer isn't finding your "real" libOpenCL.so automatically you can provide the path manually using the DllName controls - which I really need to rename to be less Windows-centric.
I use the intercept layer regularly with SYCL, both from Intel and other vendors, so it definitely works. Thanks!
Quick update: I changed the control name to OpenCLFileName in 959bced. The old control name will still work, but I think the new control name makes a lot more sense on non-Windows operating systems.
I usually use ldd to find the full absolute path to the "real" libOpenCL.so. I'm a little surprised that things didn't just work out of the box on Ubuntu 18, since that's the distro I use and test with regularly. Are you using a standard mechanism to install libOpenCL.so? If I need to add another default location to search that's easy enough to do. Thanks!
So the DllName control solves the issue - on Debian, I created the clintercept.conf file, and put
DllName=/opt/...../libOpenCL.so.1
On the Ubuntu system however, even though it does find the file, it seems to ignore that control:
CLIntercept environment variable prefix: CLI_
CLIntercept config file: clintercept.conf
Trying to load dispatch from: ./real_libOpenCL.so
Couldn't load library: ./real_libOpenCL.so
Trying to load dispatch from: /usr/lib/x86_64-linux-gnu/libOpenCL.so
Couldn't get exported function pointer to: clSetDefaultDeviceCommandQueue
Couldn't get exported function pointer to: clGetDeviceAndHostTimer
Couldn't get exported function pointer to: clGetHostTimer
Couldn't get exported function pointer to: clCreateProgramWithIL
Couldn't get exported function pointer to: clCloneKernel
Couldn't get exported function pointer to: clGetKernelSubGroupInfo
Couldn't get exported function pointer to: clEnqueueSVMMigrateMem
Couldn't get exported function pointer to: clSetProgramReleaseCallback
Couldn't get exported function pointer to: clSetProgramSpecializationConstant
But if I put the control as an environment variable, it does work:
CLI_DllName=/opt/.../libOpenCL.so.1 ../opencl-intercept-layer/install/bin/cliloader -d ./a.out
(On the Ubuntu system I use the OneAPI beta drivers, though I'm not sure why that should make a difference)
Great, that's good to hear.
In the Ubuntu snip above it looks like the intercept layer is finding a libOpenCL.so, but it's an older OpenCL 2.0 lib. This is why the OpenCL 2.1 and newer entry points are not found. This is OK and everything should still work fine, you just won't be able to intercept or use any of the OpenCL 2.1 or newer APIs. Do you still see a crash on Ubuntu in this case?
I'd still like to understand why the config file isn't being found on Ubuntu, though. On your Ubuntu system is there anything strange going on with users or home directories? The directory to search for the config file is found by looking up the user's home directory via the "HOME" environment variable, so if this environment variable is missing or set incorrectly the config file may not be found.
Are you able to set any other controls via the config file?
Thanks!
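The log lines discussed above can be summarized mechanically. The following is a minimal Python sketch, assuming the log uses exactly the three message prefixes seen in this thread; the function name summarize_intercept_log is hypothetical and not part of the intercept layer.

```python
# Prefixes copied from the intercept layer log shown in this thread.
TRY = "Trying to load dispatch from: "
FAIL = "Couldn't load library: "
MISSING = "Couldn't get exported function pointer to: "


def summarize_intercept_log(log_text):
    """Return (loaded_library_or_None, missing_entry_points) for a log."""
    attempted, failed, missing = [], set(), []
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith(TRY):
            attempted.append(line[len(TRY):])
            missing = []  # missing entry points belong to the latest attempt
        elif line.startswith(FAIL):
            failed.add(line[len(FAIL):])
        elif line.startswith(MISSING):
            missing.append(line[len(MISSING):])
    # The library in use is the last attempted path that did not fail to load.
    loaded = next((p for p in reversed(attempted) if p not in failed), None)
    return loaded, missing
```

If the returned library is None, no "real" libOpenCL.so was loaded at all; if the list of missing entry points contains a function the application calls (here clCreateProgramWithIL), a crash under the intercept layer is the likely outcome.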
I still do get a crash on Ubuntu, because SYCL is calling clCreateProgramWithIL, which is not found in the libOpenCL.so that was loaded.
The config file is my mistake again, I put it in the current directory instead of the home directory, apologies... Moving it into the home fixes the issue.
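To check where the config file is expected before running, the lookup described earlier in this thread (the user's home directory, taken from the HOME environment variable) can be mirrored in a few lines. This is a sketch of that description only, not the intercept layer's actual code; find_config is a hypothetical helper.

```python
import os


def find_config(filename="clintercept.conf", env=None):
    """Return (path_or_None, note) for the config file under $HOME."""
    env = os.environ if env is None else env
    home = env.get("HOME")
    if not home:
        return None, "HOME is not set"
    in_home = os.path.join(home, filename)
    if os.path.isfile(in_home):
        return in_home, "found in home directory"
    if os.path.isfile(filename):
        # This is the situation reported in this thread.
        return None, "found only in the current directory"
    return None, "not found"


if __name__ == "__main__":
    path, note = find_config()
    print(path, "-", note)
```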
> The config file is my mistake again, I put it in the current directory instead of the home directory, apologies...
No problem, glad to hear it's working now!
Do you have any suggestions to improve logging or documentation so it's less likely that users encounter a similar issue in the future? If not, can we close this issue? Thanks!
Thank you for the help!
It would be helpful to give an explicit error or warning. On neither platform was it obvious to me from the logs that I should expect this not to work. There is no error message at the location of the crash, and as a user I don't know that the software stack was trying to use a function pointer that was not found in the original lib (clCreateProgramWithIL), so I didn't know that the warning about that particular function pointer was a problem.
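Until such a warning exists, a library can be checked up front for the entry points an application needs. The sketch below uses Python's standard ctypes module to load a shared library and report which of the given names it does not export; missing_entry_points is a hypothetical helper, and the library path passed to it is whatever the log reports as the dispatch library.

```python
import ctypes


def missing_entry_points(library_path, names):
    """Return the names from `names` that the library does not export.

    Passing None as library_path inspects the current process instead
    (POSIX only). ctypes.CDLL raises OSError if the library cannot be loaded.
    """
    lib = ctypes.CDLL(library_path)
    # Attribute lookup on a CDLL resolves the symbol and fails with
    # AttributeError when it is absent, which hasattr turns into False.
    return [name for name in names if not hasattr(lib, name)]
```

For example, calling missing_entry_points("/usr/lib/x86_64-linux-gnu/libOpenCL.so", ["clGetPlatformIDs", "clCreateProgramWithIL"]) on the Ubuntu system from this thread would be expected to list clCreateProgramWithIL, matching the log.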
> I don't know that the software stack was trying to use a function pointer that was not found in the original lib (clCreateProgramWithIL),
Hmm, I wonder if there is something still going on here. In general, while it's possible that a crash or segfault will occur when using the intercept layer (due to a missing function, or for some other reason), the same crash or segfault should also occur without the intercept layer.
Assuming your SYCL program works without the intercept layer but crashes with it, could you please check:
- Do you have multiple OpenCL ICD loaders (libOpenCL.so) installed on your system? Actually, based on a few messages above, it looks like this is the case: the log snippet found the lib /usr/lib/x86_64-linux-gnu/libOpenCL.so, albeit without clCreateProgramWithIL, whereas passing a different lib in /opt via an environment variable found clCreateProgramWithIL and worked?
- If so, which ICD loader gets used by default, without the intercept layer? The easiest way to find this out is to run ldd <your app name>. I'm guessing it's not the one in /usr/lib/...
Thanks!
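The ldd check above can also be scripted. A minimal Python sketch, assuming the usual "name => path (address)" layout of ldd output; opencl_loader_from_ldd is a hypothetical helper name.

```python
def opencl_loader_from_ldd(ldd_output):
    """Return the path ldd resolved for libOpenCL.so*, or None."""
    for line in ldd_output.splitlines():
        name, sep, rest = line.strip().partition(" => ")
        if not sep or not name.startswith("libOpenCL.so"):
            continue
        if rest.startswith("not found"):
            return None
        parts = rest.split()
        return parts[0] if parts else None
    return None


if __name__ == "__main__":
    import subprocess
    import sys

    # Usage: python this_script.py ./a.out
    output = subprocess.run(
        ["ldd", sys.argv[1]], capture_output=True, text=True, check=True
    ).stdout
    print(opencl_loader_from_ldd(output))
```

If the printed path differs from the one after the last "Trying to load dispatch from:" line in the intercept layer log, the application and the intercept layer are using different ICD loaders, and the path printed here is the one to set through the DllName (or OpenCLFileName) control.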
I do probably have multiple loaders, one that came with the basic installation, and one that came with the OneAPI toolkit. On the Ubuntu system:
ldd a.out
libOpenCL.so.1 => /opt/intel/inteloneapi/compiler/latest/linux/lib/libOpenCL.so.1
and so without the intercept layer it runs fine.
But with the intercept layer that is not the file being loaded, leading to the crash:
CLIntercept (64-bit) is loading...
CLintercept file location: /home/ireguly/opencl-intercept-layer/install/bin/../lib/libOpenCL.so
CLIntercept URL: https://github.com/intel/opencl-intercept-layer
CLIntercept git description: v2.2.1-154-g959bced
CLIntercept git refspec: refs/heads/master
CLInterecpt git hash: 959bced
CLIntercept optional features:
cliloader(supported)
cliprof(supported)
kernel overrides(supported)
ITT tracing(supported)
MDAPI(supported)
CLIntercept environment variable prefix: CLI_
CLIntercept config file: clintercept.conf
Trying to load dispatch from: ./real_libOpenCL.so
Couldn't load library: ./real_libOpenCL.so
Trying to load dispatch from: /usr/lib/x86_64-linux-gnu/libOpenCL.so
Couldn't get exported function pointer to: clSetDefaultDeviceCommandQueue
Couldn't get exported function pointer to: clGetDeviceAndHostTimer
Couldn't get exported function pointer to: clGetHostTimer
Couldn't get exported function pointer to: clCreateProgramWithIL
Couldn't get exported function pointer to: clCloneKernel
Couldn't get exported function pointer to: clGetKernelSubGroupInfo
Couldn't get exported function pointer to: clEnqueueSVMMigrateMem
Couldn't get exported function pointer to: clSetProgramReleaseCallback
Couldn't get exported function pointer to: clSetProgramSpecializationConstant
... success!
ReportToStderr is set to a non-default value!
DevicePerformanceTiming is set to a non-default value!
Timer Started!
... loading complete.
Running on Intel(R) Gen9 HD Graphics NEO
Trace/breakpoint trap (core dumped)
The Debian system shows a similar issue, though from the logs there I am not sure which loader it ends up using:
ldd a.out
libOpenCL.so.1 => /opt/intel/opencl-scyl-experimental-9defd0a/oclcpuexp/x64/libOpenCL.so.1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
CLIntercept (64-bit) is loading...
CLintercept file location: /rr-home/istvan/opencl-intercept-layer/install/bin/../lib/libOpenCL.so
CLIntercept URL: https://github.com/intel/opencl-intercept-layer
CLIntercept git description: v2.2.1-154-g959bced
CLIntercept git refspec: refs/heads/master
CLInterecpt git hash: 959bced
CLIntercept optional features:
cliloader(supported)
cliprof(supported)
kernel overrides(supported)
ITT tracing(supported)
MDAPI(supported)
CLIntercept environment variable prefix: CLI_
CLIntercept config file: clintercept.conf
Trying to load dispatch from: ./real_libOpenCL.so
Couldn't load library: ./real_libOpenCL.so
Trying to load dispatch from: /usr/lib/x86_64-linux-gnu/libOpenCL.so
Couldn't load library: /usr/lib/x86_64-linux-gnu/libOpenCL.so
Trying to load dispatch from: /opt/intel/opencl/lib64/libOpenCL.so
Couldn't load library: /opt/intel/opencl/lib64/libOpenCL.so
ReportToStderr is set to a non-default value!
Timer Started!
... loading complete.
Trace/breakpoint trap
Hi @reguly , I just added a troubleshooting / FAQ document that covers the problem described in this issue and steps to identify and fix it. Take a look and let me know if it's helpful - thanks!
That is helpful, thank you!