Improve OpenCL error handling

Question

Improve OpenCL error handling

Closed this issue 3 years ago · 12 comments

Currently OpenCL errors are just reported as the OpenCL Error code from cl.h.
It would be more helpful if OpenCL errors could provide the OpenCL name of the error as provided by the ClError type from the cl3 library.

Answer 1 · 2021-04-06T10:26:32.000Z

I think it would be useful to have even richer error types in opencl3, e.g. attaching some error information from OpenCL. Due to opencl3 being a bit of a higher level library it's sometimes not possible to get all the underlying information. Let me give an example:

With opencl3 you can build a propgram via Context::build_program_from_source(). That might fail, which leads to a -11 return value. You could then check the build log via Program::get_build_log(). The problem is that you need access to program, but program is only available inside the Context::build_program_from_source() method, you cannot access it from the outside. Therefore you cannot get the build log through this API.

Hence it would be great if the error returned from Context::build_program_from_source() could contain more than just a single integer value.

There are several ways, depending on how high-level the error handling in opencl3 should become. I would prefer if it would be high level enough, so that you won't need to do many things manually. In this case it would mean that in case of an error, you would retrieve the build log within the opencl3 code and attach the log to the error message.

A lower level approach would be to attach the instance of the program that failed to the error. This way you could get the build log if you want from the program.

The lower level approach seems cleaner/simpler, but if you think about how people would use that API, I would rather expect that either you don't care about the exact error (then the higher level approach is also fine), or you want the build log (what else would you want from the program), so why not returning the build log directly?

Answer 2 · 2021-04-06T17:58:15.000Z

Thank you @vmx, you raise a good point about the error logs from Context::build_program_from_source() and I'm sure that there are issues with other Context methods too.

I'm not happy with the current design of Context containing Programs, Kernels and Command Queues which leads to this kind of issue. And although I understand your desire to return build logs as error messages (for example), it is not my preference.

I shall think it over and see if I can come up with a better design next weekend.

Answer 3 · 2021-04-07T08:20:26.000Z

I'm not happy with the current design of Context containing Programs, Kernels and Command Queues which leads to this kind of issue. And although I understand your desire to return build logs as error messages (for example), it is not my preference.

Not doing the error message as part of the error is fine with me. My proposals were only based on the current design of the Context. I don't know if it helps, the code I'm working on had a similar abstraction as your Context, which might be an indication that those higher level abstractions are application specific (e.g. do you have a single queue or multiple) and perhaps opencl3 would then only provide the building blocks.

Answer 4 · 2021-04-11T11:37:59.000Z

@vmx I have redesigned Context in the develop branch.

Context now only holds the list of devices which it was created for. This solves other issues with circular dependencies and function duplication. It is now a "lower level" interface providing "building blocks" but hopefully, it is a more: flexible, useful and consistent interface for most users.

Answer 5 · 2021-04-11T18:52:40.000Z

@kenba Thanks. I've ported my code to the current code of the develop branch and it seems to work (it compiles, I'll run tests later this week). The new code is now more similar to how ocl does things, so it might make it also for other people porting their code easier.

This also means that my comment about including the build error with the error message is no longer applicable.

Answer 6 · 2021-04-12T18:46:29.000Z

Thanks @vmx. Let me know if you have any problems running the tests.
I'll look into ocl and see if there are further changes I can make to ease porting.

Answer 7 · 2021-04-13T08:50:56.000Z

The testing went well with the develop branch.

Answer 8 · 2021-04-17T11:57:19.000Z

That's good news about the testing @vmx.

I've now changed Program to store the names of the OpenCL kernels in the program. This change requires the compile, link and build methods to take mutable references, which they should have done in any case, since they change the state of Program.

It is now preferred to use methods such as create_and_build_from_source to create a Program. These methods return errors as strings (as you requested) so that the build log is returned as an error when there is a program build failure.

Answer 9 · 2021-04-20T12:00:13.000Z

Thanks adding a function that also return the build log.

I've a point about error handling in general. Returning just a string is often not ideal (some good points are also here https://blog.burntsushi.net/rust-error-handling/#defining-your-own-error-type). When you want to deal with an error on the application side, it's often useful to know where the error comes from and you can often to this by type.

From the code I'm working on, we wrap opencl3 errors in our own errors. Example:

pub enum GpuError {
    Opencl3(ClError, String),
    …
}

impl From<ClError> for GpuError {
    fn from(error: ClError) -> Self {
        GpuError::Opencl3(error, "".to_string())
    }
}

I could now implement impl From<String> for GpuError and also return an GpuError::Opencl3() which contains the string as error message. The problem here is that now every error that is a String could be converted. So if another library does the same, there is no way to distinguish them. If opencl3 wrap that string somehow, they could be distinguished.

All that isn't a big deal, what I currently do, is doing the wrapping when the error occurs (though kind of automated wrapping would be nice. My code is:

opencl3::program::Program::create_and_build_from_source(&context, src, "")
   .map_err(|error| {
       GPUError::Opencl3(
           opencl3::error_codes::ClError(
               opencl3::error_codes::CL_BUILD_PROGRAM_FAILURE,
           ),
           error,
       )
   })?;

Also note that using a string only uses the original error code from the message. I know you could use create_from_source() and build() separately, but I think that convenience method could still contain all that information.

For now I don't propose any specific implementation for a different error structure, if you like to hear my thoughts on that, I can surely provide some.

Answer 10 · 2021-04-24T11:00:44.000Z

Thank you for sharing your thoughts @vmx. Here are my thoughts.

I designed the opencl3 library to provide a simple, object-based model of the OpenCL API. My intention was to provide an easy-to-use API layer over the OpenCL API. In practice there are two layers: the functional layer provided by the cl3 library and the object-based layer provided by this library.

I have tried NOT to obfuscate the underlying OpenCL API wherever possible. So all the cl3 functions simply return the OpenCL error code, while all the opencl3 functions and methods return the OpenCL error code wrapped in a ClError so the error code can be converted to a text string using the Display trait.

The only methods that don’t return ClErrors are the Program::create_and_build_from_source(s) methods which return Strings containing the build log as you suggested in one of your previous comments Volker. I am happy for them to just return ClError (not the build log) since it would be more consistent. Is that what you want?

As stated earlier: this is a "lower level" interface providing "building blocks". Please feel free to build whatever error structure you want on top of it. I don’t want to change the current error structure other than maybe implementing the std::error::Error trait in the cl3 library as recommended in Andrew Gallant’s blog article that you referenced.

Answer 11 · 2021-04-26T09:05:20.000Z

The only methods that don’t return ClErrors are the Program::create_and_build_from_source(s) methods which return Strings containing the build log as you suggested in one of your previous comments Volker. I am happy for them to just return ClError (not the build log) since it would be more consistent. Is that what you want?

I see two ways:

opencl3 provides some useful higher level methods like create_and_build_from_sources(). But then it should wrap the error in its own structured error type. E.g. something like ClErrorWithMessage(cl_int, String) (the ClError could also become an enum, the details don't matter, it's about the general idea). So that libraries building on top opencl3 can deal with those errors easily.
opencl3 keeps things quite low level, and you won't have a create_and_build_from_sources() function at all, it will always be two steps. You create the program first and then build it.

What I wouldn't want to see is a create_and_build_from_sources() that doesn't return the build log. Then we would be back to my original comment, where you cannot get back the build log yourself.

From how I understand opencl3 (also based on out discussions) it seems to me that (2) is the better solution for opencl3. Introducing a new error type for just a single functionality seems bloated. If it turns out there are more cases that could leverage richer error types (other than just returning an int), it might could be the time to revisit that decision.

Answer 12 · 2021-04-27T18:58:07.000Z

Thank you @vmx, I think that you are right and that the library may be better off without the create_and_build_from_sources() functions. As it stands they are only valid for a context with one device because they only return the build log from the first device...

I think that it is a useful convenience for someone who doesn't care about error types and simply wants to get OpenCL running on a single device. However, you would definitely be better off calling the create, build and maybe get_build_log functions separately to get consistent error types.