`clCommandNDRangeKernelKHR` does not return `CL_INVALID_WORK_GROUP_SIZE` when invalid work size are passed

Question

`clCommandNDRangeKernelKHR` does not return `CL_INVALID_WORK_GROUP_SIZE` when invalid work size are passed

mfrancepillois opened this issue a year ago · 5 comments

While testing the Command Buffer Emulation layer, I noticed that clCommandNDRangeKernelKHR does not return CL_INVALID_WORK_GROUP_SIZE when invalid work size is passed whereas clEnqueueNDRangeKernel returns it.
When using the Command Buffer Emulation layer this error code is actually returned when calling clEnqueueCommandBufferKHR.

Test case

I set up a simple test based on 04Julia sample code to show this problem:

#include <popl/popl.hpp>

#define STB_IMAGE_WRITE_IMPLEMENTATION
#include <stb/stb_image_write.h>

#include <CL/opencl.hpp>

#include <chrono>

const char *filename = "julia.bmp";

const float cr = -0.123f;
const float ci = 0.745f;

static const char kernelString[] = R"CLC(
kernel void Julia( global uchar4* dst, float cr, float ci )
{
    const float cMinX = -1.5f;
    const float cMaxX =  1.5f;
    const float cMinY = -1.5f;
    const float cMaxY =  1.5f;

    const int cWidth = get_global_size(0);
    const int cIterations = 16;

    int x = (int)get_global_id(0);
    int y = (int)get_global_id(1);

    float a = x * ( cMaxX - cMinX ) / cWidth + cMinX;
    float b = y * ( cMaxY - cMinY ) / cWidth + cMinY;

    float result = 0.0f;
    const float thresholdSquared = cIterations * cIterations / 64.0f;

    for( int i = 0; i < cIterations; i++ ) {
        float aa = a * a;
        float bb = b * b;

        float magnitudeSquared = aa + bb;
        if( magnitudeSquared >= thresholdSquared ) {
            break;
        }

        result += 1.0f / cIterations;
        b = 2 * a * b + ci;
        a = aa - bb + cr;
    }

    result = max( result, 0.0f );
    result = min( result, 1.0f );

    // RGBA
    float4 color = (float4)( result, sqrt(result), 1.0f, 1.0f );

    dst[ y * cWidth + x ] = convert_uchar4(color * 255.0f);
}
)CLC";

void test(int platformIndex, int deviceIndex, size_t gwx, size_t gwy,
          size_t lwx, size_t lwy) {
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);

  printf("Running on platform: %s\n",
         platforms[platformIndex].getInfo<CL_PLATFORM_NAME>().c_str());

  std::vector<cl::Device> devices;
  platforms[platformIndex].getDevices(CL_DEVICE_TYPE_ALL, &devices);

  printf("Running on device: %s\n",
         devices[deviceIndex].getInfo<CL_DEVICE_NAME>().c_str());

  cl::Context context{devices[deviceIndex]};
  cl::CommandQueue commandQueue =
      cl::CommandQueue{context, devices[deviceIndex]};

  cl::Program program{context, kernelString};
  program.build();
  cl::Kernel kernel = cl::Kernel{program, "Julia"};

  cl::Buffer deviceMemDst =
      cl::Buffer{context, CL_MEM_ALLOC_HOST_PTR, gwx * gwy * sizeof(cl_uchar4)};

  // execution
  {
    kernel.setArg(0, deviceMemDst);
    kernel.setArg(1, cr);
    kernel.setArg(2, ci);

    auto ResEnqueue = commandQueue.enqueueNDRangeKernel(
        kernel, cl::NullRange, cl::NDRange{gwx, gwy}, cl::NDRange{lwx, lwy});
    std::cout << "Result enqueueNDRangeKernel = " << ResEnqueue << std::endl;

    // Ensure all processing is complete before stopping the timer.
    commandQueue.finish();

    cl_command_buffer_khr cmdbuf =
        clCreateCommandBufferKHR(1, &commandQueue(), NULL, NULL);

    cl_sync_point_khr sync_point;
    auto ResAppend = clCommandNDRangeKernelKHR(
        cmdbuf, NULL, NULL, kernel(), 2, NULL, cl::NDRange{gwx, gwy},
        cl::NDRange{lwx, lwy}, 0, NULL, &sync_point, NULL);
    clFinalizeCommandBufferKHR(cmdbuf);

    auto ResEnqueueCB =
        clEnqueueCommandBufferKHR(0, NULL, cmdbuf, 0, NULL, NULL);

    std::cout << "Result clCommandNDRangeKernelKHR = " << ResAppend
              << std::endl;
    std::cout << "Result clEnqueueCommandBufferKHR = " << ResEnqueueCB
              << std::endl;
  }
}

int main(int argc, char **argv) {
  int platformIndex = 0;
  int deviceIndex = 0;

  {
    popl::OptionParser op("Supported Options");
    op.add<popl::Value<int>>("p", "platform", "Platform Index", platformIndex,
                             &platformIndex);
    op.add<popl::Value<int>>("d", "device", "Device Index", deviceIndex,
                             &deviceIndex);

    bool printUsage = false;
    try {
      op.parse(argc, argv);
    } catch (std::exception &e) {
      fprintf(stderr, "Error: %s\n\n", e.what());
      printUsage = true;
    }

    if (printUsage || !op.unknown_options().empty() ||
        !op.non_option_args().empty()) {
      fprintf(stderr,
              "Usage: julia [options]\n"
              "%s",
              op.help().c_str());
      return -1;
    }
  }

  std::cout << "Valid Sizes:" << std::endl;
  test(platformIndex, deviceIndex, 512, 512, 8, 8);

  std::cout << "Invalid Sizes:" << std::endl;
  test(platformIndex, deviceIndex, 8, 8, 16, 16);

  return 0;
}

Output

$> OPENCL_LAYERS=../../layers/10_cmdbufemu/libCmdBufEmu.so ./julia -p 3

**Valid Sizes:**
Running on platform: Intel(R) OpenCL
Running on device: 12th Gen Intel(R) Core(TM) i9-12900K
Result enqueueNDRangeKernel = 0
Result clCommandNDRangeKernelKHR = 0
Result clEnqueueCommandBufferKHR = 0
**Invalid Sizes:**
Running on platform: Intel(R) OpenCL
Running on device: 12th Gen Intel(R) Core(TM) i9-12900K
Result enqueueNDRangeKernel = **-54**
Result clCommandNDRangeKernelKHR = **0**
Result clEnqueueCommandBufferKHR = **-54**

Request

I understand from the specification that clCommandNDRangeKernelKHR should return the same errors as clEnqueueNDRangeKernel, except for a few, but CL_INVALID_WORK_GROUP_SIZE should not be one of them.

If so, could you please fix this issue?

Answer 1 · 2024-01-02T00:06:53.000Z

Thanks for the report! I agree that the error should be returned by clCommandNDRangeKernelKHR and not clEnqueueCommandBufferKHR but it's going to be a little challenging for the command buffer emulation layer to implement this. The reason why is because the call to clCommandNDRangeKernelKHR doesn't really do much OpenCL-wise and instead just records the call and its arguments for later. This is also why the error is raised on the call to clEnqueueCommandBufferKHR, because this is when clEnqueueNDRangeKernel actually gets called.

I was hoping there would be some other proxy call I could make that might catch this type of error, but unfortunately I don't see any other calls that could identify CL_INVALID_WORK_GROUP_SIZE. I think this means if we want to catch these errors in the layer we'll need to duplicate the error checking in the layer itself, but I'm a little worried this could be tricky and error-prone. As an example, the "invalid sizes" are actually valid if the device supports non-uniform work-group sizes and the kernel was compiled for OpenCL 2.0 or newer!

Can you think of any other clever way to do this error checking? If not, how important is it that this error gets properly returned by clCommandNDRangeKernelKHR?

Answer 2 · 2024-01-02T23:23:54.000Z

Can you think of any other clever way to do this error checking?

Here's a bit of a crazy idea that might work and avoid duplicate error checking, but also seems a little tricky and error-prone:

Create one or more "test queues" during command buffer creation, one for each command queue that the command buffer will record to. Each "test queue" will derive its properties from the passed-in command queues. For example, the "test queue" will be created against the same device as the passed-in command queues, and perhaps using some or all of command queue properties for each of the passed-in command queues.
Also create a user event during command buffer creation, and enqueue a barrier dependent upon the user event into each command queue.
For every command recorded into the command buffer, also enqueue it to the "test queue". This should perform error checking for things like CL_INVALID_WORK_GROUP_SIZE, but because the command in the "test queue" is dependent upon the user event it will not execute.
As part of command buffer finalization, set the user event to an error state, which should cause all commands enqueued into the "test queue" to be terminated. Then, free the user event and all of the "test queues".

I think this is probably simpler than duplicating the error checking in the command buffer emulation layer, but I'm still not convinced it's worth the complexity.

Answer 3 · 2024-01-07T23:37:05.000Z

I have a version of the idea described above implemented and it seems to be working:

https://github.com/bashbaug/SimpleOpenCLSamples/compare/cmdbuf-emu-test-queues

I think this is the expected output with proper error checking?

$ OPENCL_LAYERS=./libCmdBufEmu.so LD_PRELOAD=../lib/libOpenCL.so.1 ./juliacmdbuf -p2
Valid Sizes:
Running on platform: Intel(R) OpenCL HD Graphics
Running on device: Intel(R) Arc(TM) A750 Graphics
Result enqueueNDRangeKernel = 0
Result clCommandNDRangeKernelKHR = 0
Result clEnqueueCommandBufferKHR = 0
Invalid Sizes:
Running on platform: Intel(R) OpenCL HD Graphics
Running on device: Intel(R) Arc(TM) A750 Graphics
Result enqueueNDRangeKernel = -54
Result clCommandNDRangeKernelKHR = -54
Result clEnqueueCommandBufferKHR = 0

Note, I'm still a little nervous about enabling this by default, so you'll need to explicitly enable it by setting g_cEnhancedErrorChecking to true in the emulation layer main.cpp. If this solves your problem I can look into a more friendly way to enable the "enhanced error checking", either dynamically (enviroment variable?) or via a CMake build option.

Note also, one of the GPUs I regularly test with appears to be outputting some information to stdout when the commands in the test queue are terminated. The output is coming from the driver so there isn't anything the layer can do to suppress it. The programs I am running to test with seem to be working fine even with the extra output, but it can be a little surprising. The output looks like:

CL_DEVICE_NOT_FOUND error executing CL_COMMAND_NDRANGE_KERNEL on <device name> (Device <device number>).

Answer 4 · 2024-01-18T14:33:16.000Z

I tried the cmdbuf-emu-test-queues branch out with the SYCL-Graph tests we had which motivated this issue, and setting g_cEnhancedErrorChecking does indeed fix the issues. See

The idea of having a test queue I think makes sense, rather than duplicating all the error checking code which might not even be possible to catch all the CL_INVALID_WORK_GROUP_SIZE cases.

Although enqueuing this extra work will have overhead, could maybe think of "enhanced error checking" as a kind of layer inside the layer that is enabled by an environment variable, since the user will already be setting the OPENCL_LAYERS environment variable.

Answer 5 · 2024-01-18T15:26:08.000Z

OK great, that's encouraging.

I'll play around with several options to enable the "enhanced error checking" via a more friendly mechanism.