doe300/VC4CL

clEnqueueNDRangeKernel times out on large buffers

dernasherbrezon opened this issue · 13 comments

I'm seeing very strange behaviour when processing relatively large buffers (87968 bytes). I have a test FIR filter:
https://github.com/dernasherbrezon/clDsp/blob/main/fir_filter.cl

It takes an input buffer, multiplies it by the filter taps (constant) buffer and writes to the output.

The simple test ( https://github.com/dernasherbrezon/clDsp/blob/main/test/test_fir_filter.c ) works fine.

The performance test is not. When I execute the loop 10 times ( https://github.com/dernasherbrezon/clDsp/blob/main/test/perf_fir_filter.c#L47 ), it might hang on reading the data. Any subsequent execution of any program on the GPU will then hang as well, so only a reboot helps.

When I execute the performance loop only once, everything is fine and it produces valid results.

Another observation: running with "sudo" never hangs.

Another observation: running with "sudo" is ~10 times slower than under a normal user ("pi").

How can I troubleshoot the slowness? Can it be related to some memory constraints or some user-specific limits while working with /dev/mem?

The different interfaces to access the QPUs have some "quirks", e.g. the mailbox interface does not allow kernel executions taking longer than 1s and the VCHIQ GPU service does not wait for the kernel to actually finish.
Depending on whether you run the program as root or as a "normal" user, a different one of these interfaces is used.

Can you rerun your program (with and without "sudo") with the VC4CL_DEBUG=system environment variable set and post the (VC4CL) log output?

Without sudo:

VC4CL_DEBUG=system ./perf_fir_filter 
output length max: 199
output length max: 204
working_len_total: 10996
clGetDeviceIDs: 0
clCreateContext: 0
clCreateCommandQueue: 0
allocated working buf: 10996
gpuserv: vc_gpuserv_init: starting initialisation
[VC4CL](perf_fir_filter): [VC4CL] Using VCSM (CMA) for: memory allocation
[VC4CL](perf_fir_filter): [VC4CL] Using VCHI for: kernel execution
...
average time: 0.003389

With sudo:

sudo VC4CL_DEBUG=system ./perf_fir_filter 
output length max: 199
output length max: 204
working_len_total: 10996
clGetDeviceIDs: 0
clCreateContext: 0
clCreateCommandQueue: 0
allocated working buf: 10996
[VC4CL](perf_fir_filter): [VC4CL] Using mailbox for: memory allocation, system queries
[VC4CL](perf_fir_filter): [VC4CL] Using V3D for: kernel execution, profiling, system queries
...
average time: 0.051971

Ok. So the performance difference can be explained by the different memory allocation types.

I turned on VCHI and now sudo is slow too. So this is consistent with the "no sudo" case.

It looks like VCHI is causing this slowness. After some number of executions, it started calculating very slowly, but the output is still valid (so it is not a timeout):

done: 1 9 0.002897
output: 0.000051934, 0.421165079 
done: 2 0 0.126826
output: 0.000051934, 0.421165079 

Interesting results. I never got to an actual performance comparison of the different memory allocation methods, since non-root execution does not work reliably (see the tickets I linked above).
IIRC, for VCSM the kernel manages the memory and shares it by communicating asynchronously with the firmware, whereas for Mailbox memory management the kernel is completely unaware of what is going on. So maybe the involvement of the kernel causes the slow-down... Of course, it could very well also be that I just messed up the caching flags...

If you are interested in playing around a bit more, you could rerun the program with the VC4CL_CACHE_FORCE environment variable set to 0 (uncached), 1 (host-cached), 2 (GPU-cached) and 3 (both cached), see here.

Generally, I would recommend running any VC4CL client as root (I know, not so great), allocating memory via the Mailbox and executing kernels via the V3D registers (the defaults for root), since this is by far the most-tested combination.

It gets weirder with more tests.

  1. Loop executed 2 times:
  • without sudo: average time: 0.004408, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
  • with sudo: average time: 0.102403, output: 0.000051934, 0.421165079 0.305830151, 0.193653673
  2. VC4CL_CACHE_FORCE=0 + loop executed 2 times:
  • without sudo: average time: 0.004297, output: -0.002558924, 0.426762044 0.305830151, 0.193653673
  • with sudo (timeout): average time: 60.123802, output: 0.000000000, 0.000000000 0.000000000, 0.000000000
  3. VC4CL_CACHE_FORCE=1 + loop executed 2 times:
  • without sudo: average time: 0.004470, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
  • with sudo: average time: 0.102424, output: 0.000051934, 0.421165079 0.305830151, 0.193653673
  4. VC4CL_CACHE_FORCE=2 + loop executed 2 times:
  • without sudo: average time: 0.004419, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
  • with sudo: average time: 30.062032 (!!!!!), output: 0.000000000, 0.000000000 0.000280094, 0.000072954
  5. VC4CL_CACHE_FORCE=3 + loop executed 2 times:
  • without sudo: average time: 0.004467, output: 0.000000000, 0.000000048 0.000284149, 0.000077057
  • with sudo: average time: 0.102449, output: 0.000051934, 0.421165079 0.305830151, 0.193653673

So it looks like:

  1. Non-sudo execution (VCHI) doesn't wait for the result.
  • That's why it is so fast (25x faster).
  • That's why the output is corrupted: it doesn't wait for the second execution to complete, so the buffer with the previous results is returned, or some rubbish is returned in the case of VC4CL_CACHE_FORCE=0.
  2. GPU-cached memory produces rubbish when used with sudo.
  3. Only sudo combined with host-cached memory produces valid results.

The performance gap between sudo/no-sudo is very important here, because executing a very similar filter on the CPU (+ NEON) gives "average time: 0.070378", which lies somewhere between the sudo and non-sudo results.

There are a number of things I can try to optimise performance, but first of all I would like to make the program stable :)

Some optimisations:

  • Cache the filter taps in local memory (see the sketch below)
  • Write the results to local memory first and then copy them to global memory
  • Use map/unmap for the external buffer
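A rough sketch of the first idea (hypothetical: MAX_TAPS is an assumed compile-time bound on the tap count, and whether the local copy actually pays off depends on how VC4CL maps __local memory on the VideoCore IV):

#define MAX_TAPS 64

__kernel void fir_filter_process_local(__global const float2 *restrict input, __global const float2 *restrict taps, const unsigned int taps_len, __global float2 *restrict output, const unsigned int decimation, const unsigned int output_len) {
    // stage the taps in local memory once per work-group
    __local float2 taps_cache[MAX_TAPS];
    for (unsigned int j = get_local_id(0); j < taps_len; j += get_local_size(0)) {
        taps_cache[j] = taps[j];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int i = 0; i < output_len; i++) {
        int output_offset = get_global_id(0) * output_len + i;
        int input_offset = output_offset * decimation;
        float2 acc = (float2)(0.0f, 0.0f);
        for (unsigned int j = 0; j < taps_len; j++) {
            float2 in = input[input_offset + j];
            float2 tap = taps_cache[j]; // read from the local copy
            acc.x += (in.x * tap.x) - (in.y * tap.y);
            acc.y += (in.x * tap.y) + (in.y * tap.x);
        }
        output[output_offset] = acc;
    }
}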

Performance-wise, from a quick look at your code, you could try some of these:

  • When the kernel is compiled, the buffer sizes (as well as the decimation parameter) are already known, so setting them via a macro could unroll the inner loop and optimize some arithmetic operations (see the sketch after the kernels below). This of course disallows caching the compiled kernel program to be run with different parameters...
  • If your data size is a multiple of 8/16 elements (or you can pad it to be), you may want to rewrite the kernel to use 8-/16-element vector types (i.e. load and store float16 values). This could greatly improve performance.
  • Even if the data cannot be processed completely as vectorized types, loading and storing vectors (and thus having fewer memory-access instructions) alone can give a good performance boost, since memory access is usually the bottleneck on the VC4.

E.g. this code has only half the memory accesses and takes (without scheduling overhead) about 25% fewer cycles to execute:

__kernel void fir_filter_process_memory(__global const float2 *restrict input, __global const float2 *restrict taps, const unsigned int taps_len, __global float2 *output, const unsigned int decimation, const unsigned int output_len) {

    for (unsigned int i = 0; i < output_len; i++) {
        int output_offset = get_global_id(0) * output_len + i;
        int input_offset = output_offset * decimation;
        float real0 = 0.0f;
        float imag0 = 0.0f;
        for (unsigned int j = 0; j < taps_len; j++) {
            // one float2 load per complex sample/tap instead of two scalar loads
            float2 in = input[input_offset + j];
            float2 tap = taps[j];
            real0 += (in.x * tap.x) - (in.y * tap.y);
            imag0 += (in.x * tap.y) + (in.y * tap.x);
        }
        output[output_offset] = (float2)(real0, imag0);
    }
}

This code could be twice as fast (again ignoring scheduling overhead), but requires the input and taps buffers to be padded to 8 floats and a constant tap count of 3:

__kernel void fir_filter_process_vectorized(__global const float *restrict input, __global const float *restrict taps, __global float2 *output, const unsigned int decimation, const unsigned int output_len) {

    for (unsigned int i = 0; i < output_len; i++) {
        int output_offset = (get_global_id(0) * output_len + i);
        int input_offset = output_offset * 2 * decimation;
        float8 in = *((__global float8*)(input + input_offset));
        float8 tap = *((__global float8*)taps);
        float real0 = 0.0f;
        float imag0 = 0.0f;
    
        // tap 1
        real0 += (in.s0 * tap.s0) - (in.s1 * tap.s1);
        imag0 += (in.s0 * tap.s1) + (in.s1 * tap.s0);
        // tap 2
        real0 += (in.s2 * tap.s2) - (in.s3 * tap.s3);
        imag0 += (in.s2 * tap.s3) + (in.s3 * tap.s2);
        // tap 3
        real0 += (in.s4 * tap.s4) - (in.s5 * tap.s5);
        imag0 += (in.s4 * tap.s5) + (in.s5 * tap.s4);
    
        output[output_offset] = (float2)(real0, imag0);
    }
}
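To illustrate the first suggestion, a minimal host-side sketch (TAPS_LEN and DECIMATION are hypothetical macro names the kernel would use in place of its taps_len/decimation arguments):

#include <stdio.h>
#include <CL/cl.h>

/* Bake the parameters that are known before compilation into the kernel
 * as macros via -D build options, so the compiler can unroll the inner
 * loop and fold some of the arithmetic. */
static cl_int build_with_constants(cl_program program, cl_device_id device, unsigned taps_len, unsigned decimation) {
    char options[64];
    snprintf(options, sizeof(options), "-DTAPS_LEN=%u -DDECIMATION=%u", taps_len, decimation);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}

The kernel would then loop with for (unsigned int j = 0; j < TAPS_LEN; j++) instead of taking taps_len as an argument.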

Ok. I will go with sudo-enabled access and start optimising the code.

This is not related to this issue, but I switched to float8 and got a 5x performance boost.

So it looks like it is possible to beat the CPU at this task.

I have been extensively testing the timeout issue for the last several days:

  1. It seems the mailbox call 0x00030011 returns before the actual computation completes, similar to the issue with VCHIQ. I've made a loop that runs the same kernel + sleep(1).

Normally the computation takes ~16k us and lots of execution cycles:

[VC4CL](VC4CL Queue Han): Elapsed time: 16264us
[VC4CL](VC4CL Queue Han): Clock speed: 0
[VC4CL](VC4CL Queue Han): Instruction count: 461
[VC4CL](VC4CL Queue Han): Explicit uniform count: 3
[VC4CL](VC4CL Queue Han): QPUs used: 12
[VC4CL](VC4CL Queue Han): Kernel repetition count: 17
[VC4CL](VC4CL Queue Han): Execution cycles: 38808180

But after some time it returns earlier:

[VC4CL](VC4CL Queue Han): Elapsed time: 548us
[VC4CL](VC4CL Queue Han): Clock speed: 0
[VC4CL](VC4CL Queue Han): Instruction count: 461
[VC4CL](VC4CL Queue Han): Explicit uniform count: 3
[VC4CL](VC4CL Queue Han): QPUs used: 12
[VC4CL](VC4CL Queue Han): Kernel repetition count: 17
[VC4CL](VC4CL Queue Han): Execution cycles: 801568

Note the much smaller execution cycle count. However, thanks to the sleep(1), I can submit the same kernel again. It still returns early, but the execution cycles (which are read from the GPU) stay relatively normal:

[VC4CL](VC4CL Queue Han): Elapsed time: 548us
[VC4CL](VC4CL Queue Han): Clock speed: 0
[VC4CL](VC4CL Queue Han): Instruction count: 461
[VC4CL](VC4CL Queue Han): Explicit uniform count: 3
[VC4CL](VC4CL Queue Han): QPUs used: 12
[VC4CL](VC4CL Queue Han): Kernel repetition count: 17
[VC4CL](VC4CL Queue Han): Execution cycles: 38803212

So it looks like the GPU computes the kernel, but for some reason the mailbox interface returns early.

  2. Buffer size has no effect. I can reduce it and still get these early returns.
  3. A simple kernel works well. I tried the following kernel with the same buffers and input parameters and never saw early returns:
__kernel void fir_filter_process(__global const float *restrict input, __global const float *restrict taps, __global float *output) {

    int output_offset = get_global_id(0) * 2;

    output[output_offset] = 1;
    output[output_offset + 1] = 2;

}

A couple more observations:

  1. If I remove the sleep(1), then after an early return I cannot run the application again. It will crash the GPU (?) or saturate some internal buffer in ThreadX? Only a power off/on helps.
  2. With the sleep(1) I can re-run the application and get the same 16k us execution times. So something resets/re-initializes between two subsequent executions (the test loop is sketched below).
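For reference, a minimal reconstruction of the test loop (setup of the queue, kernel and global_size is omitted; the names are illustrative):

#include <stdio.h>
#include <unistd.h>
#include <CL/cl.h>

/* Submit the same kernel repeatedly with a one-second pause between runs. */
void run_loop(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
    for (int run = 0; run < 100; run++) {
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        if (err == CL_SUCCESS) {
            err = clFinish(queue); // blocks until the mailbox call returns
        }
        printf("run %d: status %d\n", run, err);
        sleep(1); // without this pause, a run after an early return hangs
    }
}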

Tried running code similar to add.py:

    for x in range(100):
      start = time.time()
      drv.execute(
        n_threads=n_threads,
        program=code,
        uniforms=uniforms
      )
      elapsed_gpu = time.time() - start
      print('GPU: {:.4f} sec'.format(elapsed_gpu))

using py-videocore and got the same timeout:

GPU: 0.0044 sec
GPU: 0.0044 sec
Traceback (most recent call last):
  File "add.py", line 118, in <module>
    uniforms=uniforms
  File "/usr/local/lib/python3.7/dist-packages/videocore/driver.py", line 238, in execute
    r = self.mailbox.execute_qpu(n_threads, message.address, 0, timeout)
  File "/usr/local/lib/python3.7/dist-packages/videocore/mailbox.py", line 110, in f
    r = self._simple_call(name, tag, req_fmt, res_fmt, list(args))[5:]
  File "/usr/local/lib/python3.7/dist-packages/videocore/mailbox.py", line 98, in _simple_call
    ioctl(self.fd, IOCTL_MAILBOX, buf, True)
TimeoutError: [Errno 110] Connection timed out

Tried bullseye and got a timeout after the very first execution when executing via the mailbox.

[VC4CL](VC4CL Queue Han): Mailbox buffer before: 00000020 00000000 00030002 00000008 00000004 00000005 00000000 00000000
[VC4CL](VC4CL Queue Han): Mailbox buffer after: 00000020 80000000 00030002 00000008 80000008 00000005 11e1a300 00000000
[VC4CL](VC4CL Queue Han): Mailbox request: succeeded
[VC4CL](VC4CL Queue Han): Mailbox buffer before: 00000028 00000000 00030011 00000010 00000010 0000000c bebdc290 00000000 00007530 00000000
[VC4CL](VC4CL Queue Han): ioctl_set_msg failed: -1
[VC4CL] Error in mbox_property: Connection timed out

Here is the firmware version:

pi@raspberrypi:~ $ sudo vcgencmd version
Oct 29 2021 10:49:08 
Copyright (c) 2012 Broadcom
version b8a114e5a9877e91ca8f26d1a5ce904b2ad3cf13 (clean) (release) (start)

Isn't this also related to raspberrypi/linux#4321?

Can you check how long it actually took to time out (from the start of that particular Mailbox call to the timeout error)?
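For example, a host-side sketch along these lines (the names are illustrative) would measure the whole span from submission to the return of the blocking wait, so a mailbox timeout shows up as the elapsed time:

#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

/* Run one kernel execution and print its wall-clock duration, including
 * the blocking wait in which the mailbox timeout would occur. */
static double run_and_time(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    if (err == CL_SUCCESS) {
        err = clFinish(queue);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("status %d, elapsed %.6f s\n", err, elapsed);
    return elapsed;
}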

Isn't this also related to raspberrypi/linux#4321?

Unlikely. I'm executing exactly the same code all the time and it takes ~16264us to execute. On "buster" I got a timeout after several executions, while on "bullseye" it fails instantly.

I think it relates to raspberrypi/firmware#1582, or at least looks very similar.

Another observation (not sure if related): the GPU firmware becomes corrupted. For example, other modules responsible for changing the frequency stop working.