doe300/VC4CL

The performance problem with OpenCL empty kernel on Raspberry Pi 3 B+

zhaodongmcw opened this issue · 3 comments

I ran an empty OpenCL kernel over a 640x480 work range on a Raspberry Pi 3 B+, and the kernel takes about 40 ms. The results are as below.

platform name: OpenCL for the Raspberry Pi VideoCore IV GPU
The number of devices: 1
device name: VideoCore IV GPU
clEnqueueNDRangeKernel time: 39.9785 ms
clEnqueueReadBuffer time: 4.824 ms

The following is reference code.
cl_kernel2.cl.txt
vecAdd.txt
makefile.txt

You can delete the .txt suffix to restore them to .cl, .cpp, and makefile files. Then they can be compiled and run on the Raspberry Pi.

The OS of my machine is Linux raspberrypi 4.14.98-v7+ #1200 SMP Tue Feb 12 20:27:48 GMT 2019 armv7l GNU/Linux.

The machine is Raspberry Pi 3 Model B Plus Rev 1.3.

The g++ version is 6.3.0 20170516.
g++ (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I look forward to your reply.

If you run 640x480 work-items and do not explicitly specify the work-group size, 25600 work-groups (with 12 work-items each) are run.
Taking 40ms for running 25600 work-groups gives an overhead of ~1.5us per work-group.
Given a Raspberry Pi 3B+ with 1.4GHz CPU frequency, this results in ~2k CPU cycles overhead per work-group. This value includes all the preparation and memory copying required as well as waiting for the kernel execution to finish.
So if you break it down like this, the performance doesn't look that bad to me.

That said, I am looking into parallelizing the work, e.g. preparing the next work-group while the previous is still running. This might reduce the overhead a bit.

Thank you very much, doe300. I understand now.

With the new optimizations in the commits referenced above, the total execution time drops to 7ms.

Rerunning the above calculation:
Taking 7ms for running 25600 work-groups now gives ~273ns per work-group.
Running at 250MHz, a single QPU instruction takes 4ns.
Given the ~40 instructions executed for an empty kernel and the ~70 instructions' worth of time measured per work-group, we have an overhead of ~30 instructions per work-group. This includes GPU-side cache misses, initial host-side preparation as well as the synchronization delay between the GPU-side interrupt and the host-side polling of the interrupt values.

-> There is probably still something to be done by playing around with caching settings, but I think we are good here!