[Question] Object Detection running with UMat and/or OpenCL target noticeably slower
angryGoat500 opened this issue · 1 comments
Hey everyone
I have a question regarding the Transparent API / Preferable Target and hope someone can help me understand.
My Object Detection program takes a lot longer to process images when using
net.setPreferableTarget(cv2.dnn.DNN_TARGET_OPENCL_FP16)
or
image = cv2.imread(filePath)  # note: imread expects an IMREAD_* flag; cv2.COLOR_BGR2RGB is a cvtColor code
uMat = cv2.UMat(image)
I've created 4 benchmark programs running sequentially, each processing the same 10 .jpg files.
My baseline is a standard OpenCV object detection program that uses neither setPreferableTarget nor the UMat class for images.
The second one sets the setPreferableTarget to cv2.dnn.DNN_TARGET_OPENCL_FP16
The third converts images into UMat objects
The fourth sets the setPreferableTarget to cv2.dnn.DNN_TARGET_OPENCL_FP16 and converts images into UMat objects.
I always measured the full processing time, starting before the image is read and ending after the labels are drawn (excluding writing the output image or the detection log), as well as the model inference time via
t, _ = net.getPerfProfile()
infTime = t / cv2.getTickFrequency()
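For reference, the full-pipeline measurement described above can be sketched with a stdlib timer; the per-image work (read, preprocess, infer, draw labels) is a placeholder callable here, and the ticks-to-seconds conversion mirrors the getPerfProfile arithmetic:

```python
import time

def time_pipeline(process, paths):
    """Measure end-to-end wall-clock time over a list of image paths.

    `process` stands in for whatever the benchmark does per image
    (read, preprocess, run inference, draw labels).
    """
    start = time.perf_counter()
    for p in paths:
        process(p)
    return time.perf_counter() - start

def ticks_to_seconds(ticks, tick_frequency):
    # net.getPerfProfile() returns tick counts; dividing by
    # cv2.getTickFrequency() converts them to seconds.
    return ticks / tick_frequency
```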
The collected output is as follows:
Benchmark One Full Processing Time: 2.78063s
Benchmark One Model Inference Time: 1.030843s
Benchmark Two Full Processing Time: 3.2567s
Benchmark Two Model Inference Time: 1.12314s
Benchmark Three Full Processing Time: 12.76886s
Benchmark Three Model Inference Time: 10.83879s
Benchmark Four Full Processing Time: 13.43161047s
Benchmark Four Model Inference Time: 11.27375169s
Is there such a large gap between CPU and GPU execution because of the data transfer between the processing units? Am I missing something crucial?
If this large gap can be explained by the data transfer, is there a way to "bundle" my workload to reduce the number of transfers?
I can provide the full code for these benchmark programs if it would be helpful.
Thanks in advance!
There are several reasons why an execution via VC4CL could be slow, e.g.
- The kernel is not very optimized
- The kernel is very memory-bound (which is rather slow on the VideoCore IV GPU, esp. read/write-memory)
- The measurement includes the kernel compilation time, which can take a while
- ...
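Regarding the compilation-time point above: one-time costs such as OpenCL kernel compilation can be excluded by running one untimed warm-up call before measuring. A minimal sketch (the `run_once` callable is a placeholder for one full inference):

```python
import time

def time_excluding_warmup(run_once, iterations=10):
    # The first call may include one-time costs (e.g. OpenCL kernel
    # compilation); run it untimed, then time the steady state and
    # report the mean per-iteration duration.
    run_once()  # warm-up, not measured
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    return (time.perf_counter() - start) / iterations
```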
As for bundling the workload, I have no clue about OpenCV, but since it is the actual OpenCL client, I think the bundling would have to be done there...