[Question] Object Detection running with UMat and/or OpenCL target noticeably slower
angryGoat500 opened this issue · 1 comments
Hey everyone
I have a question regarding the Transparent API / Preferable Target and hope someone can help me understand.
My Object Detection program takes a lot longer to process images when using
net.setPreferableTarget(cv2.dnn.DNN_TARGET_OPENCL_FP16)
or
image = cv2.imread(filePath)  # note: imread expects an IMREAD_* flag; cv2.COLOR_BGR2RGB is a cvtColor code
uMat = cv2.UMat(image)
I've created 4 benchmark programs running sequentially, each processing the same 10 .jpg files.
My baseline is a standard OpenCV object detection program that uses neither setPreferableTarget nor the UMat class for images.
The second one sets the setPreferableTarget to cv2.dnn.DNN_TARGET_OPENCL_FP16
The third converts images into UMat objects
The fourth sets the setPreferableTarget to cv2.dnn.DNN_TARGET_OPENCL_FP16 and converts images into UMat objects.
I always measured the full processing time, starting before the image is read and ending after the labels are drawn (excluding writing the output image or the detection log), as well as the model inference time via
t, _ = net.getPerfProfile()
infTime = t / cv2.getTickFrequency()
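For reference, the full-pipeline measurement described above can be sketched with a stdlib timer; the per-image work (read, preprocess, infer, draw labels) is a placeholder callable here, and the ticks-to-seconds conversion mirrors the getPerfProfile arithmetic:

```python
import time

def time_pipeline(process, paths):
    """Measure end-to-end wall-clock time over a list of image paths.

    `process` stands in for whatever the benchmark does per image
    (read, preprocess, run inference, draw labels).
    """
    start = time.perf_counter()
    for p in paths:
        process(p)
    return time.perf_counter() - start

def ticks_to_seconds(ticks, tick_frequency):
    # net.getPerfProfile() returns tick counts; dividing by
    # cv2.getTickFrequency() converts them to seconds.
    return ticks / tick_frequency
```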
The collected output is as follows:
Benchmark One Full Processing Time: 2.78063s
Benchmark One Model Inference Time: 1.030843s
Benchmark Two Full Processing Time: 3.2567s
Benchmark Two Model Inference Time: 1.12314s
Benchmark Three Full Processing Time: 12.76886s
Benchmark Three Model Inference Time: 10.83879s
Benchmark Four Full Processing Time: 13.43161047s
Benchmark Four Model Inference Time: 11.27375169s
Is there such a large gap between CPU and GPU execution because of the data transfer between the processing units? Am I missing something crucial?
If this large gap can be explained by the data transfer, is there a way to "bundle" my workload to reduce the number of transfers?
I can provide the full code for these benchmark programs if it would be helpful.
Thanks in advance!
There are several reasons why an execution via VC4CL could be slow, e.g.
- The kernel is not very optimized
- The kernel is very memory-bound (which is rather slow on the VideoCore IV GPU, esp. read/write-memory)
- The measurement includes the kernel compilation time, which can take a while
- ...
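Regarding the compilation-time point above: one-time costs such as OpenCL kernel compilation can be excluded by running one untimed warm-up call before measuring. A minimal sketch (the `run_once` callable is a placeholder for one full inference):

```python
import time

def time_excluding_warmup(run_once, iterations=10):
    # The first call may include one-time costs (e.g. OpenCL kernel
    # compilation); run it untimed, then time the steady state and
    # report the mean per-iteration duration.
    run_once()  # warm-up, not measured
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    return (time.perf_counter() - start) / iterations
```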
As for bundling the workload, I have no clue about OpenCV, but since it is the actual OpenCL client, I think the bundling would have to be done there...