Performance issue
Hello, I am running into a performance problem when using the NPU (HTP, a.k.a. cDSP) on Snapdragon 8 Gen 3.
Here are the details.
When we call qhblas_hvx_ah_matrix_vector_mpy_ab from the qhl_hvx library in the Hexagon SDK, it is much slower than computing the same product directly on the CPU with Arm Neon. The results are shown below.
cDSP(NPU) [13008, 5120] * [5120, 1] 45ms
CPU [13008, 5120] * [5120, 1] 10ms
After that, we set the power mode to performance mode. The cDSP execution time improved slightly, but it is still slower than the CPU.
cDSP(NPU) [13008, 5120] * [5120, 1] 36ms
CPU [13008, 5120] * [5120, 1] 10ms
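To put the timings above on a common scale, here is a minimal sketch in C that converts each measurement into effective throughput. It assumes each multiply-accumulate counts as two operations. The function names (matvec_ops, giga_ops_per_second, print_throughput) are hypothetical helpers for illustration and are not part of the Hexagon SDK.

```c
#include <assert.h>
#include <stdio.h>

/* Operation count for an [rows, cols] * [cols, 1] product.
   Assumption: one multiply-accumulate is counted as two operations. */
double matvec_ops(long rows, long cols) {
    return 2.0 * (double)rows * (double)cols;
}

/* Effective throughput in giga-operations per second for one call
   that took elapsed_ms milliseconds. */
double giga_ops_per_second(long rows, long cols, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return matvec_ops(rows, cols) / (elapsed_ms * 1e-3) / 1e9;
}

/* Print one row of the comparison: label, time, and throughput. */
void print_throughput(const char *label, long rows, long cols,
                      double elapsed_ms) {
    printf("%-24s %6.1f ms  %6.2f Gop/s\n", label, elapsed_ms,
           giga_ops_per_second(rows, cols, elapsed_ms));
}
```

Calling print_throughput("cDSP (default)", 13008, 5120, 45.0), and likewise with 36.0 and 10.0, gives roughly 2.96, 3.70, and 13.32 Gop/s. By this measure the cDSP runs are about 4.5x and 3.6x slower than the CPU run.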
I would like to know whether these results are expected and consistent with your tests. Looking forward to your response.
After chatting on Slack, Yixin mentioned that they are using the QNN SDK to run their model and hit this issue, but have since resolved it. Closing this issue, as no action is needed here.