Inference time
Kulikovpavel opened this issue · 21 comments
For some reason, inference time with an IR model of version 10 (the new format) is ten times slower on CPU with this wheel than with the OpenCV integrated into the OpenVINO toolkit.
Any idea why?
Nope
Someone should write a standard inference speed test that we could use to measure this.
Also, the build info should be compared, and the available CPU features could be the cause.
I'll test the current wheel against the latest OpenVINO release on my i7-8550U CPU @ 1.80GHz, but I do not know exactly when.
I've created two Ubuntu 18.04 LTS instances with `multipass launch -c 4 -m 4G`.
For the first instance, I downloaded and installed l_openvino_toolkit_p_2020.1.023.tgz.
For the second, I used opencv_python_inference_engine-4.2.0.3-py3-none-manylinux1_x86_64.whl.
Inference speed was tested and measured as described here.
Code for downloading the models on the OpenVINO instance:
```bash
#!/bin/bash
# URLs, filenames and checksums are from:
# + <https://github.com/opencv/open_model_zoo/blob/2020.1/models/intel/text-detection-0004/model.yml>
declare -a models=("text-detection-0004.xml"
                   "text-detection-0004.bin")

url_start="https://download.01.org/opencv/2020/openvinotoolkit/2020.1/open_model_zoo/models_bin/1"

for i in "${models[@]}"; do
    if [ ! -f "$i" ]; then
        wget "${url_start}/${i%.*}/FP32/${i}"
    else
        sha256sum -c "${i}.sha256sum"
    fi
done
```
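The measurement itself boils down to something like the following sketch (not the exact linked test; the 1x3x768x1280 input shape for text-detection-0004 and the random uint8 blob are assumptions here):

```python
import cv2
import numpy as np
from timeit import repeat

# Load the IR model downloaded by the script above
net = cv2.dnn.readNet("text-detection-0004.xml", "text-detection-0004.bin")
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)

# Random input blob in NCHW layout (shape assumed to match the model)
blob = (np.random.standard_normal((1, 3, 768, 1280)) * 255).astype(np.uint8)
net.setInput(blob)

_ = net.forward()  # warm-up run, triggers network compilation
runs = repeat(lambda: net.forward(), repeat=7, number=10)
print("{:.1f} ms per forward pass (best of 7 runs, 10 loops each)".format(min(runs) / 10 * 1000))
```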
Setup | Inference time |
---|---|
OpenVINO | 173 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
My wheel | 173 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
Conclusion -- no difference.
My best guess for your problem: either you were using different Target/Backend combinations, or your CPU has the AVX512 instruction set (it is enabled in OpenVINO and disabled in my wheel), or you got tangled in environment variables. Try to repeat my steps and check /proc/cpuinfo for avx512.
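For example, a quick Linux-only check from Python (it just greps /proc/cpuinfo):

```python
# Look for any avx512* feature flag in /proc/cpuinfo (Linux only)
with open("/proc/cpuinfo") as f:
    flags = {flag for line in f if line.startswith("flags") for flag in line.split()}

avx512_flags = sorted(flag for flag in flags if flag.startswith("avx512"))
print(avx512_flags if avx512_flags else "no AVX512 support")
```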
Thanks for checking
My model is se_resnext50, converted from pytorch_toolbelt -> onnx -> IR
( https://drive.google.com/drive/folders/1ugZ7KKkS7IcHazdMulWAQUpA6DwWhkjh?usp=sharing )
(1,3,224,224) input size
With the version 7 IR models that I had previously (not se_resnext50), everything is OK, same speed. But with the new version 10 model the difference is 10x:
300 ms for opencv-python-inference-engine vs 30 ms for the OpenVINO version on the same machine.
Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz, no avx512
```
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht pbe syscall nx pdpe1gb lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq dtes64 ds_cpl ssse3 sdbg fma cx16 xtpr pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase bmi1 hle avx2 bmi2 erms rtm xsaveopt arat
```
I noticed it when my production code suddenly became 10x slower with that model.
Same target and backend as in your test.
With your net:
```python
import cv2
import numpy as np

xml_model_path = "se_net.xml"
net = cv2.dnn.readNet(xml_model_path, xml_model_path[:-3] + 'bin')

blob = (np.random.standard_normal((1, 3, 224, 224)) * 255).astype(np.uint8)
net.setInput(blob)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)

_ = net.forward()           # warm-up run
%timeit _ = net.forward()   # IPython magic, produces the numbers below
```
OpenVINO: 49.2 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Wheel: 542 ms ± 7.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So yes, there is a 10-fold inference speed difference on that net, but I have no idea why.
Maybe it is because se_resnext50 has some new layers which are fast only with third-party libraries that I did not compile in:
```
Other third-party libraries:
  Intel IPP:          2019.0.0 Gold [2019.0.0]
    at:               /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/build_release/3rdparty/ippicv/ippicv_lnx/icv
  Intel IPP IW:       sources (2019.0.0)
    at:               /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/build_release/3rdparty/ippicv/ippicv_lnx/iw
  Inference Engine:   YES (2020010000 / 2.1.0)
    * libs:           /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/inference_engine/lib/intel64/libinference_engine_c_api.so
    * includes:       /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/inference_engine/include
  nGraph:             YES (0.27.1-rc.0+b0bb801)
    * libs:           /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/ngraph/lib/libngraph.so
    * includes:       /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/ngraph/include
  Custom HAL:         NO
  Protobuf:           build (3.5.1)
```
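(For the record, excerpts like the one above can be printed from Python via OpenCV's own build info dump:)

```python
import cv2

# Prints the full build configuration, including the "Other third-party libraries"
# and "Inference Engine" sections quoted here
print(cv2.getBuildInformation())
```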
I'll try a few different builds.
Well, I managed to reproduce the third-party lib setup:
```
Other third-party libraries:
  Intel IPP:          2019.0.0 Gold [2019.0.0]
    at:               /home/ubuntu/opencv-python-inference-engine/build/opencv/3rdparty/ippicv/ippicv_lnx/icv
  Intel IPP IW:       sources (2019.0.0)
    at:               /home/ubuntu/opencv-python-inference-engine/build/opencv/3rdparty/ippicv/ippicv_lnx/iw
  Lapack:             NO
  Inference Engine:   YES (2020010000 / Unknown)
    * libs:           /home/ubuntu/opencv-python-inference-engine/dldt/bin/intel64/Release/lib/libinference_engine.so
    * includes:       /home/ubuntu/opencv-python-inference-engine/dldt/inference-engine/include
  nGraph:             YES (0.27.1-rc.0+b0bb801)
    * libs:           /home/ubuntu/opencv-python-inference-engine/dldt/bin/intel64/Release/lib/libngraph.so
    * includes:       /home/ubuntu/opencv-python-inference-engine/dldt/ngraph/src
  Custom HAL:         NO
  Protobuf:           build (3.5.1)
```
But nothing changed.
Maybe it is somehow related to the MKL-DNN or TBB libraries?
It could be, e.g., dldt compiled with `-DGEMM=MKL`.
I doubt that a TBB version difference could cause such a gap in inference speed.
But I'd better go and ask the dldt developers about it and about how I can get the dldt build info.
My question about dldt build info: https://stackoverflow.com/questions/60704887/q-how-can-i-get-dldt-buildinfo
Also, it may be related to this: openvinotoolkit/openvino#166
I'll write here if I find something new.
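For reference, one way to at least query the Inference Engine CPU plugin version from Python is via the OpenVINO bindings (a sketch; it assumes the OpenVINO/dldt Python API is installed, which is not part of this wheel):

```python
# Assumes the OpenVINO (dldt) Python bindings are installed separately
from openvino.inference_engine import IECore

ie = IECore()
# Maps the device name ("CPU") to the loaded plugin's version/build information
print(ie.get_versions("CPU"))
```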
So, I managed to solve it, as a first approximation.
I had to compile dldt with `-D GEMM=MKL -D MKLROOT .....`,
like in openvinotoolkit/openvino#327.
Now the inference speed on the provided NN is the same whether using OpenVINO or my wheel -- ~48 ms, but it adds +125 MB (libmkml_gnu.so) to the wheel size, which is bad.
OpenVINO ships with some 30 MB mkl_tiny, but there are no instructions on how to build it.
Building mkl_tiny is not an option: oneapi-src/oneDNN#674
But I built the wheel with OpenBLAS and it has the same inference speed with se_net. Also, OpenBLAS is much smaller.
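To confirm which GEMM/BLAS and backend libraries a given wheel actually loads at runtime, one can check the process's mapped shared objects (a Linux-only sketch):

```python
# Linux-only: list the math/backend shared libraries mapped into the process
# after importing cv2 from a given wheel.
import cv2  # noqa: F401  (importing it is what loads the libraries)

with open("/proc/self/maps") as f:
    loaded = {line.split()[-1] for line in f if ".so" in line}

for path in sorted(loaded):
    if any(name in path for name in ("openblas", "mkl", "mkml", "inference_engine", "ngraph", "tbb")):
        print(path)
```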
It seems MKL should be faster (https://software.intel.com/en-us/articles/performance-comparison-of-openblas-and-intel-math-kernel-library-in-r).
But if you are sure, I'm OK with that, thanks for your work!
Well, I have problems compiling OpenBLAS in the fastest way possible. Currently the fastest OpenBLAS variant is the precompiled Ubuntu lib:
GEMM | Inference time (ms) | Lib size (MB) |
---|---|---|
JIT | 851 | - |
OpenBLAS 0.2.20 | 520 | 13.7 |
OpenBLAS 0.3.9 | 160 | 16.4 |
MKL (MKL-DNN) | 60 | 125 |
OpenVINO (mkl_tiny) | 55 | 30 |
OpenBLAS 0.2.20 (binary) | 60 | 32 |
OpenMathLib/OpenBLAS@9f67d03 | 55 | 4.2 |
Inference times above are for se_net.
OpenBLAS also requires some additional libs like gfortran, +2 MB in total.
My estimate is that this wheel is about 10% slower with your "se_net" than OpenVINO.
And this without-gfortran wheel is 2 times slower than OpenVINO, but its libopenblas.so is only 3.9 MB.
Any feedback is welcome
UPD: With the help of the OpenBLAS contributors, I've managed to compile a 4.2 MB lib with approximately the same inference speed as mkl_tiny (for se_net).
Please refer here for details: OpenMathLib/OpenBLAS#2528
Great update, thanks!
FYI, a new version of OpenVINO is out:

> CPU Plugin
> Removed the mkltiny dependency. This allowed to significantly decrease the binary distribution size required for inference on CPU.
@Kulikovpavel aha, thx. Will need to do something with this at the weekend.
@Kulikovpavel now one can compile dldt with `GEMM=JIT` and get OpenBLAS-comparable speed on your net (and save 4.2 MB):
GEMM | Inference time (ms) | Lib size (MB) |
---|---|---|
JIT | 52 | - |
OpenMathLib/OpenBLAS@9f67d03 | 51 | 4.2 |
Hi @banderlog, I have the same performance problem with your version of OpenCV in the new release, with the same network as above.
Version 4.3.0.2, could you check that?
@Kulikovpavel well, I'll check it in case I missed something, but I run inference speed tests with your network every time I build the library. Before this release I compared the inference speed of JIT vs OpenBLAS and kept OpenBLAS (approximately 50 vs 40 ms).
Do you have the same network and setup as before?
As you can see from the table below, everything is as it should be. I even tested the wheels from the repo and from PyPI separately.
Version | Inference speed |
---|---|
4.3.0.2 | 99.5 ms ± 1.22 ms |
4.3.0.1 | 100 ms ± 1.53 ms |
4.2.0.4 | 109 ms ± 3.71 ms |
4.2.0.3 | 1.07 s ± 13.3 ms |
In the message above I reported smaller numbers, but I achieved them in a different environment (an LXD Linux container).
Code for replication:
```bash
# sudo snap install multipass
multipass launch -c 6 -d 10G -m 7G -n test
multipass shell test

sudo apt-get update
sudo apt install git python3 virtualenv
git clone https://github.com/banderlog/opencv-python-inference-engine
cd opencv-python-inference-engine/tests/

# wget https://files.pythonhosted.org/packages/f0/ee/36d75596ce0b6212821510efb56dfca962f4add3fdaf49345bc93920a984/opencv_python_inference_engine-4.3.0.2-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.3.0.2/opencv_python_inference_engine-4.3.0.2-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.3.0.1/opencv_python_inference_engine-4.3.0.1-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.2.0.4/opencv_python_inference_engine-4.2.0.4-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.2.0.3/opencv_python_inference_engine-4.2.0.3-py3-none-manylinux1_x86_64.whl

# the first run will take a lot of time, because it will install all needed python packages
./prepare_and_run_tests.sh opencv_python_inference_engine-4.3.0.2*
./prepare_and_run_tests.sh opencv_python_inference_engine-4.3.0.1*
./prepare_and_run_tests.sh opencv_python_inference_engine-4.2.0.4*
./prepare_and_run_tests.sh opencv_python_inference_engine-4.2.0.3*
```
If you run it, you will get different inference times, but 4.2.0.3's should be roughly 10x greater than the others.
@Kulikovpavel does your performance problem still persist?