banderlog/opencv-python-inference-engine

Inference time

Kulikovpavel opened this issue · 21 comments

For some reason, inference time with an IR version 10 model (the new format) is ten times slower on CPU with this wheel than when using the OpenCV that comes integrated with the OpenVINO toolkit.

Any idea why?

Nope

Someone should write a standard inference speed test that we could use to measure this.
Also, build info should be compared. And the available CPU features could be a factor.

I'll test the current wheel against the latest OpenVINO release on my CPU (i7-8550U @ 1.80GHz), but I do not know exactly when.

I've created two Ubuntu 18.04 LTS instances with multipass launch -c 4 -m 4G.

For the first instance, I downloaded and installed l_openvino_toolkit_p_2020.1.023.tgz.
For the second, I used opencv_python_inference_engine-4.2.0.3-py3-none-manylinux1_x86_64.whl.

Inference speed was tested and measured as described here (see also the timing sketch after the download script below).

Code for downloading the models on the OpenVINO instance:

#!/bin/bash

# urls, filenames and checksums are from:
#  + <https://github.com/opencv/open_model_zoo/blob/2020.1/models/intel/text-detection-0004/model.yml>
declare -a models=("text-detection-0004.xml"
                   "text-detection-0004.bin")

url_start="https://download.01.org/opencv/2020/openvinotoolkit/2020.1/open_model_zoo/models_bin/1"

for i in "${models[@]}"; do
    if [ ! -f "$i" ]; then
        # download the FP32 IR file if it is not present yet
        wget "${url_start}/${i%.*}/FP32/${i}"
    else
        # otherwise verify the existing file against its saved checksum
        sha256sum -c "${i}.sha256sum"
    fi
done
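For reference, the measurement itself boils down to loading the IR with cv2.dnn, selecting the Inference Engine backend, and timing net.forward(). A minimal sketch, assuming the FP32 text-detection-0004 files fetched by the script above and a 1x3x768x1280 input (per the model description):

import timeit

import cv2
import numpy as np

net = cv2.dnn.readNet("text-detection-0004.xml", "text-detection-0004.bin")
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)

# random input of the model's expected shape; uint8, as in the se_net snippet below
blob = (np.random.standard_normal((1, 3, 768, 1280)) * 255).astype(np.uint8)
net.setInput(blob)
_ = net.forward()  # warm-up run

times = timeit.repeat(lambda: net.forward(), repeat=7, number=10)
print(f"{min(times) / 10 * 1000:.1f} ms per forward() (best of 7 runs, 10 loops each)")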
Results
| Build | Inference time |
| --- | --- |
| OpenVINO | 173 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| My wheel | 173 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |

Conclusion -- no difference.

My best guess for your problem: you were using different target/backend combinations, or your CPU has the AVX-512 instruction set (it is enabled in OpenVINO and disabled in my wheel), or you got tangled up in environment variables. Try to repeat my steps and check /proc/cpuinfo for avx512.
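For example, a quick Linux-only sketch for checking the AVX-512 flags from Python:

# Linux-specific: parse the CPU feature flags reported in /proc/cpuinfo
with open("/proc/cpuinfo") as f:
    flags_line = next(line for line in f if line.startswith("flags"))

avx512 = sorted({flag for flag in flags_line.split() if flag.startswith("avx512")})
print(avx512 if avx512 else "no avx512 flags reported")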

Thanks for checking

My model is se_resnext50, converted from pytorch_toolbelt -> onnx -> IR
( https://drive.google.com/drive/folders/1ugZ7KKkS7IcHazdMulWAQUpA6DwWhkjh?usp=sharing )
(1,3,224,224) input size
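For context, the PyTorch -> ONNX leg of that conversion looks roughly like this (a hypothetical sketch with a torchvision stand-in network, not the exact pytorch_toolbelt se_resnext50 code; OpenVINO's Model Optimizer then turns the ONNX file into IR v10):

import torch
import torchvision

# stand-in model for illustration only; the real one is se_resnext50 from pytorch_toolbelt
model = torchvision.models.resnet50(pretrained=False)
model.eval()
dummy = torch.randn(1, 3, 224, 224)  # matches the (1, 3, 224, 224) input size above
torch.onnx.export(model, dummy, "se_net.onnx", opset_version=10)
# mo.py (Model Optimizer) then produces se_net.xml / se_net.bin from se_net.onnx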

With the version 7 IR models that I had previously (not se_resnext50), everything is OK, same speed. But with the new version 10 one, the difference is 10x.

300 ms for opencv-python-inference-engine vs 30 ms for the OpenVINO version on the same machine.

Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz, no avx512

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht pbe syscall nx pdpe1gb lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq dtes64 ds_cpl ssse3 sdbg fma cx16 xtpr pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase bmi1 hle avx2 bmi2 erms rtm xsaveopt arat

I noticed this when my production code suddenly became 10x slower with that model.
Same target and backend as in your test.

With your net:

import cv2
import numpy as np

xml_model_path = "se_net.xml"
# IR model: the .xml holds the topology, the .bin next to it holds the weights
net = cv2.dnn.readNet(xml_model_path, xml_model_path[:-3] + 'bin')
blob = (np.random.standard_normal((1, 3, 224, 224)) * 255).astype(np.uint8)
net.setInput(blob)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
_ = net.forward()  # warm-up run

%timeit _ = net.forward()  # IPython magic; run inside IPython/Jupyter

OpenVINO: 49.2 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Wheel: 542 ms ± 7.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So yes, there is a 10-fold inference speed difference on that net. But I have no idea why.

Maybe it is because se_resnext50 has some new layers which are fast only with certain third-party libraries that I did not compile against:

 Other third-party libraries:
    Intel IPP:                   2019.0.0 Gold [2019.0.0]
           at:                   /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/build_release/3rdparty/ippicv/ippicv_lnx/icv
    Intel IPP IW:                sources (2019.0.0)
              at:                /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/build_release/3rdparty/ippicv/ippicv_lnx/iw
    Inference Engine:            YES (2020010000 / 2.1.0)
        * libs:                  /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/inference_engine/lib/intel64/libinference_engine_c_api.so
        * includes:              /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/inference_engine/include
    nGraph:                      YES (0.27.1-rc.0+b0bb801)
        * libs:                  /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/ngraph/lib/libngraph.so
        * includes:              /home/jenkins/workspace/OpenCV/OpenVINO/2019R4/build/ubuntu18/ie/ngraph/include
    Custom HAL:                  NO
    Protobuf:                    build (3.5.1)
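For reference, this summary is OpenCV's build information string, which can be printed directly from Python:

import cv2

# prints the full build summary, including the "Other third-party libraries"
# section quoted above (IPP, Inference Engine, nGraph, etc.)
print(cv2.getBuildInformation())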

I'll try a few different builds.

Well, I managed to replicate the third-party lib setup:

  Other third-party libraries:
    Intel IPP:                   2019.0.0 Gold [2019.0.0]
           at:                   /home/ubuntu/opencv-python-inference-engine/build/opencv/3rdparty/ippicv/ippicv_lnx/icv
    Intel IPP IW:                sources (2019.0.0)
              at:                /home/ubuntu/opencv-python-inference-engine/build/opencv/3rdparty/ippicv/ippicv_lnx/iw
    Lapack:                      NO
    Inference Engine:            YES (2020010000 / Unknown)
        * libs:                  /home/ubuntu/opencv-python-inference-engine/dldt/bin/intel64/Release/lib/libinference_engine.so
        * includes:              /home/ubuntu/opencv-python-inference-engine/dldt/inference-engine/include
    nGraph:                      YES (0.27.1-rc.0+b0bb801)
        * libs:                  /home/ubuntu/opencv-python-inference-engine/dldt/bin/intel64/Release/lib/libngraph.so
        * includes:              /home/ubuntu/opencv-python-inference-engine/dldt/ngraph/src
    Custom HAL:                  NO
    Protobuf:                    build (3.5.1)

But nothing changed.

Maybe it is somehow related to the MKL-DNN or TBB libraries?

It could be, e.g., dldt compiled with -DGEMM=MKL.
I doubt that a TBB version difference could cause such a gap in inference speed.

But I'd better go and ask the dldt developers about it, and about how I can get dldt build info.

My question about dldt build info: https://stackoverflow.com/questions/60704887/q-how-can-i-get-dldt-buildinfo
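As a side note, the Inference Engine version can also be queried from Python, assuming the separate OpenVINO Python bindings are installed (they are not part of this wheel; the attribute names below are from the 2020.x API):

# sketch using the OpenVINO 2020.x Python API, not this wheel
from openvino.inference_engine import IECore

ie = IECore()
for device, info in ie.get_versions("CPU").items():
    print(device, info.major, info.minor, info.build_number)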

Also, it may be related to this: openvinotoolkit/openvino#166

I'll write here if I find something new.

So, I managed to solve it, at least to a first approximation.

I had to compile dldt with -D GEMM=MKL -D MKLROOT=... like in openvinotoolkit/openvino#327.

Now the inference speed of the provided NN is the same whether using OpenVINO or my wheel, ~48 ms, but it adds 125 MB (libmkml_gnu.so) to the wheel size, which is bad.

OpenVINO ships a ~30 MB mkl_tiny instead, but there are no instructions on how to build it.

Building mkl_tiny is not an option: oneapi-src/oneDNN#674

But I built a wheel with OpenBLAS and it has the same inference speed on se_net. Also, OpenBLAS is much smaller.

It seems MKL should be faster (https://software.intel.com/en-us/articles/performance-comparison-of-openblas-and-intel-math-kernel-library-in-r).

But if you are sure, I'm OK with that, thanks for your work!

Well, I had problems compiling OpenBLAS in the fastest way possible. Right now the fastest OpenBLAS variant is to use the precompiled Ubuntu lib:

| GEMM | Inference time (ms) | Lib size (MB) |
| --- | --- | --- |
| JIT | 851 | - |
| OpenBLAS 0.2.20 | 520 | 13.7 |
| OpenBLAS 0.3.9 | 160 | 16.4 |
| MKL (MKL-DNN) | 60 | 125 |
| OpenVINO (mkl_tiny) | 55 | 30 |
| OpenBLAS 0.2.20 (binary) | 60 | 32 |
| OpenMathLib/OpenBLAS@9f67d03 | 55 | 4.2 |

Inference times are for se_net.
OpenBLAS also requires some additional libs like gfortran, +2 MB in total.

So this wheel is about 10% slower with your "se_net" than OpenVINO.

And this without-gfortran wheel is 2 times slower than OpenVINO, but its libopenblas.so is only 3.9 MB.

Any feedback is welcome

UPD: With the help of the OpenBLAS contributors, I've managed to compile a 4.2 MB lib with approximately the same inference speed as mkl_tiny (for se_net).

Please refer here for details: OpenMathLib/OpenBLAS#2528

Solved with v4.2.0.4 release

Great update, thanks!

FYI

A new version of OpenVINO is out. From the release notes:

> CPU Plugin
> Removed the mkltiny dependency. This allowed to significantly decrease the binary distribution size required for inference on CPU.

@Kulikovpavel aha, thx. I'll need to do something about this at the weekend.

@Kulikovpavel now one can compile dldt with GEMM=JIT and get OpenBLAS-comparable speed on your net (and save 4.2 MB):

| GEMM | Inference time (ms) | Lib size (MB) |
| --- | --- | --- |
| JIT | 52 | - |
| OpenMathLib/OpenBLAS@9f67d03 | 51 | 4.2 |

Hi @banderlog, I have the same performance problem with your version of OpenCV in the new release, with the same network as above.
Version 4.3.0.2, could you check that?

@Kulikovpavel well, I'll check it in case I missed something, but I run inference speed tests with your network each time I build the library. Before this release I compared the inference speed of JIT vs OpenBLAS and kept OpenBLAS (approximately 50 vs 40 ms).

Do you have the same network and setup as before?

As you can see from the table below, everything is as it should be. I even tested the wheels from the repo and from PyPI separately.

| Version | Inference speed |
| --- | --- |
| 4.3.0.2 | 99.5 ms ± 1.22 ms |
| 4.3.0.1 | 100 ms ± 1.53 ms |
| 4.2.0.4 | 109 ms ± 3.71 ms |
| 4.2.0.3 | 1.07 s ± 13.3 ms |

In the message above I reported smaller numbers, but those were achieved in a different environment (an LXD Linux container).

Code for replication:

# sudo snap install multipass
multipass launch -c 6 -d 10G -m 7G -n test
multipass shell test

sudo apt-get update
sudo apt install git python3 virtualenv

git clone https://github.com/banderlog/opencv-python-inference-engine
cd opencv-python-inference-engine/tests/

# wget https://files.pythonhosted.org/packages/f0/ee/36d75596ce0b6212821510efb56dfca962f4add3fdaf49345bc93920a984/opencv_python_inference_engine-4.3.0.2-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.3.0.2/opencv_python_inference_engine-4.3.0.2-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.3.0.1/opencv_python_inference_engine-4.3.0.1-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.2.0.4/opencv_python_inference_engine-4.2.0.4-py3-none-manylinux1_x86_64.whl
wget https://github.com/banderlog/opencv-python-inference-engine/releases/download/v4.2.0.3/opencv_python_inference_engine-4.2.0.3-py3-none-manylinux1_x86_64.whl

# first run will take a lot of time, because it will install all needed python packages
./prepare_and_run_tests.sh opencv_python_inference_engine-4.3.0.2*
./prepare_and_run_tests.sh opencv_python_inference_engine-4.3.0.1*
./prepare_and_run_tests.sh opencv_python_inference_engine-4.2.0.4*
./prepare_and_run_tests.sh opencv_python_inference_engine-4.2.0.3*

If you run it, you will get different absolute inference times, but 4.2.0.3's should be roughly 10x greater than the others.
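A quick sanity check of which wheel is actually imported inside the test virtualenv (a hypothetical extra check, not part of the test script):

import cv2

# the version string should match the wheel under test, and the build info
# should report Inference Engine support
print(cv2.__version__)
print("Inference Engine" in cv2.getBuildInformation())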

@Kulikovpavel does your performance problem still persist?