ARM-software/ComputeLibrary

CpuGemmConv2d optimization affects performance on Apple M2/M2 Pro

alvoron opened this issue · 9 comments

PR https://review.mlplatform.org/c/ml/ComputeLibrary/+/10526 makes CpuGemmConv2d slower on Apple M2 / M2 Pro.

The numbers below were collected on M2 Pro.
On mobilenet-v2-1.0-224, CpuGemmConv2d takes 3.18 ms before the PR and 4.12 ms after it was merged.
On resnet-50-pytorch it takes 16.37 ms before the PR and 19.67 ms after.

So we see a 20-30% performance degradation on these CNNs.

@sicong-li-arm @gunes-arm @aniraj01

Hi @alvoron

Thanks for reporting this.

Would you please let us know how many inferences/iterations you are running?

I run each model for 30 seconds and calculate the average execution time of each operation type.
That gives 7319 iterations of mobilenet-v2-1.0-224 and 1536 iterations of resnet-50-pytorch.

Hi @alvoron

The mentioned patch should only affect the start-up time, i.e. the first iteration. I wonder whether your runs call configure() on every iteration, or configure() only in the first iteration and run() in the remaining ones (see the sketch below).
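For reference, here is a minimal sketch of the pattern we would expect, assuming the public NEGEMMConvolutionLayer function (which dispatches to CpuGemmConv2d internally); the shapes and layer parameters are illustrative only:

#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/functions/NEGEMMConvolutionLayer.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // Illustrative shapes: 224x224x3 input, 16 3x3 filters, same-size output.
    Tensor src, weights, biases, dst;
    src.allocator()->init(TensorInfo(TensorShape(224U, 224U, 3U, 1U), 1, DataType::F32));
    weights.allocator()->init(TensorInfo(TensorShape(3U, 3U, 3U, 16U), 1, DataType::F32));
    biases.allocator()->init(TensorInfo(TensorShape(16U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(224U, 224U, 16U, 1U), 1, DataType::F32));

    NEGEMMConvolutionLayer conv;
    // configure() is the start-up cost the patch touches; it should run once.
    conv.configure(&src, &weights, &biases, &dst, PadStrideInfo(1, 1, 1, 1));

    src.allocator()->allocate();
    weights.allocator()->allocate();
    biases.allocator()->allocate();
    dst.allocator()->allocate();
    // ... fill src/weights/biases with real data here ...

    // run() is what should be timed per iteration.
    for(int i = 0; i < 1000; ++i)
    {
        conv.run();
    }
    return 0;
}

If the benchmark re-runs configure() on every iteration, any extra start-up work added by the patch would show up as a per-iteration regression.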

OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure() only once, via the acl_gemm_convolution_fwd_t::create_resource() method:
https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44

Hi @alvoron

I ran ACL's benchmark_graph_mobilenet_v2 on a device with M2 but I could not see a significant performance degradation.

Below is the execution including the patch that you mentioned:

% ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000  --example_args='--threads=1,--target=NEON,--type=F32'
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'c5ab4df0c11dc66db47f2070edc719923af3367e'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32 
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=6620.1732 us, STDDEV=2.62 %, MIN=6594.0000 us, MAX=10888.0000 us, MEDIAN=6608.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)

And this is the execution without the patch:

ComputeLibrary % ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000  --example_args='--threads=1,--target=NEON,--type=F32' 
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'4a9dbedfbfa66c2612c7461e60cd867b8aea825b'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32 
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2_reverted'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file : 
MLGO file : 
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=6600.4505 us, STDDEV=0.88 %, MIN=6581.0000 us, MAX=8123.0000 us, MEDIAN=6596.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)

6620.1732 us - 6600.4505 us = 19.7227 us
19.7227 us / 6620.1732 us ≈ 0.003 (about 0.3%)

Would you please confirm whether you experience the problem on other devices?
Can you please share the models you are running? Are they tflite files?

OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure() only once, via the acl_gemm_convolution_fwd_t::create_resource() method: https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44

With DNNL_VERBOSE enabled, is OpenVINO recreating the resource or is it getting oneDNN cache hits? Some frameworks have their own caching mechanisms (see the sketch below for the general pattern).
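To illustrate the caching point: the usual mechanism is a map from a problem descriptor (shapes, data types, attributes) to an already-configured primitive, so only the first inference for a given problem pays the creation/configure() cost. A generic sketch, with hypothetical names that are not OpenVINO's or oneDNN's actual code:

#include <map>
#include <memory>
#include <string>

// Stand-in for a configured ACL/oneDNN convolution object (hypothetical type).
struct ConfiguredConv { };

// Hypothetical cache keyed by a descriptor string; real frameworks use richer keys.
static std::map<std::string, std::shared_ptr<ConfiguredConv>> primitive_cache;

std::shared_ptr<ConfiguredConv> get_or_create(const std::string &key)
{
    auto it = primitive_cache.find(key);
    if (it != primitive_cache.end())
        return it->second;                           // cache hit: no set-up cost
    auto conv = std::make_shared<ConfiguredConv>();  // cache miss: configure() would happen here
    primitive_cache.emplace(key, conv);
    return conv;
}

With such a cache in place, verbose output should show primitive creation only on the first iteration for each layer, not on every inference.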

It seems the issue can be reproduced via benchdnn, without OpenVINO.

ACL build command:
scons neon=1 opencl=0 openmp=0 cppthreads=1 os=macos data_layout_support=all arch=arm64-v8.2-a build=native --jobs=8 compiler_cache=ccache compiler_prefix="/Library/Developer/CommandLineTools/usr/bin/" --silent fixed_format_kernels=True

oneDNN configure command (run in the oneDNN root dir):
ACL_ROOT_DIR=$PWD/../ComputeLibrary cmake -B build -DCMAKE_BUILD_TYPE=Release -DDNNL_USE_ACL=ON -DCMAKE_RULE_MESSAGES=OFF -DACL_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.dylib -DACL_CORE_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_core.dylib -DACL_GRAPH_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_graph.dylib

benchdnn build command:
cmake --build build --target benchdnn --parallel 7

The reproducer:
DYLD_LIBRARY_PATH=$PWD/../ComputeLibrary/build ./build/tests/benchdnn/benchdnn --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic1280oc1001_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0

On M2 Pro I got min(ms):0.255333 avg(ms):0.357945 on ACL SHA c5ab4df0c11dc66db47f2070edc719923af3367e and min(ms):0.042875 avg(ms):0.0624329 on SHA 4a9dbedfbfa66c2612c7461e60cd867b8aea825b, i.e. roughly a 6x slowdown on this single layer.

@morgolock could you please try to repeat these steps?

UPD:
A couple of comments:

  1. Please use the oneDNN fork that is used by OpenVINO: https://github.com/openvinotoolkit/oneDNN (SHA - 4e29b771fcdfab5bdb219a495e694d6206e52b67)
  2. You need to apply 2 small changes to oneDNN to adapt it to the new version of ACL: openvinotoolkit/oneDNN@19bb9f2...d76046a
  3. I reproduced the issue using benchdnn on a Mac M1 mini: total perf: min(ms):0.273542 avg(ms):0.309104 on c5ab4df0c11dc66db47f2070edc719923af3367e and total perf: min(ms):0.0366251 avg(ms):0.0638425 on 4a9dbedfbfa66c2612c7461e60cd867b8aea825b

Hi @alvoron

Thanks for reporting this performance regression and providing so much detail.

We have merged a patch fixing the problem into the main development branch, and we will do a patch release of 24.02 including this fix.

Hope this helps

Hi @alvoron

Closing this as it was fixed in 24.02.1.

Please reopen if you require further assistance.