CpuGemmConv2d optimization affects performance on Apple M2/M2 Pro
alvoron opened this issue · 9 comments
PR https://review.mlplatform.org/c/ml/ComputeLibrary/+/10526 makes CpuGemmConv2d slower on Apple M2 / M2 Pro.
The numbers below were collected on M2 Pro.
- mobilenet-v2-1.0-224: CpuGemmConv2d takes 3.18 ms before the PR and 4.12 ms after the PR was merged.
- resnet-50-pytorch: 16.37 ms before the PR; 19.67 ms after the PR.
So we see a 20-30% performance degradation on these CNNs.
Hi @alvoron
Thanks for reporting this.
Would you please let us know how many inferences/iterations you are running?
I run each model for 30 seconds and calculate the average execution time of each operation type. So I have 7319 iterations of mobilenet-v2-1.0-224 and 1536 iterations of resnet-50-pytorch.
Hi @alvoron
The mentioned patch should affect the start-up time only, i.e. the first iteration. I wonder if your runs call configure() each iteration, or call configure() only in the first iteration and run() in the remaining ones.
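For context, here is a minimal sketch of that distinction against ACL's public NEGEMMConvolutionLayer API (which wraps CpuGemmConv2d). The shapes and setup are illustrative assumptions, not taken from this issue: configure() is the one-off start-up cost the patch is expected to affect, while run() is the per-iteration cost that the timings above should be measuring.

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // Illustrative shapes only: a batch-1, 1x1, 1280->1001-channel F32 convolution
    // in the default NCHW layout.
    Tensor src, weights, biases, dst;
    src.allocator()->init(TensorInfo(TensorShape(1U, 1U, 1280U, 1U), 1, DataType::F32));
    weights.allocator()->init(TensorInfo(TensorShape(1U, 1U, 1280U, 1001U), 1, DataType::F32));
    biases.allocator()->init(TensorInfo(TensorShape(1001U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(1U, 1U, 1001U, 1U), 1, DataType::F32));

    NEGEMMConvolutionLayer conv;
    // Start-up cost: configure() is called once, before the timed loop.
    conv.configure(&src, &weights, &biases, &dst, PadStrideInfo(1, 1, 0, 0));

    // Backing memory is allocated after configure(); input data is left
    // uninitialised here, since this only illustrates the call pattern.
    src.allocator()->allocate();
    weights.allocator()->allocate();
    biases.allocator()->allocate();
    dst.allocator()->allocate();

    // Steady-state cost: only run() is executed per iteration; this is what the
    // per-operation timings should reflect if configure() happens only once.
    for(int i = 0; i < 1000; ++i)
    {
        conv.run();
    }
    return 0;
}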
OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure once, via the acl_gemm_convolution_fwd_t::create_resource() method:
https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44
Hi @alvoron
I ran ACL's benchmark_graph_mobilenet_v2 on a device with an M2 but I could not see a significant performance degradation.
Below is the execution including the patch that you mentioned:
% ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000 --example_args='--threads=1,--target=NEON,--type=F32'
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'c5ab4df0c11dc66db47f2070edc719923af3367e'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2 --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file :
MLGO file :
Fast math enabled? : false
Wall clock/Wall clock time: AVG=6620.1732 us, STDDEV=2.62 %, MIN=6594.0000 us, MAX=10888.0000 us, MEDIAN=6608.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)
And this is without the patch
ComputeLibrary % ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000 --example_args='--threads=1,--target=NEON,--type=F32'
Version = arm_compute_version=v0.0-unreleased Build options: {'neon': '1', 'opencl': '0', 'benchmark_tests': '0', 'examples': '0', 'benchmark_examples': '1', 'os': 'macos', 'arch': 'armv8a', 'multi_isa': '0', 'logging': '0', 'asserts': '0', 'standalone': '0', 'validation_tests': '0', 'build': 'native'} Git hash=b'4a9dbedfbfa66c2612c7461e60cd867b8aea825b'
CommandLine = ./build/tests/benchmark_graph_mobilenet_v2_reverted --iterations=1000 --example_args=--threads=1,--target=NEON,--type=F32
Iterations = 1000
Running [0] 'Examples/benchmark_graph_mobilenet_v2_reverted'
Threads : 1
Target : Neon
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Cache enabled? : false
Tuner mode : Normal
Tuner file :
MLGO file :
Fast math enabled? : false
Wall clock/Wall clock time: AVG=6600.4505 us, STDDEV=0.88 %, MIN=6581.0000 us, MAX=8123.0000 us, MEDIAN=6596.0000 us
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 6 second(s)
6620.1732 us - 6600.4505 us = 19.7227 us
19.7227 us / 6620.1732 us ≈ 0.003, i.e. a difference of about 0.3%
Would you please confirm if you experience the problem on other devices?
Can you please share the models you are running? Are there tflite files?
OpenVINO uses oneDNN to call ACL's convolution. It seems oneDNN calls configure once via acl_gemm_convolution_fwd_t::create_resource() method: https://github.com/openvinotoolkit/oneDNN/blob/f82148befdbdc9576ec721c9d500155ee4de8060/src/cpu/acl/acl_gemm_convolution.hpp#L44
With DNNL_VERBOSE enabled, is OpenVINO recreating the resource or is it getting oneDNN cache hits? Some frameworks have their own caching mechanisms.
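For example, creation vs. cache-hit information can be checked with oneDNN's verbose output (the binary name below is a placeholder for whatever OpenVINO benchmark is being run, and the exact line format depends on the oneDNN version):
DNNL_VERBOSE=2 ./your_openvino_benchmark 2>&1 | grep -E 'create:cache_(hit|miss)'
Repeated create:cache_miss lines for the convolution would suggest the primitive is being recreated rather than served from the oneDNN primitive cache.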
It seems the issue can be reproduced via benchdnn without OpenVINO.
ACL build command:
scons neon=1 opencl=0 openmp=0 cppthreads=1 os=macos data_layout_support=all arch=arm64-v8.2-a build=native --jobs=8 os=macos build=native compiler_cache=ccache compiler_prefix="/Library/Developer/CommandLineTools/usr/bin/" --silent fixed_format_kernels=True
onednn configure command (run in onednn root dir):
ACL_ROOT_DIR=$PWD/../ComputeLibrary cmake -B build -DCMAKE_BUILD_TYPE=Release -DDNNL_USE_ACL=ON -DCMAKE_RULE_MESSAGES=OFF -DACL_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.dylib -DACL_CORE_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_core.dylib -DACL_GRAPH_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_graph.dylib
benchdnn build command:
cmake --build build --target benchdnn --parallel 7
The reproducer (a batch-1, 1x1, 1280->1001-channel F32 convolution):
DYLD_LIBRARY_PATH=$PWD/../ComputeLibrary/build ./build/tests/benchdnn/benchdnn --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic1280oc1001_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0
On M2 Pro I've got min(ms):0.255333 avg(ms):0.357945 on ACL SHA c5ab4df0c11dc66db47f2070edc719923af3367e and min(ms):0.042875 avg(ms):0.0624329 on SHA 4a9dbedfbfa66c2612c7461e60cd867b8aea825b, i.e. the average time is roughly 5.7x higher with the patch.
@morgolock could you please try to repeat these steps?
UPD:
A couple of comments:
- Please take the oneDNN fork that is used by OpenVINO: https://github.com/openvinotoolkit/oneDNN (SHA 4e29b771fcdfab5bdb219a495e694d6206e52b67)
- You need to apply 2 small changes to oneDNN to adopt the new version of ACL: openvinotoolkit/oneDNN@19bb9f2...d76046a
- I reproduced the issue using benchdnn on a Mac M1 mini: total perf: min(ms):0.273542 avg(ms):0.309104 on c5ab4df0c11dc66db47f2070edc719923af3367e and total perf: min(ms):0.0366251 avg(ms):0.0638425 on 4a9dbedfbfa66c2612c7461e60cd867b8aea825b