NEPooling3dLayer performance issue
alvoron opened this issue · 8 comments
Output of 'strings libarm_compute.so | grep arm_compute_version':
arm_compute_version=v24.02.1 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'arch': 'armv8.6-a', 'Werror': 'false', 'validation_tests': '1', 'os': 'macos', 'build': 'native', 'fixed_format_kernels': '1'} Git hash=b'f2eda6665c12d568e179f5b0e7a24ccdc0ac824d'
Platform:
Apple M2 Pro
Operating System:
macOS 13.4
Problem description:
NEPooling3dLayer
has roughly twice the latency of the reference C++ pooling implementation: 6.5 ms vs 3.5 ms.
Reproducer
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "tests/SimpleTensor.h"
#include "arm_compute/runtime/Tensor.h"
#include "utils/TypePrinter.h"
#include "tests/Utils.h"
#include "tests/AssetsLibrary.h"
#include "tests/NEON/Accessor.h"
#include <string>
#include <chrono>
using namespace std;
using namespace arm_compute;
using namespace arm_compute::test;
int main()
{
    DataLayout dataLayout = DataLayout::NDHWC;
    TensorShape inTensorShape = TensorShape(192, 28, 28, 40, 1);
    TensorShape outTensorShape = inTensorShape;
    Tensor inputt;
    Tensor outputt;
    inputt.allocator()->init(TensorInfo(inTensorShape, 1, DataType::F32, dataLayout));
    outputt.allocator()->init(TensorInfo(outTensorShape, 1, DataType::F32, dataLayout));
    Pooling3dLayerInfo pool3d_info;
    pool3d_info.pool_type       = PoolingType::MAX;
    pool3d_info.exclude_padding = true;
    pool3d_info.pool_size       = arm_compute::Size3D(3, 3, 3);
    pool3d_info.stride          = arm_compute::Size3D(1, 1, 1);
    pool3d_info.padding         = arm_compute::Padding3D(1, 1, 1, 1, 1, 1);
    pool3d_info.round_type      = DimensionRoundingType::FLOOR;
    NEPooling3dLayer pooling;
    pooling.configure(&inputt, &outputt, pool3d_info);
    inputt.allocator()->allocate();
    outputt.allocator()->allocate();
    AssetsLibrary library(".", std::random_device()());
    std::uniform_real_distribution<> distribution{ 0.0f, 10.0f };
    library.fill(Accessor(inputt), distribution, 0);
    // warm-up
    pooling.run();
    std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++) pooling.run();
    std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
    uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
    std::cout << "time: " << total_duration / 100 << std::endl;
}
How the reproducer was built
clang++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
The reproducer gives ~6500 microseconds on my M2 Pro, which is about twice as slow as the reference C++ implementation of pooling.
Could you please review potential performance issues in NEPooling3dLayer?
I prepared a benchdnn reference reproducer and checked it on an Ampere server.
Benchdnn
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_RULE_MESSAGES=OFF -DONEDNN_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)
./build/tests/benchdnn/benchdnn --mode=P --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=pooling_max --dt=f32:f32 --tag=acdeb --attr-scratchpad=user mb1ic192_id40od40kd3sd1dd0pd1_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1
The last benchdnn command gives me min(ms):0.673584 avg(ms):0.787748 on Ampere.
ACL
scons neon=1 opencl=0 openmp=1 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 --silent fixed_format_kernels=True validation_tests=1
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_pooling.cpp -o acl_pooling -L./ComputeLibrary/build/ -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
LD_LIBRARY_PATH=ComputeLibrary/build ./acl_pooling
The last command gives me 2086 microseconds on Ampere.
Hi @alvoron
Could you please try rebuilding the library with openmp=1 cppthreads=0?
Hope this helps
I rebuilt ACL:
arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '0', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'
and got 2072 microseconds on Ampere, so the issue still remains.
P.S. I also wasn't able to build ACL with validation_tests=1 and openmp=1 because of an undefined reference issue:
/usr/bin/ld: build/tests/validation/UNIT/CPPScheduler.o: in function `UNITSuite::CPPSchedulerSuite::RethrowException::do_run()':
CPPScheduler.cpp:(.text+0xd0): undefined reference to `arm_compute::CPPScheduler::CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x150): undefined reference to `arm_compute::CPPScheduler::set_num_threads(unsigned int)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x160): undefined reference to `arm_compute::CPPScheduler::schedule(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&)'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x4a4): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x59c): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
/usr/bin/ld: CPPScheduler.cpp:(.text+0x684): undefined reference to `arm_compute::CPPScheduler::~CPPScheduler()'
That's why I set validation_tests=0 and removed the inputt filling logic from the reproducer. I believe this shouldn't affect performance.
Hi @alvoron
> The reproducer gives ~6500 microseconds on my M2 Pro, which is twice slower than reference C++ implementation of Pooling.
Can you please point us to the actual reference implementation you're using? How do you take the measurements for the two backends, reference and ACL? Is it a single binary you're using?
Hi @alvoron
I made some changes to our validation suite to assess the performance; see the results below. The neon backend is much faster than our reference code.
ComputeLibrary % ./build/tests/arm_compute_validation "--filter=.*Pooling3d.*" --mode=NIGHTLY --threads=4
...
Running [337] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 873
reference time: 50789
Wall clock/Wall clock time: AVG=32352.0000 us
Running [338] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 1006
reference time: 56723
Wall clock/Wall clock time: AVG=38709.0000 us
Running [339] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 1049
reference time: 56795
Wall clock/Wall clock time: AVG=38914.0000 us
Running [340] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 918
reference time: 51994
Wall clock/Wall clock time: AVG=34195.0000 us
Running [341] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=2x2x2:Stride=2x1x1:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 934
reference time: 51818
Wall clock/Wall clock time: AVG=34168.0000 us
Running [342] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 661
reference time: 21681
Wall clock/Wall clock time: AVG=7178.0000 us
Running [343] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=0,0,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 662
reference time: 21722
Wall clock/Wall clock time: AVG=7316.0000 us
Running [344] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=1:DataType=F32'
neon time: 733
reference time: 25640
Wall clock/Wall clock time: AVG=8681.0000 us
Running [345] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,1,1,1,1:ExcludePadding=0:DataType=F32'
neon time: 704
reference time: 25464
Wall clock/Wall clock time: AVG=8755.0000 us
Running [346] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=1:DataType=F32'
neon time: 648
reference time: 22707
Wall clock/Wall clock time: AVG=7663.0000 us
Running [347] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x2x2:Padding=1,1,0,0,0,0:ExcludePadding=0:DataType=F32'
neon time: 661
reference time: 22717
Wall clock/Wall clock time: AVG=7742.0000 us
Running [348] 'NEON/Pooling3dLayer/Float/FP32/RunLarge@Shape=30,40,30,32,3:PoolType=MAX:PoolingSize=3x3x3:Stride=2x1x1:Padding=0,0,0,0,0,0:ExcludePadding=1:DataType=F32'
> Can you please point us to the actual reference implementation you're using? How do you make the measurements for both backends reference and ACL? Is it a single binary you're using?
May we use the benchdnn results as the reference? I repeated the benchdnn and ACL commands on Ampere and got an average of 2.3-2.6 ms with the ACL reproducer and an average of 0.9 ms with benchdnn.
I assume my benchdnn command matches the ACL kernel configuration. Please let me know if I missed something.