NEDeconvolutionLayer f16 performance issue
alvoron opened this issue · 4 comments
NEDeconvolutionLayer::run() with f16 tensors takes more time than NEDeconvolutionLayer::run() with f32 tensors.
On Ampere, the f32 version takes ~66 milliseconds and the f16 version ~80 milliseconds.
ACL build command:
scons arch=armv8.6-a neon=1 os=linux opencl=0 build=native -j 64 Werror=false validation_tests=1 fixed_format_kernels=1 multi_isa=1 openmp=0 cppthreads=1
Reproducer build command:
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include ~/avoron/acl_deconv.cpp -o bug -L./ComputeLibrary/build/ -larm_compute ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o ./ComputeLibrary/build/tests/framework/Exceptions.o -std=c++17
Reproducer run commands:
LD_LIBRARY_PATH=ComputeLibrary/build ./bug
LD_LIBRARY_PATH=ComputeLibrary/build ./bug 1
The first command uses f32 tensors, the second one uses f16 tensors.
Reproducer:
#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "tests/Utils.h"
#include "tests/NEON/Accessor.h"
#include "tests/AssetsLibrary.h"
#include <iostream>
#include <vector>
#include <chrono>
using namespace arm_compute;
using namespace arm_compute::test;
int main(int argc, char *argv[]) {
PadStrideInfo deconv_info = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);
//f32 if no argument passed; f16 if any argument passed
DataType dt = (argc == 1) ? DataType::F32 : DataType::F16;
TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 640, 360, 1), 1, dt, DataLayout::NHWC);
TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, dt, DataLayout::NHWC);
TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 1920, 1080, 1), 1, dt, DataLayout::NHWC);
auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconv_info);
if(status.error_code() != ErrorCode::OK) {
std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
exit(1);
}
std::cout << "PASSED VALIDATION" << std::endl;
Tensor srcTensor;
Tensor weiTensor;
Tensor dstTensor;
srcTensor.allocator()->init(srcTensorInfo);
weiTensor.allocator()->init(weiTensorInfo);
dstTensor.allocator()->init(dstTensorInfo);
NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info);
std::cout << "PASSED CONFIGURATION" << std::endl;
srcTensor.allocator()->allocate();
weiTensor.allocator()->allocate();
dstTensor.allocator()->allocate();
AssetsLibrary library(".", std::random_device()());
std::uniform_real_distribution<> distribution{ 0.0f, 100.0f };
library.fill(Accessor(srcTensor), distribution, 0);
library.fill(Accessor(weiTensor), distribution, 0);
//warm-up
deconv.run();
std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 100; i++) deconv.run();
std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
std::cout << "PASSED RUN: " << std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count() / 100 << std::endl;
srcTensor.allocator()->free();
weiTensor.allocator()->free();
dstTensor.allocator()->free();
return 0;
}
Hi @alvoron
Thanks. I can reproduce the problem. FP32 performance for this specific configuration is better than FP16. It will require further investigation.
Hi @alvoron
The following patch solves the problem: make sure that you enable fast_math in your test when calling NEDeconvolutionLayer::configure().
See the required change to your test below:
NEDeconvolutionLayer deconv;
deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconv_info, /* enable fast match */ true);
std::cout << "PASSED CONFIGURATION" << std::endl;
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test 1
F16
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 151639
[user@test_deconv]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test
F32
PASSED VALIDATION
PASSED CONFIGURATION
PASSED RUN: 221537
Hope this helps.
@morgolock thank you for the patch, it works for me as well.
However, the difference between f32 and f16 is not as large for me as in your results: I get 65-67 ms on f32 and 60-62 ms on f16.
What machine did you use to get the results you shared above?
Hi @alvoron
I ran this on Neoverse N1.
I built the library with scons -j32 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 validation_tests=1 os=linux arch=armv8a build=native multi_isa=1 fixed_format_kernels=1 openmp=1 cppthreads=0 asserts=0 logging=0
Make sure you build with openmp=1 cppthreads=0.
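Since the threading backend affects these numbers, it can also help to pin the thread count explicitly in the reproducer before the timed loop. A minimal sketch using the Scheduler API (the count of 8 is an arbitrary assumption, pick a value matching your core count):

#include "arm_compute/runtime/Scheduler.h"

// Sketch: fix the number of worker threads before timing so results are
// comparable across openmp and cppthreads builds.
arm_compute::Scheduler::get().set_num_threads(8); // assumption: 8 cores available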
Hope this helps.