NEGEMM needs `configure` to be called before each `run`
alvoron opened this issue · 7 comments
It seems that `configure` needs to be called before each `run` for NEGEMM. Because of this issue we have a workaround in our oneDNN fork: openvinotoolkit/oneDNN@a5ca47f. The issue can be reproduced on ACL 24.08.
Reproducer:
#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "tests/Utils.h"
#include "tests/AssetsLibrary.h"
#include "tests/NEON/Accessor.h"
#include "tests/validation/Validation.h"
#include <iostream>
#include <vector>
using namespace arm_compute;
using namespace arm_compute::test;
using namespace arm_compute::test::validation;
void fillTensor(Tensor& t, float min, float max) {
AssetsLibrary library(".", std::random_device()());
std::uniform_real_distribution<> dist{ min, max };
library.fill(Accessor(t), dist, 0);
}
int main(int argc, char *argv[]) {
GEMMInfo gemmInfo;
TensorInfo srcTensorInfo = TensorInfo(TensorShape(8, 1, 1, 4), 1, DataType::F32, DataLayout::NCHW);
TensorInfo weiTensorInfo = TensorInfo(TensorShape(3, 8, 4), 1, DataType::F32, DataLayout::NCHW);
TensorInfo dstTensorInfo = TensorInfo(TensorShape(3, 1, 1, 4), 1, DataType::F32, DataLayout::NCHW);
auto status = NEGEMM::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, 1.0f, 0.0f, gemmInfo);
if(status.error_code() != ErrorCode::OK) {
std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
exit(1);
}
std::cout << "PASSED VALIDATION" << std::endl;
Tensor srcTensor;
Tensor weiTensor;
Tensor dstTensorSingleConfigure;
Tensor dstTensorConfigurePerRun;
srcTensor.allocator()->init(srcTensorInfo);
weiTensor.allocator()->init(weiTensorInfo);
dstTensorSingleConfigure.allocator()->init(dstTensorInfo);
dstTensorConfigurePerRun.allocator()->init(dstTensorInfo);
NEGEMM gemmSingleConfigure;
NEGEMM gemmConfigurePerRun;
gemmSingleConfigure.configure(&srcTensor, &weiTensor, nullptr, &dstTensorSingleConfigure, 1.0f, 0.0f, gemmInfo);
gemmConfigurePerRun.configure(&srcTensor, &weiTensor, nullptr, &dstTensorConfigurePerRun, 1.0f, 0.0f, gemmInfo);
std::cout << "PASSED CONFIGURATION" << std::endl;
srcTensor.allocator()->allocate();
weiTensor.allocator()->allocate();
dstTensorSingleConfigure.allocator()->allocate();
dstTensorConfigurePerRun.allocator()->allocate();
fillTensor(srcTensor, 0.5f, 5.0f);
fillTensor(weiTensor, 5.0f, 10.0f);
gemmSingleConfigure.run();
gemmConfigurePerRun.run();
std::cout << "PASSED RUN #1" << std::endl;
std::cout << "dstTensorSingleConfigure" << std::endl;
dstTensorSingleConfigure.print(std::cout);
std::cout << "dstTensorConfigurePerRun" << std::endl;
dstTensorConfigurePerRun.print(std::cout);
fillTensor(srcTensor, 0.5f, 10.0f);
fillTensor(weiTensor, 5.0f, 20.0f);
gemmConfigurePerRun.configure(&srcTensor, &weiTensor, nullptr, &dstTensorConfigurePerRun, 1.0f, 0.0f, gemmInfo);
gemmSingleConfigure.run();
gemmConfigurePerRun.run();
std::cout << "PASSED RUN #2" << std::endl;
std::cout << "dstTensorSingleConfigure" << std::endl;
dstTensorSingleConfigure.print(std::cout);
std::cout << "dstTensorConfigurePerRun" << std::endl;
dstTensorConfigurePerRun.print(std::cout);
srcTensor.allocator()->free();
weiTensor.allocator()->free();
dstTensorSingleConfigure.allocator()->free();
dstTensorConfigurePerRun.allocator()->free();
return 0;
}
The dst tensors `dstTensorSingleConfigure` and `dstTensorConfigurePerRun` are equal after the 1st run, but differ after the 2nd run. Are we required to call `configure` before each NEGEMM `run`?
How ACL was built:
scons arch=armv8.6-a neon=1 os=linux opencl=0 build=native -j 32 debug=1 Werror=false validation_tests=1 fixed_format_kernels=1 multi_isa=1 openmp=0 cppthreads=1 --silent
How to build the reproducer:
g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include ~/avoron/acl_gemm.cpp -o bug ./ComputeLibrary/build/tests/AssetsLibrary.o ./ComputeLibrary/build/tests/RawTensor.o -L./ComputeLibrary/build/tests/ -L./ComputeLibrary/build/tests/framework/ -L./ComputeLibrary/build/ -larm_compute_validation_framework -larm_compute_test_framework -larm_compute -std=c++17
Hi @alvoron
There is no need to call `configure` again if you only update the weights and input. In your test, make sure you set the flag `reshape_b_only_on_first_run` to false in `GEMMInfo`:
GEMMInfo gemmInfo(false,  // is_a_reshaped
                  false,  // is_b_reshaped
                  false); // reshape_b_only_on_first_run
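With that flag, B is reshaped on every run, so the pattern is: configure once, then update the tensor contents in place and call run() repeatedly. A minimal sketch along the lines of the reproducer above (tensor names mirror the reproducer; num_inferences is a placeholder):

NEGEMM gemm;
gemm.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, 1.0f, 0.0f, gemmInfo);
for (int i = 0; i < num_inferences; ++i) {
    fillTensor(srcTensor, 0.5f, 5.0f);  // refresh the input in place
    fillTensor(weiTensor, 5.0f, 10.0f); // refresh the weights in place
    gemm.run();                         // no reconfigure needed: B is reshaped on each run
}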
See the result below:
[gemm]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test2
PASSED VALIDATION
[ComputeLibrary][12-09-2024 02:16:06][INFO] arm_compute::cpu::CpuGemm::configure() :
a: Shape=8,DataLayout=NCHW,DataType=F32
b: Shape=3,8,4,DataLayout=NCHW,DataType=F32
c: nullptr
d: Shape=3,DataLayout=NCHW,DataType=F32
alpha: 1.000000
beta: 0.000000
gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=0,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}
[ComputeLibrary][12-09-2024 02:16:06][INFO] arm_compute::cpu::CpuGemm::configure() :
a: Shape=8,DataLayout=NCHW,DataType=F32
b: Shape=3,8,4,DataLayout=NCHW,DataType=F32
c: nullptr
d: Shape=3,DataLayout=NCHW,DataType=F32
alpha: 1.000000
beta: 0.000000
gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=0,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}
PASSED CONFIGURATION
PASSED RUN #1
dstTensorSingleConfigure
138.176 119.903 134.778
dstTensorConfigurePerRun
138.176 119.903 134.778
[ComputeLibrary][12-09-2024 02:16:06][INFO] arm_compute::cpu::CpuGemm::configure() :
a: Shape=8,DataLayout=NCHW,DataType=F32
b: Shape=3,8,4,DataLayout=NCHW,DataType=F32
c: nullptr
d: Shape=3,DataLayout=NCHW,DataType=F32
alpha: 1.000000
beta: 0.000000
gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=0,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}
PASSED RUN #2
dstTensorSingleConfigure
423.563 522.41 432.69
dstTensorConfigurePerRun
423.563 522.41 432.69
Hope this helps.
@morgolock Did you modify my reproducer in any way other than the `gemmInfo` initialization? I get the error `Maximum number of dimensions expected 3 but dimension 3 is not empty` if I initialize `gemmInfo` that way and skip the 2nd configuration in the reproducer.
I also noticed that `GEMMInfo` is not initialized in oneDNN: https://github.com/oneapi-src/oneDNN/blob/main/src/cpu/aarch64/matmul/acl_matmul_utils.hpp#L66
Does this mean we need to apply a patch to oneDNN to fix this issue with the 2nd configuration?
Hi @alvoron
> Did you modify my reproducer in any way other than the `gemmInfo` initialization?
> I get the error `Maximum number of dimensions expected 3 but dimension 3 is not empty` if I initialize `gemmInfo` that way and skip the 2nd configuration in the reproducer.
Yes, see below: I dropped the 4th dimension.
TensorInfo srcTensorInfo = TensorInfo(TensorShape(8, 1, 1), 1, DataType::F32, DataLayout::NCHW);
TensorInfo weiTensorInfo = TensorInfo(TensorShape(3, 8, 4), 1, DataType::F32, DataLayout::NCHW);
TensorInfo dstTensorInfo = TensorInfo(TensorShape(3, 1, 1), 1, DataType::F32, DataLayout::NCHW);

auto status = NEGEMM::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, 1.0f, 0.0f, gemmInfo);
if (status.error_code() != ErrorCode::OK) {
    std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
    exit(1);
}
std::cout << "PASSED VALIDATION" << std::endl;
@morgolock In theory, a `configure` call per inference should not break accuracy, right? I compared the `dst` tensors for 2 cases:
- `configure` per inference + `reshape_b_only_on_first_run=true`
- a single `configure` + `reshape_b_only_on_first_run=false`

In these 2 cases I got different results for both the 1st and the 2nd run:
PASSED RUN #1
dstTensorSingleConfigure
178.938 183.26 180.34
dstTensorConfigurePerRun
157.27 192.243 178.135
PASSED RUN #2
dstTensorSingleConfigure
516.653 679.692 532.997
dstTensorConfigurePerRun
571.231 567.867 578.185
Modified reproducer:
#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
//#include "arm_compute/function_info/GEMMInfo.h"
#include "tests/Utils.h"
#include "tests/AssetsLibrary.h"
#include "tests/NEON/Accessor.h"
#include "tests/validation/Validation.h"
#include <iostream>
#include <vector>
using namespace arm_compute;
using namespace arm_compute::test;
using namespace arm_compute::test::validation;
void fillTensor(Tensor& t, float min, float max) {
AssetsLibrary library(".", std::random_device()());
std::uniform_real_distribution<> dist{ min, max };
library.fill(Accessor(t), dist, 0);
}
int main(int argc, char *argv[]) {
GEMMInfo gemmInfo_reshape_always(false, false, false);
GEMMInfo gemmInfo_reshape_1st_run_only;
TensorInfo srcTensorInfo = TensorInfo(TensorShape(8, 1, 1), 1, DataType::F32, DataLayout::NCHW);
TensorInfo weiTensorInfo = TensorInfo(TensorShape(3, 8, 4), 1, DataType::F32, DataLayout::NCHW);
TensorInfo dstTensorInfo = TensorInfo(TensorShape(3, 1, 1), 1, DataType::F32, DataLayout::NCHW);
auto status = NEGEMM::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, 1.0f, 0.0f, gemmInfo_reshape_1st_run_only);
if(status.error_code() != ErrorCode::OK) {
std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
exit(1);
}
std::cout << "PASSED VALIDATION #1" << std::endl;
status = NEGEMM::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, 1.0f, 0.0f, gemmInfo_reshape_always);
if(status.error_code() != ErrorCode::OK) {
std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
exit(1);
}
std::cout << "PASSED VALIDATION #2" << std::endl;
Tensor srcTensor;
Tensor weiTensor;
Tensor dstTensorSingleConfigure;
Tensor dstTensorConfigurePerRun;
srcTensor.allocator()->init(srcTensorInfo);
weiTensor.allocator()->init(weiTensorInfo);
dstTensorSingleConfigure.allocator()->init(dstTensorInfo);
dstTensorConfigurePerRun.allocator()->init(dstTensorInfo);
NEGEMM gemmSingleConfigure;
NEGEMM gemmConfigurePerRun;
gemmSingleConfigure.configure(&srcTensor, &weiTensor, nullptr, &dstTensorSingleConfigure, 1.0f, 0.0f, gemmInfo_reshape_always);
gemmConfigurePerRun.configure(&srcTensor, &weiTensor, nullptr, &dstTensorConfigurePerRun, 1.0f, 0.0f, gemmInfo_reshape_1st_run_only);
std::cout << "PASSED CONFIGURATION" << std::endl;
srcTensor.allocator()->allocate();
weiTensor.allocator()->allocate();
dstTensorSingleConfigure.allocator()->allocate();
dstTensorConfigurePerRun.allocator()->allocate();
fillTensor(srcTensor, 0.5f, 5.0f);
fillTensor(weiTensor, 5.0f, 10.0f);
gemmSingleConfigure.run();
gemmConfigurePerRun.run();
std::cout << "PASSED RUN #1" << std::endl;
std::cout << "dstTensorSingleConfigure" << std::endl;
dstTensorSingleConfigure.print(std::cout);
std::cout << "dstTensorConfigurePerRun" << std::endl;
dstTensorConfigurePerRun.print(std::cout);
fillTensor(srcTensor, 0.5f, 10.0f);
fillTensor(weiTensor, 5.0f, 20.0f);
gemmConfigurePerRun.configure(&srcTensor, &weiTensor, nullptr, &dstTensorConfigurePerRun, 1.0f, 0.0f, gemmInfo_reshape_1st_run_only);
gemmSingleConfigure.run();
gemmConfigurePerRun.run();
std::cout << "PASSED RUN #2" << std::endl;
std::cout << "dstTensorSingleConfigure" << std::endl;
dstTensorSingleConfigure.print(std::cout);
std::cout << "dstTensorConfigurePerRun" << std::endl;
dstTensorConfigurePerRun.print(std::cout);
srcTensor.allocator()->free();
weiTensor.allocator()->free();
dstTensorSingleConfigure.allocator()->free();
dstTensorConfigurePerRun.allocator()->free();
return 0;
}
May we ask somebody to review the oneDNN integration as well? As I mentioned, `GEMMInfo` is not initialized there, so `reshape_b_only_on_first_run` defaults to true, which should impact accuracy, since `configure` is not called per inference in oneDNN.
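If so, a hypothetical one-line fix (a sketch only, not the actual patch) would be to construct gemm_info there explicitly, the same way as in the snippet above, so that the flag is disabled:

// Sketch under the assumption above: disable reshape_b_only_on_first_run explicitly.
arm_compute::GEMMInfo gemm_info(false,  // is_a_reshaped
                                false,  // is_b_reshaped
                                false); // reshape_b_only_on_first_run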
Hi @alvoron
Please see the modified example below, where the results match exactly:
#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "tests/Utils.h"
#include "tests/AssetsLibrary.h"
#include "tests/NEON/Accessor.h"
#include "tests/validation/Validation.h"
#include <iostream>
#include <vector>
using namespace arm_compute;
using namespace arm_compute::test;
using namespace arm_compute::test::validation;
void fillTensor(Tensor& t, float min, float max) {
AssetsLibrary library(".", std::random_device()());
std::uniform_real_distribution<> dist{ min, max };
library.fill(Accessor(t), dist, 0);
}
int main(int argc, char *argv[]) {
GEMMInfo gemmInfo
(false, // is_a_reshaped
false, // is_b_reshaped,
false // reshape_b_only_on_first_run,
);
TensorInfo srcTensorInfo = TensorInfo(TensorShape(8, 1, 1), 1, DataType::F32, DataLayout::NCHW);
TensorInfo weiTensorInfo = TensorInfo(TensorShape(3, 8, 4), 1, DataType::F32, DataLayout::NCHW);
TensorInfo dstTensorInfo = TensorInfo(TensorShape(3, 1, 1), 1, DataType::F32, DataLayout::NCHW);
auto status = NEGEMM::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, 1.0f, 0.0f, gemmInfo);
if(status.error_code() != ErrorCode::OK) {
std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
exit(1);
}
std::cout << "PASSED VALIDATION" << std::endl;
Tensor srcTensor;
Tensor weiTensor;
Tensor dstTensorSingleConfigure;
Tensor dstTensorConfigurePerRun;
srcTensor.allocator()->init(srcTensorInfo);
weiTensor.allocator()->init(weiTensorInfo);
dstTensorSingleConfigure.allocator()->init(dstTensorInfo);
dstTensorConfigurePerRun.allocator()->init(dstTensorInfo);
NEGEMM gemmSingleConfigure;
NEGEMM gemmConfigurePerRun;
gemmSingleConfigure.configure(&srcTensor, &weiTensor, nullptr, &dstTensorSingleConfigure, 1.0f, 0.0f, gemmInfo);
gemmConfigurePerRun.configure(&srcTensor, &weiTensor, nullptr, &dstTensorConfigurePerRun, 1.0f, 0.0f, gemmInfo);
std::cout << "PASSED CONFIGURATION" << std::endl;
srcTensor.allocator()->allocate();
weiTensor.allocator()->allocate();
dstTensorSingleConfigure.allocator()->allocate();
dstTensorConfigurePerRun.allocator()->allocate();
fillTensor(srcTensor, 0.5f, 5.0f);
fillTensor(weiTensor, 5.0f, 10.0f);
gemmSingleConfigure.run();
gemmConfigurePerRun.run();
std::cout << "PASSED RUN #1" << std::endl;
std::cout << "dstTensorSingleConfigure" << std::endl;
dstTensorSingleConfigure.print(std::cout);
std::cout << "dstTensorConfigurePerRun" << std::endl;
dstTensorConfigurePerRun.print(std::cout);
fillTensor(srcTensor, 0.5f, 10.0f);
fillTensor(weiTensor, 5.0f, 20.0f);
gemmConfigurePerRun.configure(&srcTensor, &weiTensor, nullptr, &dstTensorConfigurePerRun, 1.0f, 0.0f, gemmInfo);
gemmSingleConfigure.run();
gemmConfigurePerRun.run();
std::cout << "PASSED RUN #2" << std::endl;
std::cout << "dstTensorSingleConfigure" << std::endl;
dstTensorSingleConfigure.print(std::cout);
std::cout << "dstTensorConfigurePerRun" << std::endl;
dstTensorConfigurePerRun.print(std::cout);
srcTensor.allocator()->free();
weiTensor.allocator()->free();
dstTensorSingleConfigure.allocator()->free();
dstTensorConfigurePerRun.allocator()->free();
return 0;
}
And the output:
g++ -O2 -g -I../ComputeLibrary -I../ComputeLibrary/include ./test.cpp -o test2 ../ComputeLibrary/build/tests/AssetsLibrary.o ../ComputeLibrary/build/tests/RawTensor.o -L../ComputeLibrary/build/tests/ -L../ComputeLibrary/build/tests/framework/ -L../ComputeLibrary/build/ -larm_compute_validation_framework -larm_compute_test_framework -larm_compute -std=c++17
[pabtel01@ampere-man-01 openvino_gemm]$ LD_LIBRARY_PATH=../ComputeLibrary/build/:$LD_LIBRARY_PATH ./test2
PASSED VALIDATION
[ComputeLibrary][24-09-2024 02:32:46][INFO] arm_compute::cpu::CpuGemm::configure() :
a: Shape=8,DataLayout=NCHW,DataType=F32
b: Shape=3,8,4,DataLayout=NCHW,DataType=F32
c: nullptr
d: Shape=3,DataLayout=NCHW,DataType=F32
alpha: 1.000000
beta: 0.000000
gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=0,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}
[ComputeLibrary][24-09-2024 02:32:46][INFO] arm_compute::cpu::CpuGemm::configure() :
a: Shape=8,DataLayout=NCHW,DataType=F32
b: Shape=3,8,4,DataLayout=NCHW,DataType=F32
c: nullptr
d: Shape=3,DataLayout=NCHW,DataType=F32
alpha: 1.000000
beta: 0.000000
gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=0,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}
PASSED CONFIGURATION
PASSED RUN #1
dstTensorSingleConfigure
142.591 118.595 120.105
dstTensorConfigurePerRun
142.591 118.595 120.105
[ComputeLibrary][24-09-2024 02:32:46][INFO] arm_compute::cpu::CpuGemm::configure() :
a: Shape=8,DataLayout=NCHW,DataType=F32
b: Shape=3,8,4,DataLayout=NCHW,DataType=F32
c: nullptr
d: Shape=3,DataLayout=NCHW,DataType=F32
alpha: 1.000000
beta: 0.000000
gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=0,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}
PASSED RUN #2
dstTensorSingleConfigure
471.933 592.889 518.01
dstTensorConfigurePerRun
471.933 592.889 518.01
> May we ask somebody to review the oneDNN integration as well?
> As I mentioned, `GEMMInfo` is not initialized there, so `reshape_b_only_on_first_run` defaults to true, which should impact accuracy, since `configure` is not called per inference in oneDNN.
We'll have a look at the code in oneDNN.