Regarding Winograd Convolution accuracy
Closed this issue · 19 comments
Hello,
- Could you point me to the source files where the Winograd convolution is implemented?
- Also, is there any scope/plan for improving the accuracy of the Winograd convolution?
Hi @vineel96
CpuWinogradConv2d
is the high-level operator that drives the computation; you can see the implementation in this file.
The lower-level kernels are implemented in the files in the folder
For the GPU backend, the high-level operator is ClWinogradConv2d and the kernels are:
ClWinogradFilterTransformKernel
ClWinogradInputTransformKernel
ClWinogradOutputTransformKernel
The actual OpenCL kernels are in:
winograd_filter_transform
winograd_input_transform
winograd_output_transform
Do you have a use case where the accuracy is not good enough?
Hope this helps
Hello @morgolock ,
Thanks for the reply.
If we use Winograd convolution from the oneDNN library, several test cases fail.
Below are two such failures from the oneDNN logs:
Here the shape is 1x256x8x25; exp_f32 is the expected value and got is the obtained output. diff is the difference between the expected and obtained values, i.e. the inaccuracy. Errors: 48652 out of a total of 51200 in the input.
Example 1:
- Is there any reason why these inaccurate values are obtained? Can we point out which part of the algorithm is causing this error, or does it come from the implementation?
- Also, is there any scope for improving the accuracy in these cases?
This looks like a test failure in oneDNN's implementation of Winograd when dispatched to ACL. Can you please let me know how you built oneDNN and ACL, share the test command you run to see this issue, and confirm the versions (or commit hashes) of these libraries? Does this happen with the latest main commit of oneDNN and the latest ACL tag? Also, please let me know what system you are running on.
For the implementation in oneDNN, you can see:
src/cpu/aarch64/acl_winograd_convolution.hpp
src/cpu/aarch64/acl_winograd_convolution.cpp
For the implementation in ACL, you can see:
arm_compute/runtime/experimental/operators/CpuWinogradConv2d.h
src/runtime/experimental/operators/CpuWinogradConv2d.cpp
Hi @theComputeKid,
Thanks for the reply.
oneDNN version: v3.4.4
ACL version: v23.11
System: AWS Graviton instance
I didn't test with the latest ACL and oneDNN versions.
oneDNN build instructions:
- git clone https://github.com/oneapi-src/oneDNN.git
- cd oneDNN && git checkout v3.4.4 && mkdir build && cd build
- export ACL_ROOT_DIR=/pathto/ComputeLibrary
- cmake .. -DDNNL_AARCH64_USE_ACL=ON
- make -j
ACL build instructions:
- git clone https://github.com/ARM-software/ComputeLibrary.git
- cd ComputeLibrary
- git checkout v23.11
- scons arch=armv8.6-a-sve neon=1 os=linux opencl=0 build=native -j 32 Werror=false validation_tests=0 fixed_format_kernels=1 multi_isa=1 openmp=1 cppthreads=0
Command to get the above output:
export ONEDNN_VERBOSE=1
./benchdnn --conv --dir=FWD_D --batch=inputs/conv/test_conv_all
Is the accuracy issue due to the implementation in oneDNN, the implementation in ACL, or the algorithm itself?
Thank you. You have provided a batch call to benchdnn, which runs all tests in the convolution suite. Can you please paste the specific benchdnn test case that fails? The failure line will contain a "REPRO" (you can see it in your image at the bottom).
@theComputeKid,
the commands are:
1.) ./benchdnn --conv --skip-impl=ref --allow-enum-tags-only=false --dir=FWD_I --dt=f16:f16:f16 --attr-post-ops=relu:0.271:0.314 mb1ic256ih8iw25oc256oh8ow25kh3kw3ph1pw1n"88410371d538e3c30cb1f077d487f7715&f8b6fef926b401fccb8ef229fcd9e8635"
2.) ./benchdnn --conv --skip-impl=ref --allow-enum-tags-only=false --dir=FWD_I --dt=f16:f16:f16 --attr-post-ops=relu:0.271:0.314 mb1ic128ih8iw25oc256oh8ow25kh3kw3ph1pw1n"7d634e83b16776944f6630d5d375a2725&f361b13cd148c1f4a0909e579777ddeb5"
Thanks. I tried your repro on both the latest ACL/oneDNN and on ACL v23.11 with oneDNN v3.4.4 using your build commands, and I notice in both cases that Winograd convolution is not called here; also, both tests pass:
ONEDNN_VERBOSE=1 ./benchdnn --conv --skip-impl=ref --allow-enum-tags-only=false --dir=FWD_I --dt=f16:f16:f16 --attr-post-ops=relu:0.271:0.314 mb1ic128ih8iw25oc256oh8ow25kh3kw3ph1pw1n"7d634e83b16776944f6630d5d375a2725&f361b13cd148c1f4a0909e579777ddeb5"
onednn_verbose,info,oneDNN v3.4.4 (commit b2b7605d62f5316b436073952fff84d55e49efc6)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:acdb::f0,,,1x256x8x25,2.93896
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:Acdb16a::f0,,,256x128x3x3,0.226074
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:Acdb8a::f0,,,256x128x3x3,0.0419922
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:acdb::f0,,,1x128x8x25,0.0568848
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:acdb::f0,,,1x128x8x25,0.0239258
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f16:a:blocked:acdb::f0 wei_f16:a:blocked:Acdb16a::f0 bia_undef::undef::: dst_f16:a:blocked:acdb::f0,attr-post-ops:eltwise_relu:0.271:0.314 ,alg:convolution_direct,mb1_ic128oc256_ih8oh8kh3sh1dh0ph1_iw25ow25kw3sw1dw0pw1,0.617188
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb8a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-post-ops:eltwise_relu:0.271:0.314 ,alg:convolution_direct,mb1_ic128oc256_ih8oh8kh3sh1dh0ph1_iw25ow25kw3sw1dw0pw1,0.705811
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f16::blocked:acdb::f0 dst_f32::blocked:abcd::f0,,,1x256x8x25,0.0529785
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:acdb::f0 dst_f32::blocked:abcd::f0,,,1x256x8x25,0.0410156
0:PASSED __REPRO: --conv --skip-impl=ref --allow-enum-tags-only=false --dir=FWD_I --dt=f16:f16:f16 --attr-post-ops=relu:0.271:0.314 mb1ic128ih8iw25oc256oh8ow25kh3kw3ph1pw1n7d634e83b16776944f6630d5d375a2725&f361b13cd148c1f4a0909e579777ddeb5
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.01s; fill: 0.01s (53%); compute_ref: 0.00s (7%); compare: 0.00s (4%);
And for the other one:
ONEDNN_VERBOSE=1 ./benchdnn --conv --skip-impl=ref --allow-enum-tags-only=false --dir=FWD_I --dt=f16:f16:f16 --attr-post-ops=relu:0.271:0.314 mb1ic256ih8iw25oc256oh8ow25kh3kw3ph1pw1n"88410371d538e3c30cb1f077d487f7715&f8b6fef926b401fccb8ef229fcd9e8635"
onednn_verbose,info,oneDNN v3.4.4 (commit b2b7605d62f5316b436073952fff84d55e49efc6)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:acdb::f0,,,1x256x8x25,3.13696
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:Acdb16a::f0,,,256x256x3x3,0.528076
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:Acdb8a::f0,,,256x256x3x3,0.0500488
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f16::blocked:acdb::f0,,,1x256x8x25,0.0432129
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:acdb::f0,,,1x256x8x25,0.0209961
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f16:a:blocked:acdb::f0 wei_f16:a:blocked:Acdb16a::f0 bia_undef::undef::: dst_f16:a:blocked:acdb::f0,attr-post-ops:eltwise_relu:0.271:0.314 ,alg:convolution_direct,mb1_ic256oc256_ih8oh8kh3sh1dh0ph1_iw25ow25kw3sw1dw0pw1,0.72998
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb8a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-post-ops:eltwise_relu:0.271:0.314 ,alg:convolution_direct,mb1_ic256oc256_ih8oh8kh3sh1dh0ph1_iw25ow25kw3sw1dw0pw1,0.36499
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f16::blocked:acdb::f0 dst_f32::blocked:abcd::f0,,,1x256x8x25,0.0759277
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:acdb::f0 dst_f32::blocked:abcd::f0,,,1x256x8x25,0.0429688
0:PASSED __REPRO: --conv --skip-impl=ref --allow-enum-tags-only=false --dir=FWD_I --dt=f16:f16:f16 --attr-post-ops=relu:0.271:0.314 mb1ic256ih8iw25oc256oh8ow25kh3kw3ph1pw1n88410371d538e3c30cb1f077d487f7715&f8b6fef926b401fccb8ef229fcd9e8635
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.01s; fill: 0.01s (54%); compute_ref: 0.00s (4%); compare: 0.00s (6%);
This also passes in our testing pipelines. Can you please share your equivalent log with me so I can compare?
@theComputeKid ,
you need to force dispatch to the ACL Winograd path; for that you need to make some changes in the following files.
These changes are against oneDNN v3.4.4; check out that version and apply them:
- In oneDNN/src/cpu/cpu_convolution_list.cpp, comment out all CPU_INSTANCE_IMPL entries that appear above CPU_INSTANCE_AARCH64_ACL(acl_wino_convolution_fwd_t) in the file (search for all their occurrences).
- In oneDNN/src/cpu/aarch64/acl_winograd_convolution.hpp, comment out the check at line 87 and add the replacement in the same place:

```cpp
// Original check, commented out:
// bool ok = is_fwd()
//         && utils::one_of(desc()->alg_kind,
//                 alg_kind::convolution_auto,
//                 alg_kind::convolution_winograd)
//         && utils::one_of(true, is_fp16_ok, is_fp32_ok)
//         && !has_zero_dim_memory();

// Replacement: the alg_kind check is dropped so the Winograd
// implementation is eligible regardless of the requested algorithm.
bool ok = is_fwd()
        && utils::one_of(true, is_fp16_ok, is_fp32_ok)
        && !has_zero_dim_memory();
```
Now check the logs for wino:acl.
Is it passing with wino:acl in your pipeline, i.e. is that the ACL CI on GitHub?
Thank you. I'll investigate further before getting back to you.
In the meantime, can you please let me know the use case that requires you to override oneDNN's algorithm selection and force Winograd, rather than using whatever algorithm oneDNN determines is best?
@theComputeKid,
Since Winograd is faster compared to other algorithms, we thought of using it instead of the others for faster execution.
> Since winograd is faster compared to other algorithms, thought of using it instead of others for faster executions.
That's not always the case; it depends on the configuration of the workload. In ACL we have a heuristic in CpuConv2d to select the best convolution method for the given workload. For most workloads the best option is to use CpuConv2d and let it do its work.
Hope this helps
@morgolock and @theComputeKid,
But there are some shapes for which Winograd (when we force it) is faster than the normal convolution (gemm:acl). The only issue we see is the accuracy mismatch in Winograd. This is why we thought of using Winograd convolution in the cases where it is faster, but the accuracy mismatch is observed.
Hi @vineel96
> But there are some shapes for which winograd is faster(when we force winograd) than normal convolution (gemm acl). But the only issue we see is accuracy mismatch in winograd.
How much faster? Could you please provide more detail on the actual speedup we are talking about?
We find that the accuracy of our winograd implementation is good for all major use cases.
Hope this helps
@morgolock,
I observed that:
- If I run the benchdnn command without --alg=WINO, I get an accuracy mismatch, and I find this in the log:
onednn_verbose,primitive,exec,cpu,convolution,wino:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb8a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-post-ops:eltwise_relu:0.271:0.314 ,alg:convolution_direct,mb1_ic128oc256_ih8oh8kh3sh1dh0ph1_iw25ow25kw3sw1dw0pw1,0.705811
Here the algorithm is convolution_direct and the primitive is wino:acl.
- If I run the benchdnn command with --alg=WINO, I don't get any accuracy mismatch, and I find this in the log:
onednn_verbose,primitive,exec,cpu,convolution,wino:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb8a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-post-ops:eltwise_relu:0.271:0.314 ,alg:convolution_winograd,mb1_ic128oc256_ih8oh8kh3sh1dh0ph1_iw25ow25kw3sw1dw0pw1,0.705811
Here the algorithm is convolution_winograd and the primitive is wino:acl, so according to the log the Winograd convolution algorithm is called. But in case 1 it goes into convolution_direct with wino:acl as the primitive, which might be the reason for the accuracy issue.
Hi @vineel96
Thanks. In both cases above we see 0.705811. Is this so? It seems there is no significant speedup.
@morgolock,
While logging I didn't focus on timing; I didn't run on a dedicated performance machine and other processes were running, so it is not a fair comparison.
Coming to accuracy issue:
- When I use --alg=WINO, I see that a threshold of 2e-5 is set for the accuracy check in this file (https://github.com/oneapi-src/oneDNN/blob/main/tests/benchdnn/conv/conv.cpp#L429). This is the reason the test case passes even though it gives differing outputs.
- When I don't use --alg=WINO, no such threshold is set, which is the reason the test case fails on the output mismatch.
Are you saying that 2e-5 is not enough accuracy for Winograd for fp32?
@theComputeKid ,
It is OK for some cases, but there are even more cases where the error is at the second decimal place (like 0.0x or 0.x), and almost all values in the matrix show this much error.
Closing this as our implementation passes all major test suites. We think the accuracy in Winograd is good enough.