Depthwise convolution fp16 performance drop
alvoron opened this issue · 4 comments
Output of 'strings libarm_compute.so | grep arm_compute_version':
arm_compute_version=v24.04 Build options: {'neon': '1', 'opencl': '0', 'openmp': '0', 'cppthreads': '1', 'os': 'linux', 'data_layout_support': 'all', 'arch': 'arm64-v8.2-a', 'build': 'native', 'fixed_format_kernels': 'True'} Git hash=b'4fda7a803eaadf00ba36bd532481a33c18952089'
Platform:
Ampere
Operating System:
Ubuntu 22.04.4 LTS
Problem description:
In some cases fp16 convolution takes more time than the same fp32 convolution:
f16 benchdnn reproducer
benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2
f32 benchdnn reproducer (the same set of arguments apart from --dt and --max-ms-per-prb)
benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g1152mb1_ic1152oc1152_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2
f16 benchdnn command gives me 0.074-0.079 ms.
f32 benchdnn command gives me 0.045-0.047 ms.
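For a rough sense of scale, the slowdown implied by the timings above can be worked out directly. This is only a sketch: `ratio` is a hypothetical helper name, and the inputs are the range endpoints quoted in this report, not new measurements.

```shell
# ratio A B: print A / B rounded to two decimals (hypothetical helper)
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Fastest f16 run against slowest f32 run (most favourable to f16)
ratio 0.074 0.047   # prints 1.57
# Slowest f16 run against fastest f32 run (least favourable to f16)
ratio 0.079 0.045   # prints 1.76
```

So, on these numbers, f16 is roughly 1.6x to 1.8x slower than f32 for this shape.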
Another reproducer
f16 (avg 0.037 ms)
benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1
f32 (avg 0.031 ms)
benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1
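To keep the f16 and f32 runs strictly comparable, the command line can be generated from one template so that only the data type changes. A minimal sketch, assuming benchdnn is on PATH; `bench_cmd` and `PRB` are hypothetical names, and every flag is copied from the reproducers above.

```shell
# bench_cmd DT PRB: print a benchdnn command that differs only in --dt
bench_cmd() {
  # $1: data type (f16 or f32), $2: benchdnn problem descriptor
  echo "benchdnn --max-ms-per-prb=3e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_B --alg=direct --dt=$1:$1:$1 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user $2"
}

PRB=g480mb1_ic480oc480_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1
for dt in f16 f32; do
  bench_cmd "$dt" "$PRB"   # prints the command; pipe the output to sh to run it
done
```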
Hi @alvoron
I've noticed your build is using cppthreads=1 openmp=0. I'd suggest you change to use cppthreads=0 openmp=1. Can you please try this to see if it helps?
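As a sketch of what that rebuild might look like: the options below are taken from the build string in the report, with only the scheduler settings swapped. `acl_build_cmd`, `ACL_OPTS` and `JOBS` are hypothetical names, and the command is assumed to be run from the Compute Library source root.

```shell
# Build options from the report, with openmp enabled and cppthreads disabled
ACL_OPTS="neon=1 opencl=0 openmp=1 cppthreads=0 os=linux data_layout_support=all arch=arm64-v8.2-a build=native fixed_format_kernels=1"

# Print the scons command line; JOBS defaults to 4 parallel jobs (an assumption)
acl_build_cmd() {
  echo "scons -j${JOBS:-4} $ACL_OPTS"
}

acl_build_cmd
# To actually build: eval "$(acl_build_cmd)"
```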
For some reason I can't reproduce my initial results. Now I have the following results:
OMP:
f16 results: 0.067 ms
f32 results: 0.126 ms
cppthreads:
f16 results: 0.076 ms
f32 results: 0.135 ms
TBB:
f16 results: 0.136 ms
f32 results: 0.191 ms
Please give me some time to try reproducing this issue (if there is one) again.
I think I'll close this ticket for now. I was able to reproduce this issue with benchdnn, but it is related to TBB: for some reason TBB produces latency spikes for f16 (latency is 10-15 times higher than the usual value).
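One way to spot such spikes in a series of per-run latencies is to flag every sample that is at least 10 times the median. This is a minimal sketch, not the method used in this thread: `find_spikes` is a hypothetical helper, the default factor of 10 comes from the lower bound quoted above, and the input is assumed to be one latency in ms per line.

```shell
# find_spikes [FACTOR]: read latencies from stdin, print those >= FACTOR * median
find_spikes() {
  sort -n | awk -v factor="${1:-10}" '
    { v[NR] = $1 }                      # input arrives sorted, so v[] is ordered
    END {
      if (NR == 0) exit
      # median of the sorted samples (mean of the two middle values if NR is even)
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      for (i = 1; i <= NR; i++)
        if (v[i] >= factor * median) print v[i]
    }'
}

# Usage, with latencies.txt standing in for collected per-run timings:
#   find_spikes 10 < latencies.txt
```

Using the median rather than the mean keeps the baseline from being pulled up by the spikes themselves.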