AnyDSL/stincilla

Performance Regression?

jachris opened this issue · 2 comments

I want to compare AnyDSL and Halide to another approach. However, there seem to be some performance regressions. I initially noticed this on an AnyDSL setup from a few months ago, but the problem is also present when running the newest version (as of today).

Specifically, I am currently running the halide/blur.impala benchmark on a dedicated server with an Intel i9-10900K. The blur benchmark (with 2000 iterations) yields

read_pnm_image('lena.pgm', P5, 2048x2048 [0..213]|255|)
Timing: 3.059 | 2.802 | 7.274 (median(2000) | minimum | maximum) ms
Total timing for cpu / kernel: 3.059 / 0 ms
write_pnm_image('lena_out.pgm', P2, 2048x2048 [0..210])
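The timing line above reports the median, minimum, and maximum over 2000 runs. As a rough illustration of how such statistics are gathered, here is a minimal sketch. The `benchmark` helper is hypothetical and only mirrors the shape of the output; it is not Stincilla's actual timing code.

```python
import statistics
import time


def benchmark(kernel, iterations=2000):
    """Run `kernel` repeatedly; return (median, minimum, maximum) in ms.

    Hypothetical helper that mimics the "median | minimum | maximum"
    report format; it is not taken from Stincilla.
    """
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        kernel()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples_ms), min(samples_ms), max(samples_ms)


if __name__ == "__main__":
    data = list(range(100_000))
    med, lo, hi = benchmark(lambda: sum(data), iterations=200)
    print(f"Timing: {med:.3f} | {lo:.3f} | {hi:.3f} "
          f"(median(200) | minimum | maximum) ms")
```

The median is the more useful figure for comparisons because it ignores outliers such as the 7.274 ms maximum, which is typically caused by warm-up or scheduling noise.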

This translates to about 1.4 Gpixel/s, significantly lower than the 1.9 Gpixel/s reported in this paper, even though my machine is newer. Benchmarking Halide with the same blur, schedule, and image yields

min = 0.864433ms
max = 1.99026ms
med = 0.900315ms
avg = 0.913448ms

which is about 4.6 Gpixel/s. Halide may have improved in the meantime, but I still find a gap this large surprising. My own approach processes around 3.0 Gpixel/s, even though it is not (yet) optimally vectorized. So it seems to me that something is going wrong in the compilation of the Stincilla code.
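The Gpixel/s figures follow from dividing the pixel count of one 2048x2048 frame by the time per frame. A minimal sketch of that conversion, using only the timings quoted above (the function name is illustrative):

```python
def gpixels_per_second(width, height, time_ms):
    """Throughput in Gpixel/s for one width x height frame taking time_ms."""
    pixels = width * height
    return pixels / (time_ms * 1e-3) / 1e9


if __name__ == "__main__":
    anydsl_med = gpixels_per_second(2048, 2048, 3.059)     # AnyDSL median
    halide_med = gpixels_per_second(2048, 2048, 0.900315)  # Halide median
    halide_avg = gpixels_per_second(2048, 2048, 0.913448)  # Halide average
    print(f"AnyDSL (median): {anydsl_med:.2f} Gpixel/s")  # 1.37
    print(f"Halide (median): {halide_med:.2f} Gpixel/s")  # 4.66
    print(f"Halide (avg):    {halide_avg:.2f} Gpixel/s")  # 4.59
    print(f"Median ratio:    {3.059 / 0.900315:.2f}x")    # 3.40
```

The "about 4.6 Gpixel/s" figure matches the Halide average; using the median instead gives roughly 4.66 Gpixel/s, or about 3.4x the AnyDSL median.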

Here is the log produced by Make, which shows that I'm compiling with optimizations:

cd /home/user/anydsl/stincilla/build && make -f CMakeFiles/Makefile2 halide/CMakeFiles/blur.dir/rule
make[1]: Entering directory '/home/user/anydsl/stincilla/build'
/usr/bin/cmake -S/home/user/anydsl/stincilla -B/home/user/anydsl/stincilla/build --check-build-system CMakeFiles/Makefile.cmake 0
Re-run cmake file: Makefile older than: CMakeCache.txt
-- Found llvm-as: /home/user/anydsl/llvm_install/bin/llvm-as
-- Note: llvm-as version needs to match the required LLVM bitcode version from NVVM.
-- Found opt: /home/user/anydsl/llvm_install/bin/opt
-- AnyDSL debug flags (Debug): -g
-- AnyDSL release flags (Release): -O3
-- Selected backend: avx
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/anydsl/stincilla/build
/usr/bin/cmake -E cmake_progress_start /home/user/anydsl/stincilla/build/CMakeFiles 4
make -f CMakeFiles/Makefile2 halide/CMakeFiles/blur.dir/all
make[2]: Entering directory '/home/user/anydsl/stincilla/build'
make -f halide/CMakeFiles/blur.dir/build.make halide/CMakeFiles/blur.dir/depend
make[3]: Entering directory '/home/user/anydsl/stincilla/build'
[ 25%] Generating blur.ll
cd /home/user/anydsl/stincilla/build/halide && /home/user/anydsl/artic/build/bin/artic /home/user/anydsl/runtime/platforms/artic/intrinsics.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_rv.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_cpu.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_hls.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_cuda.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_nvvm.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_amdgpu.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_opencl.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_thorin.impala /home/user/anydsl/runtime/platforms/artic/runtime.impala /home/user/anydsl/runtime/platforms/artic/intrinsics_math.impala /home/user/anydsl/stincilla/backend_avx.impala /home/user/anydsl/stincilla/utils.impala /home/user/anydsl/stincilla/halide/dummy_gpu.impala /home/user/anydsl/stincilla/halide/schedule.impala /home/user/anydsl/stincilla/halide/blur.impala --log-level info --max-errors 5 -O3 --emit-llvm -o /home/user/anydsl/stincilla/build/halide/./blur
cd /home/user/anydsl/stincilla/build/halide && /usr/bin/python3.8 /home/user/anydsl/runtime/post-patcher.py /home/user/anydsl/stincilla/build/halide/./blur
cd /home/user/anydsl/stincilla/build/halide && /usr/bin/cmake -D_basename=blur -DLLVM_AS_BIN=/home/user/anydsl/llvm_install/bin/llvm-as -P /home/user/anydsl/runtime/cmake/check_nvvmir.cmake
[ 50%] Generating blur.o
cd /home/user/anydsl/stincilla/build/halide && /home/user/anydsl/llvm_install/bin/clang-12 -march=native -O3 -fPIE -c -o /home/user/anydsl/stincilla/build/halide/./blur.o /home/user/anydsl/stincilla/build/halide/./blur.ll
cd /home/user/anydsl/stincilla/build && /usr/bin/cmake -E cmake_depends "Unix Makefiles" /home/user/anydsl/stincilla /home/user/anydsl/stincilla/halide /home/user/anydsl/stincilla/build /home/user/anydsl/stincilla/build/halide /home/user/anydsl/stincilla/build/halide/CMakeFiles/blur.dir/DependInfo.cmake --color=
make[3]: Leaving directory '/home/user/anydsl/stincilla/build'
make -f halide/CMakeFiles/blur.dir/build.make halide/CMakeFiles/blur.dir/build
make[3]: Entering directory '/home/user/anydsl/stincilla/build'
[ 75%] Linking CXX executable blur
cd /home/user/anydsl/stincilla/build/halide && /usr/bin/cmake -E cmake_link_script CMakeFiles/blur.dir/link.txt --verbose=1
/usr/bin/c++  -O3 -DNDEBUG   CMakeFiles/blur.dir/main.cpp.o blur.o  -o blur  -Wl,-rpath,/home/user/anydsl/runtime/build/lib:/home/user/anydsl/llvm_install/lib /home/user/anydsl/runtime/build/lib/libruntime.so -Wl,-rpath-link,/home/user/anydsl/llvm_install/lib
make[3]: Leaving directory '/home/user/anydsl/stincilla/build'
[100%] Built target blur
make[2]: Leaving directory '/home/user/anydsl/stincilla/build'
/usr/bin/cmake -E cmake_progress_start /home/user/anydsl/stincilla/build/CMakeFiles 0
make[1]: Leaving directory '/home/user/anydsl/stincilla/build'

Here is the generated LLVM IR: blur.ll

It would be great if you could help me find the issue here. I know that there is not much development on Stincilla anymore, but this might also affect other projects using the AnyDSL framework (or not, I don't know).

Thank you.

Did you figure out what caused this?

Part of this was because CMake was not able to find TBB (perhaps the build should emit a warning in that case). Still, my latest measurements indicate that Halide is 9% faster for 4096x4096 images (49% faster for 2048x2048, 25% slower for 8192x8192). However, this could also be because I am using a much more recent Halide version than the one used in the original AnyDSL paper. Thus, I am no longer sure whether there actually is a performance regression.
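Statements like "X% faster" depend on the convention used. A minimal sketch, assuming "X% faster" means the ratio of the two times minus one; the function name and the inputs in the test are illustrative, not measurements from this thread:

```python
def percent_faster(t_other_ms, t_ref_ms):
    """How much faster the reference is than `other`, in percent.

    Assumes "X% faster" means t_other / t_ref - 1 (a speedup ratio).
    A negative result means the reference is slower. Other conventions,
    such as (t_other - t_ref) / t_other, give different numbers.
    """
    return (t_other_ms / t_ref_ms - 1.0) * 100.0
```

For example, under this convention a reference taking 1.00 ms against 1.09 ms for the other implementation is 9% faster, while a reference taking 1.25 ms against 1.00 ms yields a negative value, i.e. it is slower.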