This is a collection of functions optimized for ARMv7 and NEON.
- Never return floating point values by value. It would work fine if `-mfloat-abi=hard` was supported everywhere, but sadly it's not. With the more common `-mfloat-abi=softfp`, every time you do a `return my_float_value`, the compiler emits either an `fmrs` or a `vstr`, followed by a load operation in order to read the result back! Instead, use a non-const reference as first parameter. It allows super smooth inlining of your intermediate results without unnecessary loads and stores, just like it would do if hard floats were available (works for vector types too)!
- Try to minimize loads and stores. GCC doesn't support the more advanced forms of `vldmia`/`vstmia`, though, and will generate poor code for operations on `float32x4x4_t`, so hand-coding them makes sense in that case.
- Use vector types everywhere it makes sense. Functions prefixed with `vec3_` and `vec4_` work directly on `float32x4_t`. Those prefixed with `mat44_` work directly on `float32x4x4_t`. Parameters are passed as references, so the compiler doesn't perform unnecessary ARM register transfers.
- Don't hard-code registers. Use dummy variables instead of clobbered registers, and let the compiler allocate registers as needed.
- A good clobber list is an empty clobber list. If you let the compiler handle loads for you, "memory" shouldn't even show up in your clobber list. The only item that might is "cc".
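
As a rough illustration of the first guideline (write the result through a non-const reference instead of returning a float), here is a minimal C++ sketch. The names `dot3` and `sum_of_squared_lengths` are hypothetical and not taken from this library, and plain `float` arrays stand in for the NEON vector types so that the snippet builds anywhere.

```cpp
// Hypothetical helpers for illustration only; not part of this library.

// The result goes out through a non-const reference in first position,
// so no float is ever returned by value.
static inline void dot3(float &result, const float (&a)[3], const float (&b)[3])
{
    result = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Intermediate results are plain locals, which the compiler is free to keep
// in registers once dot3() has been inlined.
static inline void sum_of_squared_lengths(float &result,
                                          const float (&a)[3],
                                          const float (&b)[3])
{
    float aa, bb;
    dot3(aa, a, a);
    dot3(bb, b, b);
    result = aa + bb;
}
```

The same shape should carry over to vector types: the output reference comes first, the inputs are const references, and nothing is returned.
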
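
The last two guidelines (no hard-coded registers, empty clobber list) might look like the sketch below. It is only an assumption of how such a function could be written, not code from this library: `vec4_sketch`, `vec4_madd_sketch` and `SKETCH_USE_NEON_ASM` are made-up names. The scratch register is a dummy output operand chosen by the compiler, using GCC's `w` constraint and `%q` operand modifier for ARM. A portable fallback based on the GCC/Clang vector extension is included only so that the snippet also builds on non-ARMv7 targets.

```cpp
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
#include <arm_neon.h>
typedef float32x4_t vec4_sketch;
#define SKETCH_USE_NEON_ASM 1
#else
// Portable stand-in (GCC/Clang vector extension) for non-ARMv7 builds.
typedef float vec4_sketch __attribute__((vector_size(16)));
#endif

// Hypothetical function: r = a * b + c, element-wise.
static inline void vec4_madd_sketch(vec4_sketch &r, const vec4_sketch &a,
                                    const vec4_sketch &b, const vec4_sketch &c)
{
#ifdef SKETCH_USE_NEON_ASM
    vec4_sketch tmp; // dummy output: the compiler picks the scratch register
    __asm__("vmul.f32 %q0, %q2, %q3\n\t"
            "vadd.f32 %q1, %q0, %q4"
            : "=&w"(tmp), "=w"(r)       // "&": tmp is written before c is read
            : "w"(a), "w"(b), "w"(c));  // no clobber list: nothing is hard-coded
#else
    r = a * b + c; // same arithmetic without inline assembly
#endif
}
```

Because every register is an operand, the compiler handles the loads and the final store to `r` itself, so neither a register name nor "memory" needs to appear in the clobber list.
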
For best performance I usually use the following CFLAGS: `-mthumb -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -O3 -ffast-math -fomit-frame-pointer -fstrict-aliasing -fgcse-las -funsafe-loop-optimizations -fsee -ftree-vectorize`, with `-arch armv7` if it's gcc for iOS or `-march=armv7-a` if it's eabi-none-gcc.
Several preprocessor macros, when defined, change the behaviour of the code. See `config.h` and `config-defaults.h` for details…