jj1bdx/airspy-fmradion

VOLK (libvolk) optimization

jj1bdx opened this issue · 11 comments

More code optimization to fully utilize VOLK (libvolk), not only for ARM NEON, but also for x86.
See #10

VOLK (libvolk) has the following issues:

@bstalk Let us know if you have any tips on using libvolk more effectively for airspy-fmradion. Thx in advance!

libvolk does not have a test for volk_32f_s32f_32f_fm_detect_32f yet. volk_profile does not generate the output for this function.

Maybe the following line needs to be added to lib/kernel_tests.h (is this a bug?):
QA(VOLK_INIT_TEST(volk_32f_s32f_32f_fm_detect_32f, test_params))
Then rebuild, test, and reinstall:
cmake ..
make
make test
sudo make install

The following is part of the volk_profile output after patching:
RUN_VOLK_TESTS: volk_32f_s32f_32f_fm_detect_32f(131071,1987)
a_avx completed in 104.64 ms
a_sse completed in 120.689 ms
generic completed in 396.686 ms
u_avx completed in 107.013 ms
Best aligned arch: a_avx
Best unaligned arch: u_avx
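
To put the timings above in perspective, the ratio of the generic time to each optimized time gives the speedup. The sketch below only does that arithmetic on the numbers quoted in this comment; the function names are hypothetical and not part of VOLK.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of an optimized implementation relative to the generic one,
// computed from two elapsed times given in the same unit.
double speedup(double generic_ms, double optimized_ms) {
    assert(optimized_ms > 0.0);
    return generic_ms / optimized_ms;
}

// Print the ratios for the volk_profile timings quoted above.
void print_fm_detect_speedups() {
    const double generic_ms = 396.686;
    std::printf("a_avx: %.2fx\n", speedup(generic_ms, 104.64));   // a_avx: 3.79x
    std::printf("a_sse: %.2fx\n", speedup(generic_ms, 120.689));  // a_sse: 3.29x
    std::printf("u_avx: %.2fx\n", speedup(generic_ms, 107.013));  // u_avx: 3.71x
}
```

On this machine the AVX variants are roughly 3.7 to 3.8 times faster than the generic code, and the aligned and unaligned AVX timings differ by only about 2%.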

@bstalk Thx for the testing result of volk_32f_s32f_32f_fm_detect_32f.
Maybe we need to test the performance increase by adding volk_32f_s32f_32f_fm_detect_32f a_avx u_avx to ~/.volk/volk_config for the x86_64 platforms.

@bstalk
I've found the entry in volk_config for volk_32f_s32f_32f_fm_detect_32f, so I guess the function is already activated (with AVX for x86).
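
A quick way to check this programmatically is to scan the config file for the kernel name. This is a minimal sketch that assumes each line of volk_config holds three whitespace-separated fields (kernel name, aligned implementation, unaligned implementation), as in the entry quoted earlier in this thread; VolkConfigEntry and find_kernel are hypothetical helpers, not part of libvolk.

```cpp
#include <cassert>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Implementations selected for one kernel (assumed three-field line format).
struct VolkConfigEntry {
    std::string aligned;
    std::string unaligned;
};

// Return the entry for `kernel`, or std::nullopt if no line names it.
// Lines with fewer than three fields are skipped.
std::optional<VolkConfigEntry> find_kernel(std::istream& config,
                                           const std::string& kernel) {
    assert(!kernel.empty());
    std::string line;
    while (std::getline(config, line)) {
        std::istringstream fields(line);
        std::string name, aligned, unaligned;
        if ((fields >> name >> aligned >> unaligned) && name == kernel) {
            return VolkConfigEntry{aligned, unaligned};
        }
    }
    return std::nullopt;
}
```

Passing a std::ifstream opened on ~/.volk/volk_config (with the home directory expanded by the caller) and the kernel name volk_32f_s32f_32f_fm_detect_32f would report which implementations are selected, if the format assumption holds. Requires C++17 for std::optional.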

I've also noticed the following line in lib/kernel_tests.h, but I don't know what it really does:

QA(VOLK_INIT_PUPP(volk_32f_x2_fm_detectpuppet_32f, volk_32f_s32f_32f_fm_detect_32f, test_params))

As I read the source:
test_script -> volk_32f_x2_fm_detectpuppet_32f.sh ... puppet func
volk_config -> volk_32f_s32f_32f_fm_detect_32f ... master_func name of puppet.
Sorry, I don't have any further information right now.

Tips for optimizing for libvolk:

  • libvolk doesn't really have 64-bit functions, so focus first on the 32f and 32fc functions.
  • Profile each function first by running volk_profile -b, and give lower priority to the functions that show little speed difference between the generic and optimized (AVX, SSE, NEON) implementations.
  • Use volk::vector instead of std::vector for private class members that are passed to libvolk functions. Note, however, that volk::vector does not have a move constructor, so it won't work for the *Source drivers.
  • Always check the integrity of the result when the source and destination of a function point to the same memory address (the same vector), at least by referring to the generic implementation.
  • If a set of operations requires temporary storage in a std::vector or volk::vector, the speed gain will be limited.
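
The in-place tip above can be sketched as a small check that runs a kernel once with separate buffers and once with the source and destination aliased, then compares the results. The two kernels here are hypothetical stand-ins written for illustration, not libvolk functions and not VOLK's actual generic code; naive_diff is included because a consecutive-sample difference of the kind an FM detector computes is a typical case where naive aliasing goes wrong.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise stand-in kernel: each output depends only on the same index.
void scale_by_two(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * 2.0f;
    }
}

// Difference stand-in kernel: out[i] = in[i] - in[i-1], with in[-1] taken as 0.
// When out == in, in[i-1] has already been overwritten by the time it is read.
void naive_diff(float* out, const float* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float previous = (i == 0) ? 0.0f : in[i - 1];
        out[i] = in[i] - previous;
    }
}

// Run `kernel` out-of-place and in-place on the same input and report
// whether both runs produce identical output.
template <typename Kernel>
bool in_place_matches(Kernel kernel, const std::vector<float>& input) {
    assert(!input.empty());
    std::vector<float> expected(input.size());
    kernel(expected.data(), input.data(), input.size());

    std::vector<float> aliased = input;
    kernel(aliased.data(), aliased.data(), aliased.size());
    return aliased == expected;
}
```

With the input {1, 3, 6, 10}, in_place_matches(scale_by_two, input) returns true, while in_place_matches(naive_diff, input) returns false: the out-of-place result is {1, 2, 3, 4}, but the aliased run yields {1, 2, 4, 6}. The same harness could wrap a real libvolk call in a lambda, though whether a given kernel tolerates aliasing still has to be confirmed against its generic implementation.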

Note: I've also reviewed libsoxr and found that it uses 32- or 64-bit optimized instructions by default, at least on macOS, so I guess no further review of the optimization is required for libsoxr.

VOLK works OK with Xcode 11.3 CLT running on macOS.

@jj1bdx I guess I can conclude that the code quality of the VOLK support has stabilized, so I can close this issue.