Slowdown on several vector variations
TheTryton opened this issue · 2 comments
[MSVC] (CPU: Xeon 1231v3)
I noticed slowdowns while experimenting with several combinations of matrix sizes and simdpp instruction sets.
I implemented matrix multiplication in three different ways: plain C (raw intrinsics), unmodified simdpp, and my own modification of simdpp. All implementations resemble the code below:
```cpp
matrix<float, 4, 4> result;

// Load the four rows of the second matrix once, up front.
__m128 m2rows[4] =
{
    _mm_load_ps(m2.row(0).data()),
    _mm_load_ps(m2.row(1).data()),
    _mm_load_ps(m2.row(2).data()),
    _mm_load_ps(m2.row(3).data()),
};

auto m1row = m1.begin();
auto resultrow = result.begin();
for (; m1row != m1.end(); m1row += 4, resultrow += 4)
{
    // Broadcast each element of the current m1 row across a whole vector.
    __m128 m1row_v[4] =
    {
        _mm_load_ps1(m1row),
        _mm_load_ps1(m1row + 1),
        _mm_load_ps1(m1row + 2),
        _mm_load_ps1(m1row + 3)
    };
    // Scale each m2 row by the matching m1 element...
    __m128 temp_mul[4] =
    {
        _mm_mul_ps(m1row_v[0], m2rows[0]),
        _mm_mul_ps(m1row_v[1], m2rows[1]),
        _mm_mul_ps(m1row_v[2], m2rows[2]),
        _mm_mul_ps(m1row_v[3], m2rows[3]),
    };
    // ...and sum the four scaled rows into one row of the result.
    __m128 resultrow_v = _mm_add_ps(_mm_add_ps(temp_mul[0], temp_mul[1]),
                                    _mm_add_ps(temp_mul[2], temp_mul[3]));
    _mm_store_ps(resultrow, resultrow_v);
}
return result;
```
(Every structure is properly aligned and sized so that the corresponding SIMD type can be loaded/stored.)
Below I explain the performance problems observed for each combination of matrix size and instruction set.
Performance is measured as the average over 100000 iterations and 10 runs of each implementation.
Compiled with /O2 and /Ob2 optimizations (Release).
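For reference, the timing loop itself is nothing exotic; a minimal sketch of its shape is below. The helper name time_ns_per_call and the lambda-style callable are illustrative, not the exact benchmark code.

```cpp
#include <chrono>

// Hypothetical harness: returns the average cost of one call in nanoseconds.
// The measured callable must produce a result that is actually consumed,
// otherwise the optimizer may remove the work being timed.
template<typename MatMul>
double time_ns_per_call(MatMul mat_mul, int iterations = 100000)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        mat_mul();                     // e.g. a lambda wrapping one implementation
    auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```

Each implementation is timed this way 10 times and the ten averages are averaged again to give the numbers below.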
- matrix 4x4 (SSE2):
  - plain C: ~9ns
  - simdpp: ~9ns
  - personal modification of simdpp: ~9ns

  Here everything works as expected.
- matrix 8x8 (SSE2):
  - plain C: ~63ns
  - simdpp: ~377ns
  - personal modification of simdpp: ~64ns

  In this combination we can see a performance loss on the simdpp side. After analysing the implementation, my solution was to change the line where the actual load takes place, in the i_load(V& a, const char* p) template function. MSVC couldn't simplify the for loop written inside it; after changing it to add constexpr offsets to the pointer (the std::index_sequence trick, see the first sketch after this list), execution time decreased and matched plain C. This is the modification made in my personal version of simdpp.
- matrix 16x16 (SSE2):
  - plain C: ~272ns
  - simdpp: ~6532ns
  - personal modification of simdpp: ~270ns

  This combination exhibits the same problem as the previous one.
- matrix 4x4 (AVX2):
  - plain C (__m128 fallback): ~9ns
  - simdpp (__m128 fallback): ~593ns
  - personal modification of simdpp (__m128 fallback): ~9ns

  Even though nothing really changed (the code still uses __m128 internally, as in the SSE2 4x4 case), simdpp performance dropped significantly. After digging through the implementation and the Intel Intrinsics Guide I found a solution: replace _mm_broadcast_ss with _mm_load_ss followed by permute4<0,0,0,0>(v), exactly as if I hadn't enabled the AVX2 instruction set. The Intrinsics Guide lists _mm_broadcast_ss with a latency of 6 cycles at a throughput of 0.5 CPI, which means that for the four load_splats I issue, the CPU sits idle for roughly 22 cycles. With _mm_load_ss plus permute4<0,0,0,0>(v) (which is almost always equivalent to _mm_load_ps1) the CPU doesn't waste those cycles: the change translates to movss and shufps, both of which have a throughput of 1 CPI and a latency of 1 cycle (see the second sketch after this list).
- matrix 8x8 (AVX2):
  - plain C (__m256): ~33ns
  - simdpp (__m256): ~32ns
  - personal modification of simdpp (__m256): ~34ns

  Here simdpp performs as expected.
- matrix 16x16 (AVX2):
  - plain C (__m256): ~260ns
  - simdpp (__m256): ~2915ns
  - personal modification of simdpp (__m256): ~257ns

  Again simdpp takes a performance hit from the for loop, as explained in the second case.
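To make the i_load change concrete, here is a minimal sketch of the pattern; the names float32_multi, load_loop and load_unrolled are illustrative, not simdpp's actual internals. The runtime loop over sub-vectors is replaced by a pack expansion over std::index_sequence, so every pointer offset becomes a compile-time constant and MSVC emits straight-line loads.

```cpp
#include <cstddef>
#include <utility>
#include <xmmintrin.h>

// Stand-in for a multi-register vector: N packed __m128 values.
template<std::size_t N>
struct float32_multi {
    __m128 d[N];
};

// Loop version: MSVC kept the loop and its induction variable,
// which is where the 8x8/16x16 slowdown came from.
template<std::size_t N>
void load_loop(float32_multi<N>& a, const float* p)
{
    for (std::size_t i = 0; i < N; ++i)
        a.d[i] = _mm_load_ps(p + i * 4);
}

// index_sequence version (C++17 fold expression): every offset is a
// compile-time constant, so the compiler emits N straight-line loads.
template<std::size_t N, std::size_t... Is>
void load_unrolled_impl(float32_multi<N>& a, const float* p, std::index_sequence<Is...>)
{
    ((a.d[Is] = _mm_load_ps(p + Is * 4)), ...);
}

template<std::size_t N>
void load_unrolled(float32_multi<N>& a, const float* p)
{
    load_unrolled_impl(a, p, std::make_index_sequence<N>{});
}
```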
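The second sketch is the _mm_broadcast_ss replacement from the AVX2 4x4 case, written directly with intrinsics rather than through simdpp's permute4<0,0,0,0> wrapper; it shows roughly what the change compiles to, not the exact library code.

```cpp
#include <immintrin.h>  // AVX must be enabled for _mm_broadcast_ss

// What the AVX path emits: a single vbroadcastss per splat.
static inline __m128 splat_broadcast(const float* p)
{
    return _mm_broadcast_ss(p);
}

// The replacement: movss + shufps, equivalent to _mm_load_ps1,
// which avoids the back-to-back broadcast latency described above.
static inline __m128 splat_load_shuffle(const float* p)
{
    __m128 v = _mm_load_ss(p);
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
```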
The problems described above also occur elsewhere in the simdpp implementation (e.g. the add, mul and sub functions). I haven't tested this code with G++ or Clang on my platform, but I suspect some of these problems can still occur there (e.g. the _mm_broadcast_ss one). I also haven't measured these matrix multiplications with AVX512F, since I don't have a CPU supporting that instruction set, but similar issues may be present there as well.
It would be interesting to reproduce this benchmark with gcc and clang.
I've tested them now, and these slowdowns are not present with either gcc (9.3.0) or clang (9.0.1). From my investigation, the slowdowns are mainly due to MSVC's inferior optimization.