intel/x86-simd-sort

compile and use argsort/argselect with avx512 by g++9/10 is not safe.

Closed this issue · 1 comments

It might not be a bug, but it's better to be a warning for user.

Description

g++-9 with -O2 above and avx512 enabled will compile the argsort_n_vec(of x86-simd-sort) with some mmx instructions (which use registers MM0~MM7 are ususally alias to x87 ST(0) ~ ST(7)), but won't clear x87 fpu state.

So operations on long double, which used x87 instructions default, will perform different result (NAN as mainly).

The worse thing is that long double is used by prime_hashtable_policy which is the default backend of unorderd_map in g++-9.

If we call unordered_map::rehash after argsort/argselect(of x86-simd-sort), rehash(size_t(-1)) was performed, and std::bad_alloc was raised.

Unfortunately g++-9 is end of life. But if someone can upgrade g++ to newer, this problem will disappered.

Workaround

Any one below can work, but also with their own side effect.

  1. compile with -mno-mmx
  2. make long double not to use x87 with -mlong-double-64 or -mlong-double-128

POC

Environment

  1. os: ubuntu 18.04/20.04 with gcc-9.4 or ubuntu 22.04/24.04 with gcc-9.5 or gcc 9.5 docker offical image (it shall not be an os case.)
  2. cpu: Intel 13700H with sde, Intel 11700B, AMD 7950x, Intel Gold Xeon 6145. ( it shall not be a hardware case.)

Code

compile below code with

g++-9 -std=c++17 -Wall -O2 -march=skylake-avx512 -o poc poc.cc
// filename: poc.cc
#include <unordered_map>
#include "x86-simd-sort/src/x86simdsort-static-incl.h"

double calc(size_t size) {
    auto ptr = new double[size]{0};
    auto arg = x86simdsortStatic::argselect(ptr, 0, size);
#ifdef PATCH
    asm volatile ("FNINIT"); // clear x87 fpu state.
#endif
    auto ret = ptr[arg[0]];
    delete[] ptr;
    return ret;
}

int main() {
    std::unordered_map<size_t, size_t> amap;
    amap.clear();
    calc(65); // [65, 256]
    amap.rehash(1); // <- crash here
    return 0;
}

screenshot

poc-issue

simple patch

compile the code with below, and no std::bad_alloc was thrown.

g++-9 -std=c++17 -Wall -O2 -DPATCH -march=skylake-avx512 -o poc poc.cc

gdb script to check x87 fpu is used and the state is not cleared

the ST(0) value should be 1.

gdb \
    -ex "b *$(objdump -d --prefix-address poc | grep fildll | awk '{$2 = substr($2, 2, length($2)-2); print $2}')" \
    -ex 'r' \
    -ex 'print "ST(0) before do size_t to long double convert" ' \
    -ex 'print $st0' \
    -ex 'print "The size_t value to be converted" ' \
    -ex 'print *(size_t *)($rsp+0x8)' \
    -ex 'print "ST(0) after do size_t to long double convert" ' \
    -ex 'si' \
    -ex 'print $st0' \
    --batch --silent ./poc | grep '\$'

screenshot without patch

the ST(0) value is -NAN which means there something wrong with x87 fpu.

poc-gdb

screenshot after patch

the ST(0) value is 1 as it expected to be.

poc-after-patch-gdb

Further reading, what code is genered by g++9, and where is it in argsort.

source code in argsort_n_vec

typename keyType::opmask_t ioMasks[numVecs - numVecs / 2];
X86_SIMD_SORT_UNROLL_LOOP(64)
for (int i = numVecs / 2, j = 0; i < numVecs; i++, j++) {
uint64_t num_to_read
= std::min((uint64_t)std::max(0, N - i * keyType::numlanes),
(uint64_t)keyType::numlanes);

mmx instructions g++-9 generated.

    1bad:       0f ef ed                pxor   %mm5,%mm5
    1bc9:       0f ef f6                pxor   %mm6,%mm6
    1c09:       0f ef e4                pxor   %mm4,%mm4
    1c41:       0f ef db                pxor   %mm3,%mm3
    1c71:       0f ef ff                pxor   %mm7,%mm7
    1c85:       0f ef c9                pxor   %mm1,%mm1
    1c98:       0f ef c0                pxor   %mm0,%mm0
    1ccf:       0f ef d2                pxor   %mm2,%mm2
    db88:       0f ef c0                pxor   %mm0,%mm0
    db9b:       0f ef db                pxor   %mm3,%mm3
    dbae:       0f ef d2                pxor   %mm2,%mm2
    dbc0:       0f ef c9                pxor   %mm1,%mm1

Another poc without unordered_map

#include <cstdio>
#include "x86-simd-sort/src/x86simdsort-static-incl.h"

double calc(size_t size) {
    auto ptr = new double[size]{1};
    auto arg = x86simdsortStatic::argsort(ptr, size, false, true);
#ifdef PATCH
    asm volatile ("FNINIT");
#endif
    auto ret = ptr[arg[0]];
    delete[] ptr;
    return ret;
}

int main() {
    double ret = calc(65); // [65, 256]
    size_t a = 1ull;
    double ret_1 = a / ret;
    long double lret_1 = a / (long double)ret;
    printf("ret=%lf, 1/ret=%lf, 1/(long double)ret=%Lf\n", ret, ret_1, lret_1);
    return std::isnan(lret_1);
}

screenshot

poc2

screenshot after patch

poc2-patch

The code to check that g++-10 is not safe either.

compile g++-10 -std=c++17 -Wall -O2 -march=skylake-avx512 -o poc poc.cc and run ./poc 10

#include <cstdio>
#include "x86-simd-sort/src/x86simdsort-static-incl.h"

double calc(size_t size) {
    auto ptr = new double[size]{1};
    auto arg = x86simdsortStatic::argsort(ptr, size, false, true);
#ifdef PATCH
    // asm volatile ("FNINIT");
    asm volatile ("EMMS");
#endif
    auto ret = ptr[arg[0]];
    delete[] ptr;
    return ret;
}

int main(int argc, const char * argv[]) {
    double ret = calc(256); // [65, 256]
    size_t a = std::atoll(argv[1]);
    double ret_a = a / ret;
    long double lret_a = a / (long double)ret;
    printf("ret=%lf, %lu/ret=%lf, %lu/(long double)ret=%Lf\n", ret, a, ret_a, a, lret_a);
    return std::isnan(lret_a);
}

thanks for letting us know. Probably good to a add note in the README section to build using -mno-mmx flags when using g++ v9. Would you like to add a PR with that note?