intel/x86-simd-sort

Can I use this lib to output the original index of these sorted data

Opened this issue · 6 comments

This lib looks awesome!
But I have a requirement to output the original index of these sorted data, for example the input array is {300 200 400 99 150 50 230 250}, then the sorted result should be {50 99 150 200 230 250 300 400}, and I want to get the original index as output {5, 3, 4, 1, 6, 7, 0, 2}.
Currently, I use the std::sort with a lambda to implement this require, but the performance is not good.

Hi, If you pass in the array and array of indices initialized to [0, 1, 2, 3, .. n] as arguments for key-value pair, PR #2 can be adapted to what you want. We have to add benchmarks to compare this to your implementation of std::sort to know how much we actually benefit from this.

Hi, If you pass in the array and array of indices initialized to [0, 1, 2, 3, .. n] as arguments for key-value pair, PR #2 can be adapted to what you want. We have to add benchmarks to compare this to your implementation of std::sort to know how much we actually benefit from this.

Thank you very much for this useful information and your contribute to this lib, PR #2 works well in my project, there's only one issue that avx512_qsort_kv(float*keys, uint16_t *indexes, int64_t arrsize) is not support yet, performance of current avx512_qsort_kv(double *keys, uint64_t *indexes, int64_t arrsize) is a little worse than my simple bubble sorting.

Hope that avx512_qsort_kv(float*keys, uint16_t *indexes, int64_t arrsize) will be much better than my current implementation.

Thanks again!

avx512_qsort_kv(double *keys, uint64_t *indexes, int64_t arrsize) is a little worse than my simple bubble sorting.

@xiangyunzhou do you mind sharing your bubble sort so we can compare and add it to benchmarks? If a simple bubble sort performs as much, then what would be the point of all the complicated code.

@r-devulap below is my implementation and UT in my project. I want to get the top 32 index of 400 sorting by a float pfWeight. icpx with flags: -march=icelake-server and -O3.
Because avx512_qsort_kv doesn't support avx512_qsort_kv(float*keys, uint16_t *indexes, int64_t arrsize), so I need to store the key in u64 and value in double, which is not fair to compare with bubble, but it is about 10 times better than std::sort

#define MAX_CS1_NUM                       (400)
#define MAX_CS2_NUM                       (32)

uint32_t find_index_of_min(uint32_t num, const uint16_t *indexList, const float *valueList)
{
    uint32_t idx   = 0;
    float minValue = valueList[indexList[0]];
    for (uint32_t i = 1; i < num; i++)
    {
        const float value = valueList[indexList[i]];
        if(value < minValue)
        {
            minValue = value;
            idx      = i;
        }
    }
    return idx;
}

uint32_t mac_cell_select_cs2_by_pfweight(uint32_t numCs1, uint16_t *ueindexCs1, uint16_t *ueindexCs2, const float *pfWeight)
{
    uint32_t numCs2 = 0;
    while (numCs1 > 0 && numCs2 < MAX_CS2_NUM)
    {
        const uint32_t idx   = find_index_of_min(numCs1, ueindexCs1, pfWeight);
        assert(idx < MAX_CS1_NUM);
        ueindexCs2[numCs2++] = ueindexCs1[idx];
        ueindexCs1[idx]      = ueindexCs1[numCs1 - 1];
        numCs1--;
    }
    return numCs2;
}

TEST_F(MacCell, case_mac_cell_select_cs2_by_pfweight)
{
    float ueAdr[MAX_NUM_UE];
    float ueSe[MAX_NUM_UE];
    float uePfWeight[MAX_NUM_UE]={0.0};
    double uePfWeight1[MAX_NUM_UE]={0.0};

    for (uint32_t i = 0; i < MAX_NUM_UE; i++)
    {
        ueAdr[i] = ((float)rand())/13;
        ueSe[i] = ((float)rand())/13;
    }
    mac_cell_cal_ue_pfweight(MAX_NUM_UE, ueAdr, ueSe, uePfWeight);

    const uint32_t numCs1 = MAX_CS1_NUM;
    uint16_t ueindexCs1Input[MAX_CS1_NUM], ueindexCs2[MAX_CS2_NUM];
    uint64_t ueindexCs1[MAX_CS1_NUM];
    for (uint32_t i = 0; i < numCs1; i++)
    {
        ueindexCs1[i] = i;
        ueindexCs1Input[i] = i;
        uePfWeight1[i] = uePfWeight[i];
    }
    //memcpy(ueindexCs1Input, ueindexCs1, sizeof(ueindexCs1Input));

    const uint64_t functionStart = timer_get_ticks();
    const uint32_t numCs2 = mac_cell_select_cs2_by_pfweight(numCs1, ueindexCs1Input, ueindexCs2, uePfWeight);
    const uint64_t functionEnd = timer_get_ticks();
    std::sort(ueindexCs1, ueindexCs1 + numCs1, [uePfWeight1](uint16_t x, uint16_t y){return uePfWeight1[x] < uePfWeight1[y];});
    //avx512_qsort_kv(uePfWeight1, ueindexCs1, numCs1);
    const uint64_t sortEnd = timer_get_ticks();

    printf("mac_cell_select_cs2_by_pfweight time:%lu, std::sort time:%lu\n", functionEnd - functionStart, sortEnd - functionEnd);
    ASSERT_LT(functionEnd - functionStart, sortEnd - functionEnd);
    ASSERT_EQ(numCs2, MAX_CS2_NUM);
    for (uint32_t i = 0; i < numCs2; i++)
    {
        ASSERT_EQ(ueindexCs2[i], ueindexCs1[i]);
    }
}

I want to get the top 32 index of 400 sorting by a float pfWeight.

Have you tried benchmarking for a larger array? The code size of AVX512 qsort is fairly large when compared to std::sort or your bubble sort. If you are sorting an array of just 400 elements, there might be a big front end penalty for AVX512 qsort which might offset all benefits you get from it (not sure if this is true but just wondering if that is why you are not seeing any perf gains).

I want to get the top 32 index of 400 sorting by a float pfWeight.

Have you tried benchmarking for a larger array? The code size of AVX512 qsort is fairly large when compared to std::sort or your bubble sort. If you are sorting an array of just 400 elements, there might be a big front end penalty for AVX512 qsort which might offset all benefits you get from it (not sure if this is true but just wondering if that is why you are not seeing any perf gains).

I have tried to increase the array from 400 to 20480, seems still no benefit from AVX512 qsort. I start to get benefit from the AVX512 qsort when I increase the top items from 32 to 128.