ROCm/rocPRIM

Question: Performance of radix sort with radix 6 and 7

avinashcpandey opened this issue · 5 comments

I see the radix sort implementation using 5 iterations: 7 + 7 + 6 + 6 + 6 = 32 bits.
Usually we use radix4 implementation. With radix6 and 7 how much performance you are getting?
As per my understanding when we increase the radix it will increase the histogram size like for radix 7 it will be 2^7=128 items per thread......i see AMD GPU has only 64kb of LDS. If we use 256 threads then the histogram size will be 128*256 = 3072 integers. This will greatly reduce the performance as only few wavefronts would be able to run on each CU due to lack of LDS space.

Correct me if i am wrong.
Can you put some light on the implementation with radix6 and 7size.?

You may run benchmark_hip_device_radix_sort.
I have one old file with results but I'm not sure the results are up-to-date (but should be close IIRC):

Vega20

sort_keys<int>/manual_time                               111 ms          1 ms          6   11.2297GB/s   2.80742G items/s
sort_keys<long long>/manual_time                         287 ms          1 ms          2   8.69748GB/s   1113.28M items/s
sort_keys<char>/manual_time                               30 ms          1 ms         23   10.3015GB/s   10.3015G items/s
sort_keys<short>/manual_time                              55 ms          1 ms         13   11.3174GB/s   5.65868G items/s
sort_pairs<int, float>/manual_time                       153 ms          1 ms          5   16.3351GB/s   2.04189G items/s
sort_pairs<int, double>/manual_time                      264 ms          1 ms          3   14.2078GB/s   1.18398G items/s
sort_pairs<int, custom_float2>/manual_time               265 ms          1 ms          3   14.1767GB/s   1.18139G items/s
sort_pairs<int, custom_double2>/manual_time              300 ms          1 ms          2   20.8282GB/s    1066.4M items/s
sort_pairs<long long, float>/manual_time                 550 ms          1 ms          1   6.81494GB/s   581.542M items/s
sort_pairs<long long, double>/manual_time                653 ms          1 ms          1   7.65336GB/s   489.815M items/s
sort_pairs<long long, custom_float2>/manual_time         630 ms          1 ms          1   7.93732GB/s   507.989M items/s
sort_pairs<long long, custom_double2>/manual_time        650 ms          1 ms          1   11.5404GB/s   492.389M items/s

9300

sort_keys<int>/manual_time                               133 ms          1 ms          5   9.37972GB/s   2.34493G items/s
sort_keys<long long>/manual_time                         409 ms          2 ms          2   6.11434GB/s   782.635M items/s
sort_keys<char>/manual_time                               31 ms          0 ms         23   10.1063GB/s   10.1063G items/s
sort_keys<short>/manual_time                              65 ms          1 ms         12   9.60201GB/s     4.801G items/s
sort_pairs<int, float>/manual_time                       198 ms          1 ms          3   12.5979GB/s   1.57473G items/s
sort_pairs<int, double>/manual_time                      278 ms          1 ms          3   13.4663GB/s   1.12219G items/s
sort_pairs<int, custom_float2>/manual_time               243 ms          1 ms          3   15.4204GB/s   1.28503G items/s
sort_pairs<int, custom_double2>/manual_time              540 ms          1 ms          1   11.5652GB/s    592.14M items/s
sort_pairs<long long, float>/manual_time                 571 ms          2 ms          1   6.56262GB/s   560.011M items/s
sort_pairs<long long, double>/manual_time                739 ms          3 ms          1   6.76718GB/s     433.1M items/s
sort_pairs<long long, custom_float2>/manual_time         681 ms          2 ms          1   7.34559GB/s   470.118M items/s
sort_pairs<long long, custom_double2>/manual_time       1268 ms          3 ms          1   5.91699GB/s   252.458M items/s

MI25

sort_keys<int>/manual_time                                98 ms          1 ms          7   12.7012GB/s   3.17529G items/s
sort_keys<long long>/manual_time                         332 ms          3 ms          2   7.51984GB/s   962.539M items/s
sort_keys<char>/manual_time                               21 ms          1 ms         34   15.2263GB/s   15.2263G items/s
sort_keys<short>/manual_time                              39 ms          1 ms         18   15.9251GB/s   7.96255G items/s
sort_pairs<int, float>/manual_time                       182 ms          1 ms          4   13.7336GB/s    1.7167G items/s
sort_pairs<int, double>/manual_time                      222 ms          1 ms          3   16.9125GB/s   1.40938G items/s
sort_pairs<int, custom_float2>/manual_time               222 ms          1 ms          3   16.8677GB/s   1.40564G items/s
sort_pairs<int, custom_double2>/manual_time              468 ms          1 ms          2   13.3555GB/s   683.804M items/s
sort_pairs<long long, float>/manual_time                 503 ms          1 ms          1   7.45113GB/s    635.83M items/s
sort_pairs<long long, double>/manual_time                604 ms          1 ms          1   8.28461GB/s   530.215M items/s
sort_pairs<long long, custom_float2>/manual_time         598 ms          3 ms          1   8.35538GB/s   534.744M items/s
sort_pairs<long long, custom_double2>/manual_time       1061 ms          3 ms          1   7.06978GB/s   301.644M items/s

This is for 32M items.

Usually we use radix4 implementation.

What performance do you have for radix4?

I'll answer your question about histogram a bit later.

I think that I can say that we don't use histograms the way you described.

So for the "7 + 7 + 6 + 6 + 6" case we run 5 iterations of these steps (let's use 7 bits as an example):

  1. calculate global histograms: how many items for each radix value (0..127) in each block
  2. calculate where each block will write items of each radix value (128 scans)
  3. calculate where items of each radix value start (one scan)
  4. sort items in each block (BlockSize * ItemsPerThread items)
  5. scatter sorted items using positions calculated in 2 and 3

Step 4 doesn't use histograms for block-level sorting, in fact you may consider block_radix_sort as 7 iterations of 1-bit sorting, i.e. it's something like radix1 histogram. To calculate such "histograms" I use quite efficient "bit" scan which uses ballot and masked bit count for warp-level scan.

So LDS is used only for exchanging items between threads, we need to store BlockSize * ItemsPerThread items (256 * 15 ints, for example).

With Radix4 i see below performance on Vega10 which has 64 compute Units
Performance of Radix Sort key value pair where key is float type
elsatp time 21.4826ms
szElements 32000000
This timing doesn't include data transfer from host to device and device to host
What is the performance with rocPRIM on key value pair where key is float?
Are you seeing better than this?

More performance data:

Size Time in ms
50000 0.438831
100000 0.474663
200000 0.520248
400000 0.672608
800000 1.00078
1600000 1.48771
3200000 2.62033
6400000 4.63337
12800000 8.84509
25600000 17.3193
51200000 34.1521
102400000 66.0865
204800000 132.304

Newer results on ROCm 2.2/2.3:

-----------------------------------------------------------------------------------------
Benchmark                                                  Time           CPU Iterations
-----------------------------------------------------------------------------------------

Vega 20 [Radeon Pro Vega 20] 2.2

sort_keys<int>/manual_time                                82 ms          2 ms          9   15.2244GB/s   3.80609G items/s
sort_keys<long long>/manual_time                         218 ms         73 ms          3   11.4501GB/s   1.43127G items/s
sort_keys<float>/manual_time                              80 ms         80 ms          9   15.6974GB/s   3.92435G items/s
sort_keys<double>/manual_time                            217 ms        217 ms          3   11.5044GB/s   1.43805G items/s
sort_keys<char>/manual_time                               19 ms         18 ms         37   16.2946GB/s   16.2946G items/s
sort_keys<short>/manual_time                              38 ms         37 ms         18   16.2368GB/s   8.11839G items/s
sort_pairs<int, float>/manual_time                       121 ms          2 ms          6   20.6264GB/s    2.5783G items/s
sort_pairs<int, double>/manual_time                      197 ms          1 ms          4     19.05GB/s    1.5875G items/s
sort_pairs<int, custom_float2>/manual_time               200 ms          1 ms          4   18.7696GB/s   1.56413G items/s
sort_pairs<int, custom_double2>/manual_time              275 ms        273 ms          3   22.7394GB/s   1.13697G items/s
sort_pairs<float, float>/manual_time                     114 ms        114 ms          6   21.9102GB/s   2.73878G items/s
sort_pairs<float, double>/manual_time                    190 ms        188 ms          4   19.7542GB/s   1.64618G items/s
sort_pairs<float, custom_float2>/manual_time             192 ms        190 ms          4   19.5257GB/s   1.62714G items/s
sort_pairs<float, custom_double2>/manual_time            257 ms        255 ms          3   24.3047GB/s   1.21524G items/s
sort_pairs<long long, float>/manual_time                 404 ms        400 ms          2   9.28342GB/s   792.186M items/s
sort_pairs<long long, double>/manual_time                479 ms        475 ms          2   10.4415GB/s   668.254M items/s
sort_pairs<long long, custom_float2>/manual_time         473 ms        473 ms          2   10.5676GB/s   676.323M items/s
sort_pairs<long long, custom_double2>/manual_time        626 ms        623 ms          1   11.9751GB/s   510.938M items/s
sort_pairs<double, float>/manual_time                    409 ms          2 ms          2   9.16793GB/s    782.33M items/s
sort_pairs<double, double>/manual_time                   484 ms          2 ms          2   10.3371GB/s   661.576M items/s
sort_pairs<double, custom_float2>/manual_time            478 ms          2 ms          2   10.4629GB/s   669.627M items/s
sort_pairs<double, custom_double2>/manual_time           593 ms        593 ms          1   12.6523GB/s   539.833M items/s

Fiji [Radeon R9 FURY / NANO Series] 2.3 -DCMAKE_CXX_FLAGS=-DROCPRIM_TARGET_ARCH=803

sort_keys<int>/manual_time                               147 ms          2 ms          4   8.47622GB/s   2.11906G items/s
sort_keys<long long>/manual_time                         423 ms          4 ms          2    5.9156GB/s   757.197M items/s
sort_keys<float>/manual_time                             148 ms         30 ms          5   8.44191GB/s   2.11048G items/s
sort_keys<double>/manual_time                            443 ms        443 ms          2   5.63737GB/s   721.583M items/s
sort_keys<char>/manual_time                               31 ms         31 ms         22   10.0914GB/s   10.0914G items/s
sort_keys<short>/manual_time                              68 ms         68 ms         12   9.25364GB/s   4.62682G items/s
sort_pairs<int, float>/manual_time                       206 ms        206 ms          3   12.1418GB/s   1.51773G items/s
sort_pairs<int, double>/manual_time                      293 ms        293 ms          2   12.8072GB/s   1092.88M items/s
sort_pairs<int, custom_float2>/manual_time               294 ms        294 ms          2   12.7493GB/s   1087.94M items/s
sort_pairs<int, custom_double2>/manual_time              564 ms        564 ms          1   11.0839GB/s   567.494M items/s
sort_pairs<float, float>/manual_time                     204 ms        204 ms          3   12.2681GB/s   1.53352G items/s
sort_pairs<float, double>/manual_time                    290 ms        290 ms          2   12.9205GB/s   1102.55M items/s
sort_pairs<float, custom_float2>/manual_time             295 ms        295 ms          2   12.7127GB/s   1084.82M items/s
sort_pairs<float, custom_double2>/manual_time            499 ms        499 ms          2   12.5331GB/s   641.695M items/s
sort_pairs<long long, float>/manual_time                 700 ms        700 ms          1   5.35667GB/s   457.103M items/s
sort_pairs<long long, double>/manual_time                766 ms        766 ms          1   6.52996GB/s   417.918M items/s
sort_pairs<long long, custom_float2>/manual_time         776 ms          1 ms          1   6.44308GB/s   412.357M items/s
sort_pairs<long long, custom_double2>/manual_time       1324 ms          2 ms          1   5.66297GB/s    241.62M items/s
sort_pairs<double, float>/manual_time                    609 ms          3 ms          1    6.1557GB/s   525.286M items/s
sort_pairs<double, double>/manual_time                   811 ms          2 ms          1   6.16156GB/s    394.34M items/s
sort_pairs<double, custom_float2>/manual_time            805 ms          3 ms          1   6.20775GB/s   397.296M items/s
sort_pairs<double, custom_double2>/manual_time          1223 ms          3 ms          1   6.13253GB/s   261.654M items/s

Vega 10 [Radeon Instinct MI25] 2.3

sort_keys<int>/manual_time                                86 ms          2 ms          8   14.5287GB/s   3.63219G items/s
sort_keys<long long>/manual_time                         301 ms          3 ms          2   8.30259GB/s   1062.73M items/s
sort_keys<float>/manual_time                              85 ms         83 ms          8   14.7651GB/s   3.69127G items/s
sort_keys<double>/manual_time                            300 ms        298 ms          2   8.33531GB/s   1066.92M items/s
sort_keys<char>/manual_time                               19 ms         18 ms         36   16.7683GB/s   16.7683G items/s
sort_keys<short>/manual_time                              39 ms         38 ms         18   16.0629GB/s   8.03145G items/s
sort_pairs<int, float>/manual_time                       141 ms        140 ms          5   17.7015GB/s   2.21269G items/s
sort_pairs<int, double>/manual_time                      194 ms          1 ms          4   19.2917GB/s   1.60764G items/s
sort_pairs<int, custom_float2>/manual_time               196 ms          1 ms          4   19.0876GB/s   1.59064G items/s
sort_pairs<int, custom_double2>/manual_time              340 ms          1 ms          2   18.3827GB/s   941.196M items/s
sort_pairs<float, float>/manual_time                     134 ms         81 ms          5   18.6017GB/s   2.32521G items/s
sort_pairs<float, double>/manual_time                    188 ms        186 ms          4   19.9576GB/s   1.66314G items/s
sort_pairs<float, custom_float2>/manual_time             190 ms        188 ms          4   19.7484GB/s    1.6457G items/s
sort_pairs<float, custom_double2>/manual_time            321 ms        319 ms          2   19.4688GB/s   996.804M items/s
sort_pairs<long long, float>/manual_time                 460 ms        456 ms          2   8.15707GB/s    696.07M items/s
sort_pairs<long long, double>/manual_time                552 ms        549 ms          1   9.05005GB/s   579.203M items/s
sort_pairs<long long, custom_float2>/manual_time         549 ms        549 ms          1   9.10125GB/s    582.48M items/s
sort_pairs<long long, custom_double2>/manual_time        804 ms        801 ms          1   9.32271GB/s   397.769M items/s
sort_pairs<double, float>/manual_time                    461 ms        457 ms          2   8.13677GB/s   694.338M items/s
sort_pairs<double, double>/manual_time                   544 ms        540 ms          1   9.18701GB/s   587.969M items/s
sort_pairs<double, custom_float2>/manual_time            543 ms        539 ms          1   9.21393GB/s   589.692M items/s
sort_pairs<double, custom_double2>/manual_time           779 ms        775 ms          1   9.62865GB/s   410.822M items/s

Integer keys are uniformly random [std::numeric_limits<key_type>::min(), std::numeric_limits<key_type>::max()], floating point keys are uniformly random [-1000, 1000]. This explains why benchmarks with float keys are a bit faster than ints: not all bits as random -> less scattered writes to global memory.
If you want you can change this behavior in the benchmark source to match your keys generation scheme.

So Vega 10 [Radeon Instinct MI25] for 32M float-float shows 2.32521G items/s, 13.4 ms (the Time column is sum of 10 runs).

Also, configs (radix sizes, items per thread) may be non-optimal for some types on the current ROCm versions. So config tuning may increase performance slightly.

cool. Thanks Anton!