Question: Performance of radix sort with radix 6 and 7
avinashcpandey opened this issue · 5 comments
I see the radix sort implementation using 5 iterations: 7 + 7 + 6 + 6 + 6 = 32 bits.
Usually we use radix4 implementation. With radix6 and 7 how much performance you are getting?
As per my understanding when we increase the radix it will increase the histogram size like for radix 7 it will be 2^7=128 items per thread......i see AMD GPU has only 64kb of LDS. If we use 256 threads then the histogram size will be 128*256 = 3072 integers. This will greatly reduce the performance as only few wavefronts would be able to run on each CU due to lack of LDS space.
Correct me if i am wrong.
Can you put some light on the implementation with radix6 and 7size.?
You may run benchmark_hip_device_radix_sort
.
I have one old file with results but I'm not sure the results are up-to-date (but should be close IIRC):
Vega20
sort_keys<int>/manual_time 111 ms 1 ms 6 11.2297GB/s 2.80742G items/s
sort_keys<long long>/manual_time 287 ms 1 ms 2 8.69748GB/s 1113.28M items/s
sort_keys<char>/manual_time 30 ms 1 ms 23 10.3015GB/s 10.3015G items/s
sort_keys<short>/manual_time 55 ms 1 ms 13 11.3174GB/s 5.65868G items/s
sort_pairs<int, float>/manual_time 153 ms 1 ms 5 16.3351GB/s 2.04189G items/s
sort_pairs<int, double>/manual_time 264 ms 1 ms 3 14.2078GB/s 1.18398G items/s
sort_pairs<int, custom_float2>/manual_time 265 ms 1 ms 3 14.1767GB/s 1.18139G items/s
sort_pairs<int, custom_double2>/manual_time 300 ms 1 ms 2 20.8282GB/s 1066.4M items/s
sort_pairs<long long, float>/manual_time 550 ms 1 ms 1 6.81494GB/s 581.542M items/s
sort_pairs<long long, double>/manual_time 653 ms 1 ms 1 7.65336GB/s 489.815M items/s
sort_pairs<long long, custom_float2>/manual_time 630 ms 1 ms 1 7.93732GB/s 507.989M items/s
sort_pairs<long long, custom_double2>/manual_time 650 ms 1 ms 1 11.5404GB/s 492.389M items/s
9300
sort_keys<int>/manual_time 133 ms 1 ms 5 9.37972GB/s 2.34493G items/s
sort_keys<long long>/manual_time 409 ms 2 ms 2 6.11434GB/s 782.635M items/s
sort_keys<char>/manual_time 31 ms 0 ms 23 10.1063GB/s 10.1063G items/s
sort_keys<short>/manual_time 65 ms 1 ms 12 9.60201GB/s 4.801G items/s
sort_pairs<int, float>/manual_time 198 ms 1 ms 3 12.5979GB/s 1.57473G items/s
sort_pairs<int, double>/manual_time 278 ms 1 ms 3 13.4663GB/s 1.12219G items/s
sort_pairs<int, custom_float2>/manual_time 243 ms 1 ms 3 15.4204GB/s 1.28503G items/s
sort_pairs<int, custom_double2>/manual_time 540 ms 1 ms 1 11.5652GB/s 592.14M items/s
sort_pairs<long long, float>/manual_time 571 ms 2 ms 1 6.56262GB/s 560.011M items/s
sort_pairs<long long, double>/manual_time 739 ms 3 ms 1 6.76718GB/s 433.1M items/s
sort_pairs<long long, custom_float2>/manual_time 681 ms 2 ms 1 7.34559GB/s 470.118M items/s
sort_pairs<long long, custom_double2>/manual_time 1268 ms 3 ms 1 5.91699GB/s 252.458M items/s
MI25
sort_keys<int>/manual_time 98 ms 1 ms 7 12.7012GB/s 3.17529G items/s
sort_keys<long long>/manual_time 332 ms 3 ms 2 7.51984GB/s 962.539M items/s
sort_keys<char>/manual_time 21 ms 1 ms 34 15.2263GB/s 15.2263G items/s
sort_keys<short>/manual_time 39 ms 1 ms 18 15.9251GB/s 7.96255G items/s
sort_pairs<int, float>/manual_time 182 ms 1 ms 4 13.7336GB/s 1.7167G items/s
sort_pairs<int, double>/manual_time 222 ms 1 ms 3 16.9125GB/s 1.40938G items/s
sort_pairs<int, custom_float2>/manual_time 222 ms 1 ms 3 16.8677GB/s 1.40564G items/s
sort_pairs<int, custom_double2>/manual_time 468 ms 1 ms 2 13.3555GB/s 683.804M items/s
sort_pairs<long long, float>/manual_time 503 ms 1 ms 1 7.45113GB/s 635.83M items/s
sort_pairs<long long, double>/manual_time 604 ms 1 ms 1 8.28461GB/s 530.215M items/s
sort_pairs<long long, custom_float2>/manual_time 598 ms 3 ms 1 8.35538GB/s 534.744M items/s
sort_pairs<long long, custom_double2>/manual_time 1061 ms 3 ms 1 7.06978GB/s 301.644M items/s
This is for 32M items.
Usually we use radix4 implementation.
What performance do you have for radix4?
I'll answer your question about histogram a bit later.
I think that I can say that we don't use histograms the way you described.
So for the "7 + 7 + 6 + 6 + 6" case we run 5 iterations of these steps (let's use 7 bits as an example):
- calculate global histograms: how many items for each radix value (0..127) in each block
- calculate where each block will write items of each radix value (128 scans)
- calculate where items of each radix value start (one scan)
- sort items in each block (
BlockSize * ItemsPerThread
items) - scatter sorted items using positions calculated in 2 and 3
Step 4 doesn't use histograms for block-level sorting, in fact you may consider block_radix_sort
as 7 iterations of 1-bit sorting, i.e. it's something like radix1 histogram. To calculate such "histograms" I use quite efficient "bit" scan which uses ballot and masked bit count for warp-level scan.
So LDS is used only for exchanging items between threads, we need to store BlockSize * ItemsPerThread
items (256 * 15 ints
, for example).
With Radix4 i see below performance on Vega10 which has 64 compute Units
Performance of Radix Sort key value pair where key is float type
elsatp time 21.4826ms
szElements 32000000
This timing doesn't include data transfer from host to device and device to host
What is the performance with rocPRIM on key value pair where key is float?
Are you seeing better than this?
More performance data:
Size | Time in ms |
---|---|
50000 | 0.438831 |
100000 | 0.474663 |
200000 | 0.520248 |
400000 | 0.672608 |
800000 | 1.00078 |
1600000 | 1.48771 |
3200000 | 2.62033 |
6400000 | 4.63337 |
12800000 | 8.84509 |
25600000 | 17.3193 |
51200000 | 34.1521 |
102400000 | 66.0865 |
204800000 | 132.304 |
Newer results on ROCm 2.2/2.3:
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------
Vega 20 [Radeon Pro Vega 20] 2.2
sort_keys<int>/manual_time 82 ms 2 ms 9 15.2244GB/s 3.80609G items/s
sort_keys<long long>/manual_time 218 ms 73 ms 3 11.4501GB/s 1.43127G items/s
sort_keys<float>/manual_time 80 ms 80 ms 9 15.6974GB/s 3.92435G items/s
sort_keys<double>/manual_time 217 ms 217 ms 3 11.5044GB/s 1.43805G items/s
sort_keys<char>/manual_time 19 ms 18 ms 37 16.2946GB/s 16.2946G items/s
sort_keys<short>/manual_time 38 ms 37 ms 18 16.2368GB/s 8.11839G items/s
sort_pairs<int, float>/manual_time 121 ms 2 ms 6 20.6264GB/s 2.5783G items/s
sort_pairs<int, double>/manual_time 197 ms 1 ms 4 19.05GB/s 1.5875G items/s
sort_pairs<int, custom_float2>/manual_time 200 ms 1 ms 4 18.7696GB/s 1.56413G items/s
sort_pairs<int, custom_double2>/manual_time 275 ms 273 ms 3 22.7394GB/s 1.13697G items/s
sort_pairs<float, float>/manual_time 114 ms 114 ms 6 21.9102GB/s 2.73878G items/s
sort_pairs<float, double>/manual_time 190 ms 188 ms 4 19.7542GB/s 1.64618G items/s
sort_pairs<float, custom_float2>/manual_time 192 ms 190 ms 4 19.5257GB/s 1.62714G items/s
sort_pairs<float, custom_double2>/manual_time 257 ms 255 ms 3 24.3047GB/s 1.21524G items/s
sort_pairs<long long, float>/manual_time 404 ms 400 ms 2 9.28342GB/s 792.186M items/s
sort_pairs<long long, double>/manual_time 479 ms 475 ms 2 10.4415GB/s 668.254M items/s
sort_pairs<long long, custom_float2>/manual_time 473 ms 473 ms 2 10.5676GB/s 676.323M items/s
sort_pairs<long long, custom_double2>/manual_time 626 ms 623 ms 1 11.9751GB/s 510.938M items/s
sort_pairs<double, float>/manual_time 409 ms 2 ms 2 9.16793GB/s 782.33M items/s
sort_pairs<double, double>/manual_time 484 ms 2 ms 2 10.3371GB/s 661.576M items/s
sort_pairs<double, custom_float2>/manual_time 478 ms 2 ms 2 10.4629GB/s 669.627M items/s
sort_pairs<double, custom_double2>/manual_time 593 ms 593 ms 1 12.6523GB/s 539.833M items/s
Fiji [Radeon R9 FURY / NANO Series] 2.3 -DCMAKE_CXX_FLAGS=-DROCPRIM_TARGET_ARCH=803
sort_keys<int>/manual_time 147 ms 2 ms 4 8.47622GB/s 2.11906G items/s
sort_keys<long long>/manual_time 423 ms 4 ms 2 5.9156GB/s 757.197M items/s
sort_keys<float>/manual_time 148 ms 30 ms 5 8.44191GB/s 2.11048G items/s
sort_keys<double>/manual_time 443 ms 443 ms 2 5.63737GB/s 721.583M items/s
sort_keys<char>/manual_time 31 ms 31 ms 22 10.0914GB/s 10.0914G items/s
sort_keys<short>/manual_time 68 ms 68 ms 12 9.25364GB/s 4.62682G items/s
sort_pairs<int, float>/manual_time 206 ms 206 ms 3 12.1418GB/s 1.51773G items/s
sort_pairs<int, double>/manual_time 293 ms 293 ms 2 12.8072GB/s 1092.88M items/s
sort_pairs<int, custom_float2>/manual_time 294 ms 294 ms 2 12.7493GB/s 1087.94M items/s
sort_pairs<int, custom_double2>/manual_time 564 ms 564 ms 1 11.0839GB/s 567.494M items/s
sort_pairs<float, float>/manual_time 204 ms 204 ms 3 12.2681GB/s 1.53352G items/s
sort_pairs<float, double>/manual_time 290 ms 290 ms 2 12.9205GB/s 1102.55M items/s
sort_pairs<float, custom_float2>/manual_time 295 ms 295 ms 2 12.7127GB/s 1084.82M items/s
sort_pairs<float, custom_double2>/manual_time 499 ms 499 ms 2 12.5331GB/s 641.695M items/s
sort_pairs<long long, float>/manual_time 700 ms 700 ms 1 5.35667GB/s 457.103M items/s
sort_pairs<long long, double>/manual_time 766 ms 766 ms 1 6.52996GB/s 417.918M items/s
sort_pairs<long long, custom_float2>/manual_time 776 ms 1 ms 1 6.44308GB/s 412.357M items/s
sort_pairs<long long, custom_double2>/manual_time 1324 ms 2 ms 1 5.66297GB/s 241.62M items/s
sort_pairs<double, float>/manual_time 609 ms 3 ms 1 6.1557GB/s 525.286M items/s
sort_pairs<double, double>/manual_time 811 ms 2 ms 1 6.16156GB/s 394.34M items/s
sort_pairs<double, custom_float2>/manual_time 805 ms 3 ms 1 6.20775GB/s 397.296M items/s
sort_pairs<double, custom_double2>/manual_time 1223 ms 3 ms 1 6.13253GB/s 261.654M items/s
Vega 10 [Radeon Instinct MI25] 2.3
sort_keys<int>/manual_time 86 ms 2 ms 8 14.5287GB/s 3.63219G items/s
sort_keys<long long>/manual_time 301 ms 3 ms 2 8.30259GB/s 1062.73M items/s
sort_keys<float>/manual_time 85 ms 83 ms 8 14.7651GB/s 3.69127G items/s
sort_keys<double>/manual_time 300 ms 298 ms 2 8.33531GB/s 1066.92M items/s
sort_keys<char>/manual_time 19 ms 18 ms 36 16.7683GB/s 16.7683G items/s
sort_keys<short>/manual_time 39 ms 38 ms 18 16.0629GB/s 8.03145G items/s
sort_pairs<int, float>/manual_time 141 ms 140 ms 5 17.7015GB/s 2.21269G items/s
sort_pairs<int, double>/manual_time 194 ms 1 ms 4 19.2917GB/s 1.60764G items/s
sort_pairs<int, custom_float2>/manual_time 196 ms 1 ms 4 19.0876GB/s 1.59064G items/s
sort_pairs<int, custom_double2>/manual_time 340 ms 1 ms 2 18.3827GB/s 941.196M items/s
sort_pairs<float, float>/manual_time 134 ms 81 ms 5 18.6017GB/s 2.32521G items/s
sort_pairs<float, double>/manual_time 188 ms 186 ms 4 19.9576GB/s 1.66314G items/s
sort_pairs<float, custom_float2>/manual_time 190 ms 188 ms 4 19.7484GB/s 1.6457G items/s
sort_pairs<float, custom_double2>/manual_time 321 ms 319 ms 2 19.4688GB/s 996.804M items/s
sort_pairs<long long, float>/manual_time 460 ms 456 ms 2 8.15707GB/s 696.07M items/s
sort_pairs<long long, double>/manual_time 552 ms 549 ms 1 9.05005GB/s 579.203M items/s
sort_pairs<long long, custom_float2>/manual_time 549 ms 549 ms 1 9.10125GB/s 582.48M items/s
sort_pairs<long long, custom_double2>/manual_time 804 ms 801 ms 1 9.32271GB/s 397.769M items/s
sort_pairs<double, float>/manual_time 461 ms 457 ms 2 8.13677GB/s 694.338M items/s
sort_pairs<double, double>/manual_time 544 ms 540 ms 1 9.18701GB/s 587.969M items/s
sort_pairs<double, custom_float2>/manual_time 543 ms 539 ms 1 9.21393GB/s 589.692M items/s
sort_pairs<double, custom_double2>/manual_time 779 ms 775 ms 1 9.62865GB/s 410.822M items/s
Integer keys are uniformly random [std::numeric_limits<key_type>::min(), std::numeric_limits<key_type>::max()]
, floating point keys are uniformly random [-1000, 1000]
. This explains why benchmarks with float keys are a bit faster than ints: not all bits as random -> less scattered writes to global memory.
If you want you can change this behavior in the benchmark source to match your keys generation scheme.
So Vega 10 [Radeon Instinct MI25]
for 32M float-float shows 2.32521G items/s, 13.4 ms (the Time column is sum of 10 runs).
Also, configs (radix sizes, items per thread) may be non-optimal for some types on the current ROCm versions. So config tuning may increase performance slightly.
cool. Thanks Anton!