Status: Archive (code is provided as-is, no updates expected)
Open single- and half-precision GEMM implementations. The main speedups over cuBLAS are at small minibatch sizes and with fp16 data.
The demonstration code currently depends on Nervana neon:

    git clone git@github.com:NervanaSystems/neon.git
    cd neon
    make
    . .venv/bin/activate

Clone and run this repo:

    git clone git@github.com:openai/openai-gemm.git

Run the benchmark:

    ./benchmark.py

Run the unit test:

    ./test.py

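The tables below report results for the GEMM sizes from DeepBench ( https://github.com/baidu-research/DeepBench ). Each ratio column is the OpenAI figure divided by the corresponding cuBLAS figure.
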
M | N | K | Op | OpenAI_32 | cuBLAS_32 | ratio_32 | OpenAI_16 | cuBLAS_16 | ratio_16 |
---|---|---|---|---|---|---|---|---|---|
16 | 1760 | 1760 | NN | 2557 | 2195 | 1.2 | 3507 | 346 | 10.1 |
32 | 1760 | 1760 | NN | 5010 | 1128 | 4.4 | 6814 | 526 | 13.0 |
64 | 1760 | 1760 | NN | 6486 | 4112 | 1.6 | 8235 | 2801 | 2.9 |
128 | 1760 | 1760 | NN | 7068 | 6931 | 1.0 | 9400 | 5307 | 1.8 |
7000 | 1760 | 1760 | NN | 9968 | 9584 | 1.0 | 10515 | 9807 | 1.1 |
16 | 2048 | 2048 | NN | 2569 | 1516 | 1.7 | 3619 | 242 | 15.0 |
32 | 2048 | 2048 | NN | 5034 | 1356 | 3.7 | 6576 | 606 | 10.8 |
64 | 2048 | 2048 | NN | 6636 | 2815 | 2.4 | 8285 | 3241 | 2.6 |
128 | 2048 | 2048 | NN | 7316 | 6373 | 1.1 | 9066 | 5334 | 1.7 |
7000 | 2048 | 2048 | NN | 10081 | 9900 | 1.0 | 11275 | 9948 | 1.1 |
16 | 2560 | 2560 | NN | 2718 | 1312 | 2.1 | 4312 | 251 | 17.2 |
32 | 2560 | 2560 | NN | 5370 | 1660 | 3.2 | 7525 | 749 | 10.0 |
64 | 2560 | 2560 | NN | 7331 | 2687 | 2.7 | 8436 | 951 | 8.9 |
128 | 2560 | 2560 | NN | 8007 | 5238 | 1.5 | 9277 | 6123 | 1.5 |
7000 | 2560 | 2560 | NN | 10282 | 10131 | 1.0 | 11027 | 9974 | 1.1 |
16 | 4096 | 4096 | NN | 2695 | 1110 | 2.4 | 4442 | 266 | 16.7 |
32 | 4096 | 4096 | NN | 5266 | 2264 | 2.3 | 7723 | 758 | 10.2 |
64 | 4096 | 4096 | NN | 6942 | 3922 | 1.8 | 8904 | 1055 | 8.4 |
128 | 4096 | 4096 | NN | 8127 | 5686 | 1.4 | 9711 | 5681 | 1.7 |
7000 | 4096 | 4096 | NN | 10462 | 10082 | 1.0 | 11152 | 9991 | 1.1 |
16 | 1760 | 1760 | NT | 1719 | 1095 | 1.6 | 2692 | 290 | 9.3 |
32 | 1760 | 1760 | NT | 3316 | 1312 | 2.5 | 5068 | 447 | 11.3 |
64 | 1760 | 1760 | NT | 5247 | 1955 | 2.7 | 7621 | 1797 | 4.2 |
128 | 1760 | 1760 | NT | 6720 | 3393 | 2.0 | 8886 | 3342 | 2.7 |
7000 | 1760 | 1760 | NT | 9341 | 8513 | 1.1 | 10085 | 9635 | 1.0 |
16 | 2048 | 2048 | NT | 2442 | 1231 | 2.0 | 3641 | 299 | 12.2 |
32 | 2048 | 2048 | NT | 4801 | 1251 | 3.8 | 5849 | 468 | 12.5 |
64 | 2048 | 2048 | NT | 6317 | 1967 | 3.2 | 7825 | 3128 | 2.5 |
128 | 2048 | 2048 | NT | 7176 | 5041 | 1.4 | 8616 | 4843 | 1.8 |
7000 | 2048 | 2048 | NT | 9975 | 9173 | 1.1 | 10741 | 9560 | 1.1 |
16 | 2560 | 2560 | NT | 1834 | 1208 | 1.5 | 3154 | 297 | 10.6 |
32 | 2560 | 2560 | NT | 3610 | 1436 | 2.5 | 5418 | 584 | 9.3 |
64 | 2560 | 2560 | NT | 6083 | 2815 | 2.2 | 8331 | 1042 | 8.0 |
128 | 2560 | 2560 | NT | 7702 | 3246 | 2.4 | 8857 | 5259 | 1.7 |
7000 | 2560 | 2560 | NT | 9257 | 7829 | 1.2 | 10659 | 9548 | 1.1 |
16 | 4096 | 4096 | NT | 2546 | 1297 | 2.0 | 4164 | 309 | 13.5 |
32 | 4096 | 4096 | NT | 4992 | 2290 | 2.2 | 8156 | 775 | 10.5 |
64 | 4096 | 4096 | NT | 6746 | 4157 | 1.6 | 8429 | 1381 | 6.1 |
128 | 4096 | 4096 | NT | 7843 | 5425 | 1.4 | 9298 | 5527 | 1.7 |
7000 | 4096 | 4096 | NT | 9925 | 6879 | 1.4 | 10630 | 9784 | 1.1 |
7133 | 1760 | 1760 | TN | 9752 | 10186 | 1.0 | 10517 | 8912 | 1.2 |
7133 | 2048 | 2048 | TN | 10485 | 10319 | 1.0 | 10674 | 9608 | 1.1 |
7133 | 2560 | 2560 | TN | 10743 | 11057 | 1.0 | 11195 | 10059 | 1.1 |
7133 | 4096 | 4096 | TN | 10384 | 10290 | 1.0 | 10980 | 10558 | 1.0 |
9124 | 5124 | 1760 | NN | 9920 | 9480 | 1.0 | 10580 | 9743 | 1.1 |
9124 | 5124 | 2048 | NN | 10008 | 9415 | 1.1 | 10602 | 9796 | 1.1 |
9124 | 5124 | 2560 | NN | 9925 | 9426 | 1.1 | 10586 | 9850 | 1.1 |
9124 | 5124 | 4096 | NN | 9982 | 9489 | 1.1 | 10580 | 9472 | 1.1 |
9124 | 5124 | 1760 | NT | 9093 | 3497 | 2.6 | 9302 | 8692 | 1.1 |
9124 | 5124 | 2048 | NT | 9506 | 6512 | 1.5 | 9506 | 8883 | 1.1 |
9124 | 5124 | 2560 | NT | 8704 | 3364 | 2.6 | 9855 | 7733 | 1.3 |
9124 | 5124 | 4096 | NT | 9733 | 6109 | 1.6 | 10278 | 8760 | 1.2 |
8457 | 35 | 1760 | NN | 3343 | 1020 | 3.3 | 3841 | 736 | 5.2 |
8457 | 35 | 2048 | NN | 3419 | 1996 | 1.7 | 4782 | 803 | 6.0 |
8457 | 35 | 2560 | NN | 3415 | 1072 | 3.2 | 3868 | 789 | 4.9 |
8457 | 35 | 4096 | NN | 3743 | 2009 | 1.9 | 4741 | 804 | 5.9 |
8457 | 35 | 1760 | NT | 3574 | 1970 | 1.8 | 4176 | 1243 | 3.4 |
8457 | 35 | 2048 | NT | 4564 | 3069 | 1.5 | 4818 | 1255 | 3.8 |
8457 | 35 | 2560 | NT | 3598 | 2062 | 1.7 | 3597 | 1135 | 3.2 |
8457 | 35 | 4096 | NT | 4311 | 2990 | 1.4 | 4927 | 1303 | 3.8 |
16 | 7680 | 2560 | NN | 2683 | 718 | 3.7 | 4449 | 289 | 15.4 |
32 | 7680 | 2560 | NN | 5304 | 3660 | 1.4 | 7837 | 979 | 8.0 |
64 | 7680 | 2560 | NN | 7311 | 4979 | 1.5 | 9310 | 1274 | 7.3 |
128 | 7680 | 2560 | NN | 7931 | 6109 | 1.3 | 9390 | 6591 | 1.4 |
16 | 7680 | 2560 | NT | 1885 | 1191 | 1.6 | 3401 | 290 | 11.7 |
32 | 7680 | 2560 | NT | 3731 | 1808 | 2.1 | 6373 | 1004 | 6.3 |
64 | 7680 | 2560 | NT | 6274 | 3509 | 1.8 | 8809 | 1655 | 5.3 |
128 | 7680 | 2560 | NT | 7957 | 2988 | 2.7 | 9246 | 4695 | 2.0 |
16 | 3072 | 1024 | NN | 2277 | 1295 | 1.8 | 3373 | 282 | 12.0 |
32 | 3072 | 1024 | NN | 4494 | 1798 | 2.5 | 6011 | 807 | 7.4 |
64 | 3072 | 1024 | NN | 6272 | 3046 | 2.1 | 6790 | 917 | 7.4 |
128 | 3072 | 1024 | NN | 7364 | 5436 | 1.4 | 7768 | 5749 | 1.4 |
16 | 3072 | 1024 | NT | 2285 | 1077 | 2.1 | 3439 | 244 | 14.1 |
32 | 3072 | 1024 | NT | 4597 | 1540 | 3.0 | 5645 | 677 | 8.3 |
64 | 3072 | 1024 | NT | 6392 | 2969 | 2.2 | 7555 | 1204 | 6.3 |
128 | 3072 | 1024 | NT | 7460 | 5058 | 1.5 | 8586 | 5535 | 1.6 |
7435 | 3072 | 1024 | TN | 9829 | 8804 | 1.1 | 10123 | 9365 | 1.1 |
5481 | 7680 | 2560 | TN | 9448 | 9309 | 1.0 | 9466 | 9394 | 1.0 |

Note that the OpenAI kernels do not yet implement fp16x2 instructions. Even so, the current cuBLAS hgemm implementation appears to perform well only at large dimensions. There are also accuracy considerations when accumulating large reductions in fp16.

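
The accuracy concern with fp16 accumulation can be illustrated with a small pure-Python emulation. This is only a sketch and is not code from this repo: the helpers `round_to` and `dot` are hypothetical names, and the rounding is emulated with the standard `struct` module rather than run on a GPU. It sums K = 4096 products of 1.0 * 1.0, once with the running sum rounded to fp16 after every step and once with it rounded to fp32.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def dot(a, b, acc_fmt):
    """Dot product of fp16 inputs, with the running sum rounded to acc_fmt."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two fp16 values is exact in a Python float (double).
        prod = round_to("e", x) * round_to("e", y)
        # Round once per step, as a fused multiply-add would.
        acc = round_to(acc_fmt, acc + prod)
    return acc

K = 4096  # reduction length, matching the largest K in the tables
a = [1.0] * K
b = [1.0] * K

fp16_sum = dot(a, b, "e")  # accumulator kept in fp16
fp32_sum = dot(a, b, "f")  # accumulator kept in fp32

print(fp16_sum, fp32_sum)  # 2048.0 4096.0
```

The fp16 accumulator stalls at 2048 because fp16 has an 11-bit significand: above 2048 the spacing between representable values is 2, so adding 1.0 rounds back to the same value. Real inputs are rarely this adversarial, but the example shows why long reductions are usually accumulated in a wider format.
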
M | N | K | Op | OpenAI_32 | cuBLAS_32 | ratio_32 | OpenAI_16 | cuBLAS_16 | ratio_16 |
---|---|---|---|---|---|---|---|---|---|
16 | 1760 | 1760 | NN | 2595 | 2048 | 1.3 | 2935 | 463 | 6.3 |
32 | 1760 | 1760 | NN | 4963 | 864 | 5.7 | 5766 | 895 | 6.4 |
64 | 1760 | 1760 | NN | 7565 | 3909 | 1.9 | 7760 | 1711 | 4.5 |
128 | 1760 | 1760 | NN | 8140 | 6053 | 1.3 | 8422 | 4089 | 2.1 |
7000 | 1760 | 1760 | NN | 9653 | 8722 | 1.1 | 9617 | 16143 | 0.6 |
16 | 2048 | 2048 | NN | 2255 | 1746 | 1.3 | 3211 | 546 | 5.9 |
32 | 2048 | 2048 | NN | 4467 | 1012 | 4.4 | 4533 | 1019 | 4.4 |
64 | 2048 | 2048 | NN | 6618 | 4198 | 1.6 | 6591 | 2018 | 3.3 |
128 | 2048 | 2048 | NN | 8059 | 5921 | 1.4 | 7936 | 4667 | 1.7 |
7000 | 2048 | 2048 | NN | 9761 | 9346 | 1.0 | 9910 | 18715 | 0.5 |
16 | 2560 | 2560 | NN | 2883 | 2108 | 1.4 | 4210 | 685 | 6.1 |
32 | 2560 | 2560 | NN | 5701 | 1279 | 4.5 | 5820 | 1297 | 4.5 |
64 | 2560 | 2560 | NN | 8100 | 6054 | 1.3 | 8099 | 2558 | 3.2 |
128 | 2560 | 2560 | NN | 8308 | 6799 | 1.2 | 8790 | 5901 | 1.5 |
7000 | 2560 | 2560 | NN | 9740 | 9538 | 1.0 | 9845 | 18499 | 0.5 |
16 | 4096 | 4096 | NN | 3449 | 1342 | 2.6 | 4299 | 1069 | 4.0 |
32 | 4096 | 4096 | NN | 6863 | 2045 | 3.4 | 6907 | 2103 | 3.3 |
64 | 4096 | 4096 | NN | 8404 | 4059 | 2.1 | 8248 | 4183 | 2.0 |
128 | 4096 | 4096 | NN | 8224 | 8039 | 1.0 | 8853 | 8669 | 1.0 |
7000 | 4096 | 4096 | NN | 9818 | 9519 | 1.0 | 10011 | 18588 | 0.5 |
16 | 1760 | 1760 | NT | 2579 | 1324 | 1.9 | 2763 | 428 | 6.4 |
32 | 1760 | 1760 | NT | 5089 | 878 | 5.8 | 5382 | 857 | 6.3 |
64 | 1760 | 1760 | NT | 7501 | 3017 | 2.5 | 7695 | 1695 | 4.5 |
128 | 1760 | 1760 | NT | 8043 | 5494 | 1.5 | 8192 | 3426 | 2.4 |
7000 | 1760 | 1760 | NT | 9477 | 7571 | 1.3 | 9355 | 16113 | 0.6 |
16 | 2048 | 2048 | NT | 2267 | 1276 | 1.8 | 3171 | 504 | 6.3 |
32 | 2048 | 2048 | NT | 4484 | 1026 | 4.4 | 4489 | 1009 | 4.4 |
64 | 2048 | 2048 | NT | 6567 | 3986 | 1.6 | 6551 | 2018 | 3.2 |
128 | 2048 | 2048 | NT | 8019 | 5825 | 1.4 | 7968 | 4496 | 1.8 |
7000 | 2048 | 2048 | NT | 9625 | 9373 | 1.0 | 9713 | 17878 | 0.5 |
16 | 2560 | 2560 | NT | 2870 | 1460 | 2.0 | 4256 | 638 | 6.7 |
32 | 2560 | 2560 | NT | 5614 | 1299 | 4.3 | 5705 | 1271 | 4.5 |
64 | 2560 | 2560 | NT | 8014 | 4402 | 1.8 | 8085 | 2521 | 3.2 |
128 | 2560 | 2560 | NT | 8219 | 5640 | 1.5 | 8240 | 5137 | 1.6 |
7000 | 2560 | 2560 | NT | 9534 | 9091 | 1.0 | 9735 | 18025 | 0.5 |
16 | 4096 | 4096 | NT | 3366 | 1547 | 2.2 | 4354 | 1047 | 4.2 |
32 | 4096 | 4096 | NT | 6714 | 2055 | 3.3 | 6859 | 2093 | 3.3 |
64 | 4096 | 4096 | NT | 8297 | 3445 | 2.4 | 8289 | 4178 | 2.0 |
128 | 4096 | 4096 | NT | 8335 | 7450 | 1.1 | 7911 | 7973 | 1.0 |
7000 | 4096 | 4096 | NT | 9578 | 9214 | 1.0 | 9877 | 18073 | 0.5 |
7133 | 1760 | 1760 | TN | 9704 | 9267 | 1.0 | 9506 | 15605 | 0.6 |
7133 | 2048 | 2048 | TN | 9747 | 9836 | 1.0 | 10012 | 19110 | 0.5 |
7133 | 2560 | 2560 | TN | 9742 | 9748 | 1.0 | 9805 | 19107 | 0.5 |
7133 | 4096 | 4096 | TN | 9807 | 9733 | 1.0 | 10122 | 19559 | 0.5 |
9124 | 5124 | 1760 | NN | 9326 | 9076 | 1.0 | 9631 | 17496 | 0.6 |
9124 | 5124 | 2048 | NN | 9414 | 9054 | 1.0 | 9602 | 17523 | 0.5 |
9124 | 5124 | 2560 | NN | 9353 | 9041 | 1.0 | 9698 | 17380 | 0.6 |
9124 | 5124 | 4096 | NN | 9370 | 9051 | 1.0 | 9689 | 17617 | 0.5 |
9124 | 5124 | 1760 | NT | 9124 | 8746 | 1.0 | 9524 | 16777 | 0.6 |
9124 | 5124 | 2048 | NT | 9294 | 8817 | 1.1 | 9641 | 16935 | 0.6 |
9124 | 5124 | 2560 | NT | 9221 | 8499 | 1.1 | 9637 | 16820 | 0.6 |
9124 | 5124 | 4096 | NT | 9270 | 8961 | 1.0 | 9568 | 17080 | 0.6 |
8457 | 35 | 1760 | NN | 3301 | 2233 | 1.5 | 4505 | 3154 | 1.4 |
8457 | 35 | 2048 | NN | 3265 | 3066 | 1.1 | 4501 | 3335 | 1.3 |
8457 | 35 | 2560 | NN | 3127 | 2300 | 1.4 | 4516 | 3135 | 1.4 |
8457 | 35 | 4096 | NN | 3257 | 3272 | 1.0 | 4729 | 3485 | 1.4 |
8457 | 35 | 1760 | NT | 4563 | 3142 | 1.5 | 4612 | 2998 | 1.5 |
8457 | 35 | 2048 | NT | 4554 | 3202 | 1.4 | 4601 | 3109 | 1.5 |
8457 | 35 | 2560 | NT | 4567 | 3144 | 1.5 | 4654 | 3039 | 1.5 |
8457 | 35 | 4096 | NT | 4353 | 3415 | 1.3 | 4457 | 3257 | 1.4 |
16 | 7680 | 2560 | NN | 3668 | 1200 | 3.1 | 5020 | 1236 | 4.1 |
32 | 7680 | 2560 | NN | 7245 | 3385 | 2.1 | 7519 | 2465 | 3.1 |
64 | 7680 | 2560 | NN | 8440 | 5210 | 1.6 | 8349 | 4910 | 1.7 |
128 | 7680 | 2560 | NN | 8765 | 4872 | 1.8 | 9131 | 11349 | 0.8 |
16 | 7680 | 2560 | NT | 3229 | 1515 | 2.1 | 5032 | 1157 | 4.3 |
32 | 7680 | 2560 | NT | 6640 | 2721 | 2.4 | 6810 | 2307 | 3.0 |
64 | 7680 | 2560 | NT | 8282 | 5113 | 1.6 | 8362 | 4494 | 1.9 |
128 | 7680 | 2560 | NT | 8763 | 4646 | 1.9 | 8617 | 9159 | 0.9 |
16 | 3072 | 1024 | NN | 2929 | 1717 | 1.7 | 3335 | 750 | 4.4 |
32 | 3072 | 1024 | NN | 5801 | 1399 | 4.1 | 6116 | 1420 | 4.3 |
64 | 3072 | 1024 | NN | 6958 | 4340 | 1.6 | 6923 | 2814 | 2.5 |
128 | 3072 | 1024 | NN | 8047 | 6492 | 1.2 | 7769 | 6302 | 1.2 |
16 | 3072 | 1024 | NT | 2990 | 1068 | 2.8 | 3384 | 705 | 4.8 |
32 | 3072 | 1024 | NT | 5834 | 1429 | 4.1 | 6021 | 1411 | 4.3 |
64 | 3072 | 1024 | NT | 6921 | 3500 | 2.0 | 6893 | 2819 | 2.4 |
128 | 3072 | 1024 | NT | 7918 | 6034 | 1.3 | 7876 | 5760 | 1.4 |
7435 | 3072 | 1024 | TN | 9367 | 9391 | 1.0 | 9559 | 17234 | 0.6 |
5481 | 7680 | 2560 | TN | 9672 | 9520 | 1.0 | 9967 | 18832 | 0.5 |
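
The ratio columns can be reproduced from the throughput columns. The sketch below is not part of this repo; `speedup` and `gemm_gflops` are hypothetical helper names. It also assumes that the throughput columns are in GFLOPS and that a GEMM is counted as 2·M·N·K floating-point operations, which is the usual convention but is not stated above.

```python
def speedup(openai, cublas):
    """Ratio of OpenAI to cuBLAS throughput, rounded as in the tables."""
    return round(openai / cublas, 1)

def gemm_gflops(m, n, k, seconds):
    """Throughput of an (M x K) by (K x N) GEMM, assuming 2*M*N*K flops."""
    return 2.0 * m * n * k / seconds / 1e9

# First row of the first table: M=16, N=1760, K=1760, NN.
ratio_32 = speedup(2557, 2195)  # fp32 columns
ratio_16 = speedup(3507, 346)   # fp16 columns
print(ratio_32, ratio_16)       # 1.2 10.1

# A hypothetical timing: the same shape finishing in 1 ms.
print(gemm_gflops(16, 1760, 1760, 1e-3))
```

A ratio above 1.0 means the OpenAI kernel was faster for that shape, and a ratio below 1.0 means cuBLAS was faster.
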