Open-source single- and half-precision GEMM implementations. The main speedups over cuBLAS come at small minibatch sizes and in the fp16 data format.

The demonstration code currently depends on Nervana neon:

    git clone git@github.com:NervanaSystems/neon.git
    cd neon
    make
    . .venv/bin/activate

Clone this repo:

    git clone git@github.com:openai/openai-gemm.git
    cd openai-gemm

Run the benchmark:

    ./benchmark.py

Run the unit test:

    ./test.py

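Benchmark results on the GEMM sizes from DeepBench ( https://github.com/baidu-research/DeepBench ). The `Op` column gives the transpose setting of the two input matrices (N = not transposed, T = transposed). The `_32` and `_16` suffixes denote fp32 and fp16, and each ratio column is the OpenAI figure divided by the cuBLAS figure.
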
M | N | K | Op | OpenAI_32 | cuBLAS_32 | ratio_32 | OpenAI_16 | cuBLAS_16 | ratio_16 |
---|---|---|---|---|---|---|---|---|---|
16 | 1760 | 1760 | NN | 2557 | 2195 | 1.2 | 3507 | 346 | 10.1 |
32 | 1760 | 1760 | NN | 5010 | 1128 | 4.4 | 6814 | 526 | 13.0 |
64 | 1760 | 1760 | NN | 6486 | 4112 | 1.6 | 8235 | 2801 | 2.9 |
128 | 1760 | 1760 | NN | 7068 | 6931 | 1.0 | 9400 | 5307 | 1.8 |
7000 | 1760 | 1760 | NN | 9968 | 9584 | 1.0 | 10515 | 9807 | 1.1 |
16 | 2048 | 2048 | NN | 2569 | 1516 | 1.7 | 3619 | 242 | 15.0 |
32 | 2048 | 2048 | NN | 5034 | 1356 | 3.7 | 6576 | 606 | 10.8 |
64 | 2048 | 2048 | NN | 6636 | 2815 | 2.4 | 8285 | 3241 | 2.6 |
128 | 2048 | 2048 | NN | 7316 | 6373 | 1.1 | 9066 | 5334 | 1.7 |
7000 | 2048 | 2048 | NN | 10081 | 9900 | 1.0 | 11275 | 9948 | 1.1 |
16 | 2560 | 2560 | NN | 2718 | 1312 | 2.1 | 4312 | 251 | 17.2 |
32 | 2560 | 2560 | NN | 5370 | 1660 | 3.2 | 7525 | 749 | 10.0 |
64 | 2560 | 2560 | NN | 7331 | 2687 | 2.7 | 8436 | 951 | 8.9 |
128 | 2560 | 2560 | NN | 8007 | 5238 | 1.5 | 9277 | 6123 | 1.5 |
7000 | 2560 | 2560 | NN | 10282 | 10131 | 1.0 | 11027 | 9974 | 1.1 |
16 | 4096 | 4096 | NN | 2695 | 1110 | 2.4 | 4442 | 266 | 16.7 |
32 | 4096 | 4096 | NN | 5266 | 2264 | 2.3 | 7723 | 758 | 10.2 |
64 | 4096 | 4096 | NN | 6942 | 3922 | 1.8 | 8904 | 1055 | 8.4 |
128 | 4096 | 4096 | NN | 8127 | 5686 | 1.4 | 9711 | 5681 | 1.7 |
7000 | 4096 | 4096 | NN | 10462 | 10082 | 1.0 | 11152 | 9991 | 1.1 |
16 | 1760 | 1760 | NT | 1719 | 1095 | 1.6 | 2692 | 290 | 9.3 |
32 | 1760 | 1760 | NT | 3316 | 1312 | 2.5 | 5068 | 447 | 11.3 |
64 | 1760 | 1760 | NT | 5247 | 1955 | 2.7 | 7621 | 1797 | 4.2 |
128 | 1760 | 1760 | NT | 6720 | 3393 | 2.0 | 8886 | 3342 | 2.7 |
7000 | 1760 | 1760 | NT | 9341 | 8513 | 1.1 | 10085 | 9635 | 1.0 |
16 | 2048 | 2048 | NT | 2442 | 1231 | 2.0 | 3641 | 299 | 12.2 |
32 | 2048 | 2048 | NT | 4801 | 1251 | 3.8 | 5849 | 468 | 12.5 |
64 | 2048 | 2048 | NT | 6317 | 1967 | 3.2 | 7825 | 3128 | 2.5 |
128 | 2048 | 2048 | NT | 7176 | 5041 | 1.4 | 8616 | 4843 | 1.8 |
7000 | 2048 | 2048 | NT | 9975 | 9173 | 1.1 | 10741 | 9560 | 1.1 |
16 | 2560 | 2560 | NT | 1834 | 1208 | 1.5 | 3154 | 297 | 10.6 |
32 | 2560 | 2560 | NT | 3610 | 1436 | 2.5 | 5418 | 584 | 9.3 |
64 | 2560 | 2560 | NT | 6083 | 2815 | 2.2 | 8331 | 1042 | 8.0 |
128 | 2560 | 2560 | NT | 7702 | 3246 | 2.4 | 8857 | 5259 | 1.7 |
7000 | 2560 | 2560 | NT | 9257 | 7829 | 1.2 | 10659 | 9548 | 1.1 |
16 | 4096 | 4096 | NT | 2546 | 1297 | 2.0 | 4164 | 309 | 13.5 |
32 | 4096 | 4096 | NT | 4992 | 2290 | 2.2 | 8156 | 775 | 10.5 |
64 | 4096 | 4096 | NT | 6746 | 4157 | 1.6 | 8429 | 1381 | 6.1 |
128 | 4096 | 4096 | NT | 7843 | 5425 | 1.4 | 9298 | 5527 | 1.7 |
7000 | 4096 | 4096 | NT | 9925 | 6879 | 1.4 | 10630 | 9784 | 1.1 |
7133 | 1760 | 1760 | TN | 9752 | 10186 | 1.0 | 10517 | 8912 | 1.2 |
7133 | 2048 | 2048 | TN | 10485 | 10319 | 1.0 | 10674 | 9608 | 1.1 |
7133 | 2560 | 2560 | TN | 10743 | 11057 | 1.0 | 11195 | 10059 | 1.1 |
7133 | 4096 | 4096 | TN | 10384 | 10290 | 1.0 | 10980 | 10558 | 1.0 |
9124 | 5124 | 1760 | NN | 9920 | 9480 | 1.0 | 10580 | 9743 | 1.1 |
9124 | 5124 | 2048 | NN | 10008 | 9415 | 1.1 | 10602 | 9796 | 1.1 |
9124 | 5124 | 2560 | NN | 9925 | 9426 | 1.1 | 10586 | 9850 | 1.1 |
9124 | 5124 | 4096 | NN | 9982 | 9489 | 1.1 | 10580 | 9472 | 1.1 |
9124 | 5124 | 1760 | NT | 9093 | 3497 | 2.6 | 9302 | 8692 | 1.1 |
9124 | 5124 | 2048 | NT | 9506 | 6512 | 1.5 | 9506 | 8883 | 1.1 |
9124 | 5124 | 2560 | NT | 8704 | 3364 | 2.6 | 9855 | 7733 | 1.3 |
9124 | 5124 | 4096 | NT | 9733 | 6109 | 1.6 | 10278 | 8760 | 1.2 |
8457 | 35 | 1760 | NN | 3343 | 1020 | 3.3 | 3841 | 736 | 5.2 |
8457 | 35 | 2048 | NN | 3419 | 1996 | 1.7 | 4782 | 803 | 6.0 |
8457 | 35 | 2560 | NN | 3415 | 1072 | 3.2 | 3868 | 789 | 4.9 |
8457 | 35 | 4096 | NN | 3743 | 2009 | 1.9 | 4741 | 804 | 5.9 |
8457 | 35 | 1760 | NT | 3574 | 1970 | 1.8 | 4176 | 1243 | 3.4 |
8457 | 35 | 2048 | NT | 4564 | 3069 | 1.5 | 4818 | 1255 | 3.8 |
8457 | 35 | 2560 | NT | 3598 | 2062 | 1.7 | 3597 | 1135 | 3.2 |
8457 | 35 | 4096 | NT | 4311 | 2990 | 1.4 | 4927 | 1303 | 3.8 |
16 | 7680 | 2560 | NN | 2683 | 718 | 3.7 | 4449 | 289 | 15.4 |
32 | 7680 | 2560 | NN | 5304 | 3660 | 1.4 | 7837 | 979 | 8.0 |
64 | 7680 | 2560 | NN | 7311 | 4979 | 1.5 | 9310 | 1274 | 7.3 |
128 | 7680 | 2560 | NN | 7931 | 6109 | 1.3 | 9390 | 6591 | 1.4 |
16 | 7680 | 2560 | NT | 1885 | 1191 | 1.6 | 3401 | 290 | 11.7 |
32 | 7680 | 2560 | NT | 3731 | 1808 | 2.1 | 6373 | 1004 | 6.3 |
64 | 7680 | 2560 | NT | 6274 | 3509 | 1.8 | 8809 | 1655 | 5.3 |
128 | 7680 | 2560 | NT | 7957 | 2988 | 2.7 | 9246 | 4695 | 2.0 |
16 | 3072 | 1024 | NN | 2277 | 1295 | 1.8 | 3373 | 282 | 12.0 |
32 | 3072 | 1024 | NN | 4494 | 1798 | 2.5 | 6011 | 807 | 7.4 |
64 | 3072 | 1024 | NN | 6272 | 3046 | 2.1 | 6790 | 917 | 7.4 |
128 | 3072 | 1024 | NN | 7364 | 5436 | 1.4 | 7768 | 5749 | 1.4 |
16 | 3072 | 1024 | NT | 2285 | 1077 | 2.1 | 3439 | 244 | 14.1 |
32 | 3072 | 1024 | NT | 4597 | 1540 | 3.0 | 5645 | 677 | 8.3 |
64 | 3072 | 1024 | NT | 6392 | 2969 | 2.2 | 7555 | 1204 | 6.3 |
128 | 3072 | 1024 | NT | 7460 | 5058 | 1.5 | 8586 | 5535 | 1.6 |
7435 | 3072 | 1024 | TN | 9829 | 8804 | 1.1 | 10123 | 9365 | 1.1 |
5481 | 7680 | 2560 | TN | 9448 | 9309 | 1.0 | 9466 | 9394 | 1.0 |
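
To make the table easier to read, here is a minimal sketch of how the ratio columns relate to the throughput columns, and of what one throughput figure implies for kernel time. It assumes the throughput columns are in GFLOPS and that a GEMM is counted as 2·M·N·K floating-point operations. Neither is stated in the table itself, and the helper names are hypothetical rather than taken from `benchmark.py`.

```python
def gemm_flops(m, n, k):
    """FLOP count for one M x N x K GEMM: one multiply and one add per term.

    Assumption: the conventional 2*M*N*K count, not confirmed by the source.
    """
    return 2 * m * n * k


def speedup(openai, cublas):
    """Ratio column: OpenAI throughput divided by cuBLAS throughput, to one decimal."""
    return round(openai / cublas, 1)


def estimated_msecs(m, n, k, gflops):
    """Kernel time in milliseconds implied by a throughput figure.

    Assumption: `gflops` is in units of 1e9 floating-point operations per second.
    """
    return gemm_flops(m, n, k) / (gflops * 1e9) * 1e3


if __name__ == "__main__":
    # First table row: 16 | 1760 | 1760 | NN | 2557 | 2195 | 1.2 | 3507 | 346 | 10.1
    print(speedup(2557, 2195))  # 1.2  (ratio_32)
    print(speedup(3507, 346))   # 10.1 (ratio_16)
    print(estimated_msecs(16, 1760, 1760, 2557))
```

Under these assumptions, the small-M rows correspond to kernels that finish in tens of microseconds, which is the regime where the table shows the largest gap between the two libraries.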