This project is an APE implementation on NVIDIA GPU using the cuBLAS backend.
APE is a method of emulating high-bitwidth computation with low-bitwidth data types.
For example, APE can use
- GEMM implementations using Tensor Cores with FP32-precision and various representation ranges.
- Auto-adapted algorithm selection that guarantees end-to-end correctness.
- INT16 GEMM implementation using Tensor Cores.
For more details, please see our paper.
mkdir build && cd build
cmake ..
make -j
APE provides a blas-like API, and users only need to include ape.h to use APE to accelerate FP32 applications directly.
void apeGemmFP32(APEHandler apeHandle, ApeTrans transa, ApeTrans transb, int m, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc, const ApeAlgo algo = APE_ALGO_AUTO);
FP32 GEMM supports
-
APE_ALGO_AUTO: Select the fastest algorithm without overflow.
-
APE_ALGO_AUTO_STRICT: Select the fastest algorithm without overflow and underflow.
-
APE_ALGO_FP32F: Use FP16 emulated FP32. (1-bit precision loss, narrow representation range, overflow may occur.)
-
APE_ALGO_FP32B: Use BF16 emulated FP32. (no precision loss, large representation range, overflow does not occur.)
-
APE_ALGO_FP32T: Use TF32 emulated FP32. (1-bit precision loss, large representation range, overflow does not occur.)
void apeGemmINT16(APEHandler apeHandle, ApeTrans transa, ApeTrans transb, int m, int n, int k, const int16_t *alpha, const int16_t *A, int lda, const int16_t *B, int ldb, const int32_t *beta, int32_t *C, int ldc, ApeAlgo algo = APE_ALGO_AUTO);
INT16 GEMM supports
-
APE_ALGO_AUTO: Select the algorithm without overflow.
-
APE_ALGO_INT16: Use INT8 emulate INT16. (The upper bound is
$32639$ . Native INT16's is$32767$ . Overflow may occur.)
Ma, Zixuan, et al. "Efficiently emulating high-bitwidth computation with low-bitwidth hardware." Proceedings of the 36th ACM International Conference on Supercomputing. 2022.
If you find this work useful in your research, please cite it using the following BibTeX:
@inproceedings{ma2022efficiently,
author = {Ma, Zixuan and Wang, Haojie and Feng, Guanyu and Zhang, Chen and Xie, Lei and He, Jiaao and Chen, Shengqi and Zhai, Jidong},
title = {Efficiently Emulating High-Bitwidth Computation with Low-Bitwidth Hardware},
year = {2022},
isbn = {9781450392815},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3524059.3532377},
doi = {10.1145/3524059.3532377},
booktitle = {Proceedings of the 36th ACM International Conference on Supercomputing},
articleno = {5},
numpages = {12},
keywords = {emulation, tensor core, domain specific accelerator},
location = {Virtual Event},
series = {ICS '22}
}