GemmEx with float16 runs very fast but with wrong results
Closed this issue · 12 comments
What is the expected behavior
- The performance (e.g. running time) of GEMM in float32 and float16 should differ in a reasonable way, and the result matrices should be close.
What actually happens
- The run time of GemmEx in fp16 or Hgemm is too short, e.g. 0.02 ms.
- The result of GemmEx in fp16 or Hgemm differs too much from the reference (the code doing the check is inside the file too).
How to reproduce
- Code is below
-
compile command: hipcc fp16gemm.hip -o fp16_gemm -lhipblas -L/opt/rocm-4.3.0/hipblas/lib
-
execute: ./fp16_gemm
-
Output is like this:
The error analysis may have some problems (I am still working on it), but the run time of the fp16 GEMM or Hgemm is too short to be real.
- Matrix data: generated with gen_data.py under Python 3.8.
Environment
Hardware | Description
---|---
GPU | MI100
CPU | AMD EPYC 7302 16-Core Processor

Software | Version
---|---
ROCm | v4.3.0
HipCC | v4.3
RocBlas | v4.3
Also, is there any other software I need to install? I had rocm-4.3.0 installed before, but not rocblas-4.3.0, so I compiled this code after installing rocblas-4.3.0 and rocsolver-4.3.0. Is there any package I am still missing that could cause such a problem?
Here is a simple reproduction (the headers included are the same as those in code.zip):
#define HIP_ASSERT(x) (assert((x)==hipSuccess))
#define HIP_CHECK(x) \
if(x != hipSuccess) \
{ \
std::cout<<"Wrong HIP"<<std::endl; \
exit(1); \
}
#define HIP_BLAS_CHECK(x) \
if(x != HIPBLAS_STATUS_SUCCESS) \
{ \
std::cout<<"Wrong HIP BLAS: "; \
exit(1); \
}
#define HIP_LASTERROR() \
{ \
std::cout<<"ERROR: "<<hipGetErrorString(hipGetLastError())<<std::endl; \
}
char *ReadFile(const std::string &filePath, size_t &fileSize, void *buffer, size_t bufferSize)
{
struct stat sBuf;
int fileStatus = stat(filePath.data(), &sBuf);
if (fileStatus == -1) {
std::cout<<"failed to get file"<<std::endl;
return nullptr;
}
if (S_ISREG(sBuf.st_mode) == 0) {
std::cout<<filePath.c_str()<<" is not a file, please enter a file"<<std::endl;
return nullptr;
}
std::ifstream file;
file.open(filePath, std::ios::binary);
if (!file.is_open()) {
std::cout<<"Open file failed. path "<<filePath.c_str()<<std::endl;
return nullptr;
}
std::filebuf *buf = file.rdbuf();
size_t size = buf->pubseekoff(0, std::ios::end, std::ios::in);
if (size == 0) {
std::cout<<"file size is 0"<<std::endl;
file.close();
return nullptr;
}
if (size > bufferSize) {
std::cout<<"file size = "<<size<<" is larger than buffer size = "<<bufferSize<<std::endl;
file.close();
return nullptr;
}
buf->pubseekpos(0, std::ios::in);
buf->sgetn(static_cast<char *>(buffer), size);
fileSize = size;
file.close();
return static_cast<char *>(buffer);
}
bool WriteFile(const std::string &filePath, const void *buffer, size_t size)
{
if (buffer == nullptr) {
//ERROR_LOG("Write file failed. buffer is nullptr");
return false;
}
FILE *outputFile = fopen(filePath.c_str(), "wb");
if (outputFile == nullptr) {
//ERROR_LOG("Open file failed. path = %s", filePath.c_str());
return false;
}
fwrite(buffer, sizeof(char), size, outputFile); // element size and count in the conventional order
fclose(outputFile);
return true;
}
int main(int argc, char** argv) {
//size_t M = 1024;
//size_t K = 1024;
//size_t N = 1024;
size_t M = 8192;
size_t K = 8192;
size_t N = 8192;
//host data
size_t fp16_bytes = sizeof(hipblasHalf) * M * N;
size_t fp32_bytes = sizeof(float) * M * N;
unsigned short * fp16_h_A = (unsigned short *)malloc(fp16_bytes);
unsigned short * fp16_h_B = (unsigned short *)malloc(fp16_bytes);
unsigned short * fp16_h_C = (unsigned short *)malloc(fp16_bytes);
float * fp32_h_C = (float*)malloc(fp32_bytes);
//device data
//fp32
float * fp32_d_C;
//fp16
hipblasHalf * fp16_d_A;
hipblasHalf * fp16_d_B;
hipblasHalf * fp16_d_C;
std::cout<<"Begin to malloc"<<std::endl;
HIP_ASSERT(hipMalloc((void**)&fp16_d_A, fp16_bytes));
HIP_ASSERT(hipMalloc((void**)&fp16_d_B, fp16_bytes));
HIP_ASSERT(hipMalloc((void**)&fp16_d_C, fp16_bytes));
HIP_ASSERT(hipMalloc((void**)&fp32_d_C, fp32_bytes));
HIP_LASTERROR();
double msecPerMatrixMul[3] = {0, 0, 0};
double gigaFlops[3] = {0, 0, 0};
double flopsPerMatrixMul = 2.0 * M * N * K;
size_t fileSize;
char * fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_a.bin",fileSize,fp16_h_A,fp16_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix A read file fail"<<std::endl;
exit(1);
}
fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_b.bin",fileSize,fp16_h_B,fp16_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix B read file fail"<<std::endl;
exit(1);
}
fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_c.bin",fileSize,fp16_h_C,fp16_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix C read file fail"<<std::endl;
exit(1);
}
fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp32matrix_c.bin",fileSize,fp32_h_C,fp32_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix C (fp32) read file fail"<<std::endl;
exit(1);
}
HIP_ASSERT(hipMemcpy(fp16_d_A,fp16_h_A,fp16_bytes,hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(fp16_d_B,fp16_h_B,fp16_bytes,hipMemcpyHostToDevice));
//HIP_ASSERT(hipMemcpy(fp16_d_C,fp16_h_C,fp16_bytes,hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(fp32_d_C,fp32_h_C,fp32_bytes,hipMemcpyHostToDevice));
std::cout<<"original host fp16 Matrix 100th element"<<(fp16_h_A[100])<<std::endl;
HIP_ASSERT(hipMemcpy(fp16_h_A,fp16_d_A,fp16_bytes,hipMemcpyDeviceToHost));
std::cout<<"translated host fp16 Matrix 100th element"<<(fp16_h_A[100])<<std::endl;
HIP_LASTERROR();
hipEvent_t start,stop;
hipEventCreate(&start);
hipEventCreate(&stop);
float msecTotal = 0;
int nIter = 1;
std::cout<<"begin GEMM"<<std::endl;
// FP32 GEMM
hipblasHandle_t blas_handle;
hipblasCreate(&blas_handle);
//FP16
float fp32_alpha = 1;
float fp32_beta = 1;
hipblasHandle_t fp16blas_handle;
hipblasCreate(&fp16blas_handle);
hipblasHalf fp16_alpha = static_cast<hipblasHalf>(1);
hipblasHalf fp16_beta = static_cast<hipblasHalf>(1);
hipblasHalf * device_fp16_alpha = nullptr;
hipblasHalf * device_fp16_beta = nullptr;
HIP_ASSERT(hipMalloc((void**)&device_fp16_alpha, sizeof(hipblasHalf)));
HIP_ASSERT(hipMalloc((void**)&device_fp16_beta, sizeof(hipblasHalf)));
HIP_ASSERT(hipMemcpy(device_fp16_beta, &fp16_beta,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(device_fp16_alpha, &fp16_alpha,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_LASTERROR();
HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
fp16_d_C,M));
hipDeviceSynchronize();
HIP_LASTERROR();
hipEventRecord(start);
for (int run = 0 ; run < nIter; run ++ ) {
/*
//fp16->fp16
HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
fp16_d_C,M));
*/
HIP_BLAS_CHECK(hipblasGemmEx(blas_handle, HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K, &fp32_alpha,
fp16_d_A, HIPBLAS_R_16F, M,
fp16_d_B, HIPBLAS_R_16F, K,
&fp32_beta, fp32_d_C, HIPBLAS_R_32F, M,
HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT));
}
hipEventRecord(stop);
hipEventSynchronize(stop);
hipEventElapsedTime(&msecTotal, start, stop);
HIP_LASTERROR();
msecPerMatrixMul[1] = msecTotal / nIter;
gigaFlops[1] = (flopsPerMatrixMul * 1.0e-9f) / (msecPerMatrixMul[1] / 1000.0f);
printf( "HipBlas FP16 Performance= %.2f GFlop/s, Time= %.3f msec, Size= %.0f Ops,\n",
gigaFlops[1],
msecPerMatrixMul[1],
flopsPerMatrixMul);
hipblasDestroy(blas_handle);
hipblasDestroy(fp16blas_handle);
// Free Memory
hipFree(fp16_d_A);
hipFree(fp16_d_B);
hipFree(fp16_d_C);
free(fp16_h_A);
free(fp16_h_B);
free(fp16_h_C);
}
I implemented a CPU-side function to check the error, but I found the error between fp16 GemmEx and my self-implemented fp16 GEMM is too big. I cannot figure out why; please provide some suggestions, thank you!
10_4 checking.zip
Hi there, thanks for raising this issue.
It seems to me that the problem lies with the alpha/beta initialization. When running the sample with ROCBLAS_LAYER=2, it appears that both alpha and beta are not equal to 1 as expected, but rather 0.
I changed alpha and beta in the zipped code to be of type _Float16, casting to const hipblasHalf* when passing into hipblasGemmEx, and it seemed to work as expected. Please let me know if this solution works for you as well.
Thanks,
Daine
Thanks for your reply. In fact, I found that if I call GemmEx with fp16 input/output, float alpha and beta, and both operations set to HIPBLAS_OP_N, the result is correct and the performance is about 130 TFLOPS at a problem size of m = n = k = 16384. But when these conditions are not met, the performance varies a lot.
An experiment calculating the average per-run performance over 100 GEMM executions under different settings.
Also, similar to your solution, I tried the code below a while ago:
hipblasHalf * device_fp16_alpha = nullptr;
hipblasHalf * device_fp16_beta = nullptr;
HIP_ASSERT(hipMalloc((void**)&device_fp16_alpha, sizeof(hipblasHalf)));
HIP_ASSERT(hipMalloc((void**)&device_fp16_beta, sizeof(hipblasHalf)));
HIP_ASSERT(hipMemcpy(device_fp16_beta, &fp16_beta,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(device_fp16_alpha, &fp16_alpha,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_LASTERROR();
float Peak_ms = 1000000;
for (int run = 0 ; run < nIter; run ++ ) {
//fp16->fp16
HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
fp16_d_C,M));
}
It doesn't produce the right result either. In my opinion, this code should work like your solution. I copied the alpha/beta values from the device back to the host and checked them; they read as 1.0, but when passed to hipblasHgemm they behave as 0. Could you perhaps provide a small code example to reproduce what you described? Thank you.
So I would really appreciate it if AMD could provide a detailed guide to using hipblasGemm, because as far as I can see, these problems are caused by incomplete software support in rocBLAS.
I will put together a small tool library today to run all these tests in a more convenient way, combining the GEMM runs and the result-checking function. Maybe this way we can produce a table telling people how to use hipblasGEMM correctly.
Just some comments, your code pasted above seems buggy:
It passes device pointers as arguments for alpha/beta (addresses on GPU) without setting:
status = hipblasSetPointerMode(fp16blas_handle, HIPBLAS_POINTER_MODE_DEVICE);
as the default would be as if you set:
status = hipblasSetPointerMode(handle, HIPBLAS_POINTER_MODE_HOST);
You seem to initialize the two handles differently, and it is not clear you are using the one you intended; it is simpler to use one handle and reconfigure it if required for the secondary call, since you are in a single thread.
Your numeric analysis code for f32 notes row and col major but divides by a different indexing scheme; maybe you intended column major and made a mistake. Nor is this stable and useful where the test data approaches zero.
float cur_err = cm_matrix[j*m+i] - rm_matrix[i*n+j];
if(abs(cur_err) > abs(max_err)) max_err = abs(cur_err);
cur_err /= rm_matrix[i*m+j]; // << here the indexing is not row nor col major
Thus please try using a single handle configured for how you will pass arguments (host or device mode) and compare results against CPU reference code of either LAPACK, BLIS, or OpenBLAS which are ones we test against. This can rule out bugs in your kernel implementation. You made a good point we should have more examples, note some additional examples for the underlying rocBLAS library are at: https://github.com/ROCmSoftwarePlatform/rocBLAS-examples also the explanation of handles from rocBLAS is at https://rocblas.readthedocs.io/en/rocm-4.3.1/
Thank you for your reply. You are right, the pointer mode setting is something I missed before; thank you for that. As for cur_err /= rm_matrix[i*m+j]; (the indexing that is neither row nor column major), since M = N = K is always true in my cases, this does not matter.
As for this
Nor is this stable and useful where test data approaches zero.
I totally agree with you, and I implemented a version without using abs(); it works fine.
I need some time to work on the whole new test. Thank you.
I've added an hgemm example at example_hgemm.cpp which can be compiled with hipcc; I hope that helps as well.
Thank you so much for that. It did help me.