GemmEx with float16 runs very fast but with wrong results
Closed this issue · 12 comments
What is the expected behavior
- The performance (e.g. running time) of GEMM in float32 and float16 should differ in a reasonable way, and the result matrices should be close.
What actually happens
- The run time of GemmEx in fp16 or Hgemm is too short, e.g. 0.02 ms.
- The result of GemmEx in fp16 or Hgemm differs too much from the reference (the code doing the check is inside the file too).
How to reproduce
- Code is below
-
compile command: hipcc fp16gemm.hip -o fp16_gemm -lhipblas -L/opt/rocm-4.3.0/hipblas/lib
-
execute: ./fp16_gemm
-
Output is like this:
The error analysis may have some problems (I am still working on it), but the run time of the fp16 GEMM or Hgemm is too short to be real.
- Matrix data: generated with gen_data.py under Python 3.8.
Environment
Hardware | Description
---|---
GPU | MI100
CPU | AMD EPYC 7302 16-Core Processor

Software | Version
---|---
ROCm | v4.3.0
HipCC | v4.3
RocBlas | v4.3
Also, is there any other software I need to install? I had rocm-4.3.0 installed before, but not rocblas-4.3.0, so I compiled this code after installing rocblas-4.3.0 and rocsolver-4.3.0. Is there any package I am still missing that could cause such a problem?
Here is a simple reproduction (the headers included are the same as those in code.zip):
#define HIP_ASSERT(x) (assert((x)==hipSuccess))
#define HIP_CHECK(x) \
if(x != hipSuccess) \
{ \
std::cout<<"Wrong HIP"<<std::endl; \
exit(1); \
}
#define HIP_BLAS_CHECK(x) \
if(x != HIPBLAS_STATUS_SUCCESS) \
{ \
std::cout<<"Wrong HIP BLAS: "; \
exit(1); \
}
#define HIP_LASTERROR() \
{ \
std::cout<<"ERROR: "<<hipGetErrorString(hipGetLastError())<<std::endl; \
}
char *ReadFile(const std::string &filePath, size_t &fileSize, void *buffer, size_t bufferSize)
{
struct stat sBuf;
int fileStatus = stat(filePath.data(), &sBuf);
if (fileStatus == -1) {
std::cout<<"failed to get file"<<std::endl;
return nullptr;
}
if (S_ISREG(sBuf.st_mode) == 0) {
std::cout<<filePath.c_str()<<" is not a file, please enter a file"<<std::endl;
return nullptr;
}
std::ifstream file;
file.open(filePath, std::ios::binary);
if (!file.is_open()) {
std::cout<<"Open file failed. path "<<filePath.c_str()<<std::endl;
return nullptr;
}
std::filebuf *buf = file.rdbuf();
size_t size = buf->pubseekoff(0, std::ios::end, std::ios::in);
if (size == 0) {
std::cout<<"file size is 0"<<std::endl;
file.close();
return nullptr;
}
if (size > bufferSize) {
std::cout<<"file size = "<<size<<" is larger than buffer size = "<<bufferSize<<std::endl;
file.close();
return nullptr;
}
buf->pubseekpos(0, std::ios::in);
buf->sgetn(static_cast<char *>(buffer), size);
fileSize = size;
file.close();
return static_cast<char *>(buffer);
}
bool WriteFile(const std::string &filePath, const void *buffer, size_t size)
{
if (buffer == nullptr) {
//ERROR_LOG("Write file failed. buffer is nullptr");
return false;
}
FILE *outputFile = fopen(filePath.c_str(), "wb");
if (outputFile == nullptr) {
//ERROR_LOG("Open file failed. path = %s", filePath.c_str());
return false;
}
fwrite(buffer, sizeof(char), size, outputFile); // element size and count in the conventional order
fclose(outputFile);
return true;
}
int main(int argc, char** argv) {
//size_t M = 1024;
//size_t K = 1024;
//size_t N = 1024;
size_t M = 8192;
size_t K = 8192;
size_t N = 8192;
//host data
size_t fp16_bytes = sizeof(hipblasHalf) * M * N;
size_t fp32_bytes = sizeof(float) * M * N;
unsigned short * fp16_h_A = (unsigned short *)malloc(fp16_bytes);
unsigned short * fp16_h_B = (unsigned short *)malloc(fp16_bytes);
unsigned short * fp16_h_C = (unsigned short *)malloc(fp16_bytes);
float * fp32_h_C = (float*)malloc(fp32_bytes);
//device data
//fp32
float * fp32_d_C;
//fp16
hipblasHalf * fp16_d_A;
hipblasHalf * fp16_d_B;
hipblasHalf * fp16_d_C;
std::cout<<"Begin to malloc"<<std::endl;
HIP_ASSERT(hipMalloc((void**)&fp16_d_A, fp16_bytes));
HIP_ASSERT(hipMalloc((void**)&fp16_d_B, fp16_bytes));
HIP_ASSERT(hipMalloc((void**)&fp16_d_C, fp16_bytes));
HIP_ASSERT(hipMalloc((void**)&fp32_d_C, fp32_bytes));
HIP_LASTERROR();
double msecPerMatrixMul[3] = {0, 0, 0};
double gigaFlops[3] = {0, 0, 0};
double flopsPerMatrixMul = 2.0 * M * N * K;
size_t fileSize;
char * fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_a.bin",fileSize,fp16_h_A,fp16_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix A read file fail"<<std::endl;
exit(1);
}
fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_b.bin",fileSize,fp16_h_B,fp16_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix B read file fail"<<std::endl;
exit(1);
}
fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp16matrix_c.bin",fileSize,fp16_h_C,fp16_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix C read file fail"<<std::endl;
exit(1);
}
fileData = ReadFile("/mnt/data/home/mzw/workspace/matrix_data/fp32matrix_c.bin",fileSize,fp32_h_C,fp32_bytes);
if(fileData == nullptr)
{
std::cout<<"matrix C (fp32) read file fail"<<std::endl;
exit(1);
}
HIP_ASSERT(hipMemcpy(fp16_d_A,fp16_h_A,fp16_bytes,hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(fp16_d_B,fp16_h_B,fp16_bytes,hipMemcpyHostToDevice));
//HIP_ASSERT(hipMemcpy(fp16_d_C,fp16_h_C,fp16_bytes,hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(fp32_d_C,fp32_h_C,fp32_bytes,hipMemcpyHostToDevice));
std::cout<<"original host fp16 Matrix 100th element"<<(fp16_h_A[100])<<std::endl;
HIP_ASSERT(hipMemcpy(fp16_h_A,fp16_d_A,fp16_bytes,hipMemcpyDeviceToHost));
std::cout<<"translated host fp16 Matrix 100th element"<<(fp16_h_A[100])<<std::endl;
HIP_LASTERROR();
hipEvent_t start,stop;
hipEventCreate(&start);
hipEventCreate(&stop);
float msecTotal = 0;
int nIter = 1;
std::cout<<"begin GEMM"<<std::endl;
// FP32 GEMM
hipblasHandle_t blas_handle;
hipblasCreate(&blas_handle);
//FP16
float fp32_alpha = 1;
float fp32_beta = 1;
hipblasHandle_t fp16blas_handle;
hipblasCreate(&fp16blas_handle);
hipblasHalf fp16_alpha = static_cast<hipblasHalf>(1);
hipblasHalf fp16_beta = static_cast<hipblasHalf>(1);
hipblasHalf * device_fp16_alpha = nullptr;
hipblasHalf * device_fp16_beta = nullptr;
HIP_ASSERT(hipMalloc((void**)&device_fp16_alpha, sizeof(hipblasHalf)));
HIP_ASSERT(hipMalloc((void**)&device_fp16_beta, sizeof(hipblasHalf)));
HIP_ASSERT(hipMemcpy(device_fp16_beta, &fp16_beta,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(device_fp16_alpha, &fp16_alpha,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_LASTERROR();
HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
fp16_d_C,M));
hipDeviceSynchronize();
HIP_LASTERROR();
hipEventRecord(start);
for (int run = 0 ; run < nIter; run ++ ) {
/*
//fp16->fp16
HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
fp16_d_C,M));
*/
HIP_BLAS_CHECK(hipblasGemmEx(blas_handle, HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K, &fp32_alpha,
fp16_d_A, HIPBLAS_R_16F, M,
fp16_d_B, HIPBLAS_R_16F, K,
&fp32_beta, fp32_d_C, HIPBLAS_R_32F, M,
HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT));
}
hipEventRecord(stop);
hipEventSynchronize(stop);
hipEventElapsedTime(&msecTotal, start, stop);
HIP_LASTERROR();
msecPerMatrixMul[1] = msecTotal / nIter;
gigaFlops[1] = (flopsPerMatrixMul * 1.0e-9f) / (msecPerMatrixMul[1] / 1000.0f);
printf( "HipBlas FP16 Performance= %.2f GFlop/s, Time= %.3f msec, Size= %.0f Ops,\n",
gigaFlops[1],
msecPerMatrixMul[1],
flopsPerMatrixMul);
hipblasDestroy(blas_handle);
hipblasDestroy(fp16blas_handle);
// Free Memory
hipFree(fp16_d_A);
hipFree(fp16_d_B);
hipFree(fp16_d_C);
free(fp16_h_A);
free(fp16_h_B);
free(fp16_h_C);
}
I implemented a CPU-side function to check the error, but I found the error between fp16 GemmEx and my self-implemented fp16 GEMM is too big. I cannot figure out why; please provide some suggestions, thank you!
10_4 checking.zip
Hi there, thanks for raising this issue.
It seems to me that the problem lies with the alpha/beta initialization. When running the sample with ROCBLAS_LAYER=2, it appears that both alpha and beta are not equal to 1 as expected, but rather 0.
I changed alpha and beta in the zipped code to be of type _Float16, casting to const hipblasHalf* when passing into hipblasGemmEx, and it seemed to work as expected. Please let me know if this solution works for you as well.
Thanks,
Daine
Thanks for your reply. In fact, I found that if I call GemmEx with fp16 input/output, float alpha and beta, and both operations set to HIPBLAS_OP_N, the result is correct and the performance is about 130 TFLOPS at a problem size of m = n = k = 16384. But when these conditions are not met, the performance varies a lot.
An experiment calculating the average per-run performance over 100 GEMM executions under different settings.
Also, similar to your solution, I tried the code below a while ago:
hipblasHalf * device_fp16_alpha = nullptr;
hipblasHalf * device_fp16_beta = nullptr;
HIP_ASSERT(hipMalloc((void**)&device_fp16_alpha, sizeof(hipblasHalf)));
HIP_ASSERT(hipMalloc((void**)&device_fp16_beta, sizeof(hipblasHalf)));
HIP_ASSERT(hipMemcpy(device_fp16_beta, &fp16_beta,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_ASSERT(hipMemcpy(device_fp16_alpha, &fp16_alpha,sizeof(hipblasHalf),hipMemcpyHostToDevice));
HIP_LASTERROR();
float Peak_ms = 1000000;
for (int run = 0 ; run < nIter; run ++ ) {
//fp16->fp16
HIP_BLAS_CHECK(hipblasHgemm(fp16blas_handle,HIPBLAS_OP_T, HIPBLAS_OP_T,
M,N,K,device_fp16_alpha,fp16_d_A,M,fp16_d_B,K,device_fp16_beta,
fp16_d_C,M));
}
It doesn't produce the right result either. In my opinion, this code should work like your solution. I copied the alpha/beta values from the device back to the host and checked them; they read as 1.0, but when passed to hipblasHgemm they behave as 0. Could you perhaps provide a small code example to reproduce what you described? Thank you.
So I would really appreciate it if AMD could provide a detailed guide to using hipblasGemm, because as far as I can see, these problems are caused by incomplete software support in rocBLAS.
I will put together a small tool library today to run all these tests in a more convenient way, combining the GEMM runs and the result-checking function. Maybe this way we can produce a table telling people how to use hipblasGEMM correctly.
Just some comments, your code pasted above seems buggy:
It passes device pointers as arguments for alpha/beta (addresses on GPU) without setting:
status = hipblasSetPointerMode(fp16blas_handle, HIPBLAS_POINTER_MODE_DEVICE);
as the default would be as if you set:
status = hipblasSetPointerMode(handle, HIPBLAS_POINTER_MODE_HOST);
You seem to initialize the two handles differently, and it is not clear you are using the one you intended; it is simpler to use one handle and reconfigure it if required for the secondary call, since you are in a single thread.
Your numeric analysis code for f32 notes row and col major but divides by a different indexing scheme; maybe you intended column major and made a mistake. Nor is this stable and useful where the test data approaches zero.
float cur_err = cm_matrix[j*m+i] - rm_matrix[i*n+j];
if(abs(cur_err) > abs(max_err)) max_err = abs(cur_err);
cur_err /= rm_matrix[i*m+j]; // << here the indexing is not row nor col major
Thus please try using a single handle configured for how you will pass arguments (host or device mode) and compare results against CPU reference code of either LAPACK, BLIS, or OpenBLAS which are ones we test against. This can rule out bugs in your kernel implementation. You made a good point we should have more examples, note some additional examples for the underlying rocBLAS library are at: https://github.com/ROCmSoftwarePlatform/rocBLAS-examples also the explanation of handles from rocBLAS is at https://rocblas.readthedocs.io/en/rocm-4.3.1/
Thank you for your reply. You are right, the pointer mode setting is something I missed before; thank you for that. As for cur_err /= rm_matrix[i*m+j]; (the indexing that is neither row nor column major), since M = N = K is always true in my cases, this does not matter.
As for this
Nor is this stable and useful where test data approaches zero.
I totally agree with you, and I implemented a version without using abs(); it works fine.
I need some time to work on the whole new test. Thank you.
I've added an hgemm example at example_hgemm.cpp which can be compiled with hipcc; I hope that helps as well.
Thank you so much for that. It did help me.