ROCm/rocBLAS

[Bug]: rocBLAS gemm in python-interface produces wrong results when called more than once

paolodalberto opened this issue · 11 comments

Describe the bug

I created a simple python interface. It works using small example but not in an application. I will provide example and code to reproduce the problem and I think it is a sync problem and I need your help.

To Reproduce

I used a docker tensorflow:latest using rocm 5.6
I installed the clients using install.sh to get the clients code and learn how to write a simple interface.

https://github.com/paolodalberto/MatrixFlow/tree/main/GpuInterface

in particular a simple gemm wrapper
https://github.com/paolodalberto/MatrixFlow/blob/main/GpuInterface/gpuinterface.cpp#L425

where I sample the inputs and the outputs to check consistency and I create a single handle, device and prop.

I created a simple case (temp.py) where I call csr_mv, coo_mv, and dgemm. This standalone test works just fine.
But then I play with an application

python Examples/play.py
where I compare the execution of gemm every time. The first call is correct and the second (no matter how I call it is off)

Expected behavior

Then I use the interface in an application and I compare the result to the usual CPU gemm
https://github.com/paolodalberto/MatrixFlow/blob/main/Matrices/matrices.py#L63

root@fastmmw:/matrixflow# python Examples/play.py

    rocblas_status rocblas_dgemm(
        rocblas_handle handle, 
        rocblas_operation transA, 
        rocblas_operation transB, 
        rocblas_int m, 
        rocblas_int n, 
        rocblas_int k, 
        const double *alpha, 
        const double *A, 
        rocblas_int lda, 
        const double *B, 
        rocblas_int ldb, 
        const double *beta, 
        double *C, 
        rocblas_int ldc) 

    rocblas_status rocblas_dgema(
        rocblas_handle handle, 
        rocblas_operation transA, 
        rocblas_operation transB, 
        rocblas_int m, 
        rocblas_int n, 
        rocblas_int k, 
        const double *alpha, 
        const double *A, 
        rocblas_int lda, 
        const double *B, 
        rocblas_int ldb, 
        const double *beta, 
        double *C, 
        rocblas_int ldc) 
rocblas_dgemm( %s, 'n', 'n', %d, %d, %d,  %s, %s, %d, %s, %d,  %s, %s, %d) 
rocblas_dgema( %s, 'n', 'n', %d, %d, %d  %s, %s, %d,  %s, %s, %s,  %s, %d) 

 Fast Matrix Multiplication MxK * KxN -> MxN 
 Introduce --M --K --N
 We show case the rest
compute
	 BLAS  0x3c260b0
0 Device: AMD Radeon VII
	 A 1 2
	 B 1 2
	 C 0 0
	 Data and Initialization Kernel 10.1151
	 Time Kernel 0.39634
	 C <- 173880 347760
	 Read data from  Kernel 0.000689
m, n, k, lda, ldb, ldc = 80, 80, 80, 80, 80, 80, alpha=1, beta=0
(80, 80) 0.0
time 13.049914598464966
(4, 7)
	 BLAS  0x3c260b0
0 Device: AMD Radeon VII
	 A -1640 -1680
	 B 41 82
	 C 0 0
	 Data and Initialization Kernel 0.000169
	 Time Kernel 0.000137
	 C <- -9.01016e+07 -9.22992e+07
	 Read data from  Kernel 6.5e-05
m, n, k, lda, ldb, ldc = 40, 40, 40, 40, 40, 40, alpha=1, beta=0
(40, 40) -20637760000.0

If you execute the last operation using GPU1

 C = Matrix(B)
(Pdb) B = numpy.matrix(rocmgpu.gemm(1,L.A1, L.shape[1], R.A1, R.shape[1]))
	 BLAS  0x1f0ecf0
1 Device: AMD Radeon VII
	 A -1640 -1680
	 B 41 82
	 C 0 0
	 Data and Initialization Kernel 0.223054
	 Time Kernel 4.1e-05
	 C <- 0 0
	 Read data from  Kernel 0.000549
m, n, k, lda, ldb, ldc = 40, 40, 40, 40, 40, 40, alpha=1, beta=0

As you can see the return matrix from the device is off

Log-files

Add full logfiles to help explain your problem.

Environment

Hardware description
CPU device name
GPU device name

The above hardware Table information can be generated by command:

rocminfo | grep Marketing
root@fastmmw:/matrixflow# /opt/rocm/bin/rocminfo 
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper 1950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper 1950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65759804(0x3eb6a3c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65759804(0x3eb6a3c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65759804(0x3eb6a3c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-25c430a172e17d3e               
  Marketing Name:          AMD Radeon VII                     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26287(0x66af)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   2816                               
  Internal Node ID:        1                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc-:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-23ac796172fd5d6b               
  Marketing Name:          AMD Radeon VII                     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26287(0x66af)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   17408                              
  Internal Node ID:        2                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc-:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
Software version
rocm-core v0.0
rocblas v0.0

The above software Table information can be queried with:

Ubuntu/Debian:
dpkg -s rocm-core | grep Version
dpkg -s rocblas | grep Version
Centos/RHEL:
rpm -qa | grep rocm-core
rpm -qa | grep rocblas
SLES:
zypper se -s | grep rocm-core
zypper se -s | grep rocblas
root@fastmmw:/matrixflow# dpkg -s rocm-core | grep Version
Version: 5.6.0.50600-67~20.04
root@fastmmw:/matrixflow# dpkg -s rocblas | grep Version
Version: 3.0.0.50600-67~20.04

Make sure that ROCm is correctly installed and to capture detailed environment information run the following command:

[environment.txt](https://github.com/ROCmSoftwarePlatform/rocBLAS/files/12242823/environment.txt)```

Attach `environment.txt`


### Additional context
Add any other context about the problem here.

At this point I do not really fancy performance but to have a clear understanding how a simple interface could be made. Eventually I am planning to create kernels codes .. but this is a necessary first step. We can discuss internally if you like.

Thank you for your time

I have found another corner case for gemv

(Pdb) n
	 BLAS GEMV 0x239dcf0
0 Device: AMD Radeon VII
	 A 0 0
	 B -52.9725 -34.4124
	 C 0.5 23.375
	 Data and Initialization Kernel 0.223526
m, n  = 72, 64, 64, alpha=1, beta=1
4 Error: rocblas error in line  61
invalid size parameter.
	 Time Kernel 1e-05
	 C <- 0.5 23.375
	 Read data from  Kernel 0.000573
0.5

This time I do some parameters wrong ... which one ?

The matrices Layout for rocblas are Fortran like
I fixed the problem for above for GEMV ... checking what I am doing wrong for GEMM

V = rocmgpu.gemm(0,LL.A.flatten('F'), LL.shape[0],RR.A.flatten(), RR.shape[1],Result.A.flatten(),Result.shape[1])
This will work :)

nice @rkamd
I think I have found out what I am doing wrong :)

I will share the code and consideration tomorrow and I will close the issue.
Paolo

rkamd commented

@paolodalberto, sounds good.

@paolodalberto just to clarify in rocBLAS all matrices are column major memory ordered (unfortunately this is python value 'F' and 'C' was used for row major as they are language based abbreviations). See https://rocmdocs.amd.com/projects/rocBLAS/en/develop/API_Reference_Guide.html in the Note section. You can create your numpy array with this layout https://numpy.org/doc/stable/reference/generated/numpy.array.html as the constructor and other methods like reshape may take the argument order='F' to use this layout. It will be good to document this in your python interface as well.

@TorreZuk yep
it took a little to pivot but I think I managed :)
@rkamd I will close this
as reference my basic interface looks like this
https://github.com/paolodalberto/MatrixFlow/blob/main/GpuInterface/gpuinterface.cpp
https://github.com/paolodalberto/MatrixFlow/blob/main/GpuInterface/procm.py

I have now a simple toy for sparse and dense, GPU and not GPU for GEMV
and GPU and not GPU for GEMM

cheers

please contact me directly if you have any further questions