ROCm/rocSOLVER

Possible test data corruption for rocSOLVER (version rocm 4.3.0)

littlewu2508 opened this issue · 4 comments

I'm packaging rocSOLVER 4.3.0 for Gentoo and testing the package with OpenBLAS as the CPU reference, and I encountered a segmentation fault (detailed information here). Using Netlib LAPACK as the reference does not trigger the problem, so at first I suspected a bug in OpenBLAS. After a short investigation with the OpenBLAS team, however, I found that the input data passed to the reference LAPACK appears to be corrupted, and that this causes the segfault (somehow Netlib runs normally even with the same suspicious input). So I came here for help finding the cause and a solution.

My environment:

  • kernel(GPU driver): Linux 5.15.8
  • compiler: gcc-11.2; llvm-rocm-4.3.0 + hipcc-4.3.0
  • ROCm: all 4.3.0, gentoo packages
  • googletest: 1.11.0
  • openblas: 0.3.19
  • netlib lapack: 3.10.0
cgmb commented

Thanks for taking a look at this. I've been wanting to do some testing with OpenBLAS.

However, I don't understand what the actual problem is. I'm not deeply familiar with this particular function, but the way you're printing ipiv looks wrong to me:

for(int i=0;i<k2;i++)
{
    for(int j=0;j<2;j++)
    {
        printf("%d,", piv[(i*2+j)]);
    }
}

ipiv for dlaswp is documented as being of dimension (K1+(K2-K1)*abs(INCX)). So for

N = 192 
LDA = 100 
K1 = 20
K2 = 100 
INCX = -2

then

K1+(K2-K1)*abs(INCX) = 180 

However, the maximum value of i*2+j occurs when i=k2-1 and j=1, so you access up to index 2*k2-1. That's index 199, which is out of bounds.
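For reference, here is a minimal sketch of a print loop that stays within the documented bounds. The helper names `laswp_ipiv_len` and `print_ipiv` are hypothetical (they are not part of LAPACK, OpenBLAS, or rocSOLVER), and the sketch assumes `ipiv` is a plain `int` array holding at least the documented number of entries:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: number of ipiv entries documented for xLASWP,
 * i.e. K1 + (K2 - K1) * abs(INCX). */
static int laswp_ipiv_len(int k1, int k2, int incx)
{
    return k1 + (k2 - k1) * abs(incx);
}

/* Hypothetical helper: print only the documented entries of ipiv.
 * Assumes ipiv points to at least laswp_ipiv_len(k1, k2, incx) ints.
 * Returns the number of entries printed. */
static int print_ipiv(const int *ipiv, int k1, int k2, int incx)
{
    assert(ipiv != NULL);
    int len = laswp_ipiv_len(k1, k2, incx);
    for (int i = 0; i < len; i++)
    {
        printf("%d,", ipiv[i]);
    }
    printf("\n");
    return len;
}
```

With K1 = 20, K2 = 100, and INCX = -2, calling `print_ipiv(piv, 20, 100, -2)` would print 180 entries (indices 0 through 179) instead of reading up to index 199.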

Oh, you're right, I made a mistake here (I thought it was (K1+(K2-K1))*abs(INCX)).

So ipiv is 60,70,100,10,70,60,30,70,60,70,90,60,70,100,20,90,100,40,70,60,30,70,100,10,50,70,90,70,10,20,80,40,80,50,70,70,30,80,100,60,80,70,40,40,70,70,40,100,90,50,80,10,10,80,60,90,100,90,90,90,30,70,10,90,40,80,20,100,60,50,70,90,40,10,100,60,70,90,60,20,30,70,20,60,10,20,30,50,10,100,100,10,90,70,30,40,40,60,30,100,80,90,50,20,70,60,20,60,80,80,100,60,80,30,30,30,80,70,70,20,90,100,10,100,30,100,60,100,10,70,10,100,60,20,40,50,80,90,90,30,30,10,50,60,90,100,80,80,30,10,30,100,10,40,70,60,20,90,10,20,20,100,30,10,90,20,20,70,100,40,10,100,30,70,80,90,60,20,80,10

I made a repo for reproduction: https://github.com/littlewu2508/test_lapack_laswp. It seems that the memory following ipiv affects OpenBLAS's behavior, which is strange.

OpenBLAS fixed the issue via OpenMathLib/OpenBLAS#3514.