Possible test data corruption for rocSOLVER (version rocm 4.3.0)
littlewu2508 opened this issue · 4 comments
I'm packaging rocSOLVER-4.3.0 for Gentoo. While testing the package with OpenBLAS as the CPU reference, I encountered a segmentation fault, detailed information here. Choosing netlib LAPACK as the reference does not trigger the problem, so at first I suspected a bug in OpenBLAS. After a short investigation with the OpenBLAS team, however, I found that the input data passed to the reference LAPACK appears to be corrupted and causes the segfault (somehow netlib runs normally even with the same suspicious input). So I've come here for help finding the cause and a solution.
My environment:
- kernel(GPU driver): Linux 5.15.8
- compiler: gcc-11.2; llvm-rocm-4.3.0 + hipcc-4.3.0
- ROCm: all 4.3.0, gentoo packages
- googletest: 1.11.0
- openblas: 0.3.19
- netlib lapack: 3.10.0
Thanks for taking a look at this. I've been wanting to do some testing with OpenBLAS.
However, I don't understand what the actual problem is. I'm not deeply familiar with this particular function, but the way you're printing ipiv looks wrong to me:
for(int i=0;i<k2;i++)
{
    for(int j=0;j<2;j++)
    {
        printf("%d,", piv[(i*2+j)]);
    }
}
ipiv for dlaswp is documented as being of dimension (K1+(K2-K1)*abs(INCX)). So for
N = 192
LDA = 100
K1 = 20
K2 = 100
INCX = -2
we have
K1+(K2-K1)*abs(INCX) = 180
However, the maximum value of i*2+j occurs when i=k2-1 and j=1, so you access up to index 2*k2-1. That's index 199, which is out of bounds.
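To make the bound concrete, here is a minimal sketch of a print loop that stays inside the documented ipiv dimension. The helper names ipiv_len and print_ipiv are hypothetical (they are not part of LAPACK, OpenBLAS, or rocSOLVER), and the sketch assumes ipiv is a plain int array holding at least K1+(K2-K1)*abs(INCX) entries:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: number of ipiv entries that dlaswp is documented
 * to require, i.e. K1 + (K2 - K1) * abs(INCX). */
static int ipiv_len(int k1, int k2, int incx)
{
    return k1 + (k2 - k1) * abs(incx);
}

/* Hypothetical helper: print exactly the documented number of entries,
 * comma-separated, and return how many were printed.
 * Assumption: ipiv points to at least ipiv_len(k1, k2, incx) ints. */
static int print_ipiv(const int *ipiv, int k1, int k2, int incx)
{
    assert(ipiv != NULL);
    int n = ipiv_len(k1, k2, incx);
    for (int i = 0; i < n; i++)
        printf("%d%s", ipiv[i], (i + 1 < n) ? "," : "\n");
    return n;
}
```

With K1 = 20, K2 = 100 and INCX = -2, ipiv_len returns 180, so print_ipiv reads indices 0 through 179 only, whereas the nested loop above reads up to index 199.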
Oh, you're right, I made a mistake here (I thought it was (K1+(K2-K1))*abs(INCX)).
So ipiv is 60,70,100,10,70,60,30,70,60,70,90,60,70,100,20,90,100,40,70,60,30,70,100,10,50,70,90,70,10,20,80,40,80,50,70,70,30,80,100,60,80,70,40,40,70,70,40,100,90,50,80,10,10,80,60,90,100,90,90,90,30,70,10,90,40,80,20,100,60,50,70,90,40,10,100,60,70,90,60,20,30,70,20,60,10,20,30,50,10,100,100,10,90,70,30,40,40,60,30,100,80,90,50,20,70,60,20,60,80,80,100,60,80,30,30,30,80,70,70,20,90,100,10,100,30,100,60,100,10,70,10,100,60,20,40,50,80,90,90,30,30,10,50,60,90,100,80,80,30,10,30,100,10,40,70,60,20,90,10,20,20,100,30,10,90,20,20,70,100,40,10,100,30,70,80,90,60,20,80,10
https://github.com/littlewu2508/test_lapack_laswp is a repo I made for reproduction. It seems that the memory trailing ipiv affects OpenBLAS, which is strange.
OpenBLAS fixed the issue via OpenMathLib/OpenBLAS#3514