Bug in the complex least-squares solver
loiseaujc opened this issue · 6 comments
Description
Consider the following code
program main
use iso_fortran_env, only: output_unit
use stdlib_kinds, only: dp
use stdlib_linalg, only: solve_lstsq
implicit none
! Dimension of the problem.
integer, parameter :: m = 100, n = m-1
integer :: k
complex(dp) :: A(m, n), b(m), x_true(n), x_lstsq(n)
do k = 1, n
! Zero-out data.
A = 0.0_dp ; b = 0.0_dp ; x_true = 0.0_dp ; x_lstsq = 0.0_dp
! Generate a random least-squares problem of size (k+1, k).
call random_number(A(:k+1, :k)%re) ; call random_number(A(:k+1, :k)%im)
call random_number(x_true(:k)%re) ; call random_number(x_true(:k)%im)
b(:k+1) = matmul(A(:k+1, :k), x_true(:k))
! Solve the lstsq problem.
call solve_lstsq(A(:k+1, :k), b(:k+1), x_lstsq(:k))
! Check correctness of the solution.
write(output_unit, *) "Iteration k", k, norm2(abs(x_true - x_lstsq))
end do
end program main
Although it looks quite contrived, it is adapted from a naïve implementation of a gmres
solver where A
actually is upper Hessenberg. It thus corresponds to the increasingly larger least-squares problem that needs to be solved at each iteration of the gmres
solver. In this MWE, A
is generated as a random matrix just to make sure the error I'll report does not depends on the structure of the matrix.
Initially, everything runs just fine. However, the code crashes for k
equal 39 with the following error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f147f8eb51f in ???
#1 0x7f147f94b838 in ???
#2 0x7f147f94e452 in ???
#3 0x559663cefe05 in __stdlib_linalg_MOD_stdlib_linalg_z_solve_lstsq_one
at build/dependencies/stdlib/src/stdlib_linalg_least_squares.f90:1142
#4 0x559663cc6b47 in MAIN__
at app/main.f90:115
#5 0x559663ccd799 in main
at app/main.f90:3
Segmentation fault (core dumped)
<ERROR> Execution for object " MWE_stdlib_lstsq " returned exit code 139
<ERROR> *cmd_run*:stopping due to failed executions
STOP 139
Not that the error reported actually randomly switches between double free or corruption (out)
(with STOP 134
), SIGSEGV
(with STOP 139
), munmap_chunk(): invalid pointer
(with STOP 134
), or free(): invalide size
(with STOP 134
). If single precision is used instead of double precision, the code crashes at iteration k = 39
instead.
Expected Behaviour
The code should runs smoothly. Note that if real
variables instead of complex
ones are being used, no problem is encountered.
Version of stdlib
0.6.1
Platform and Architecture
Linux with Ubuntu 22.04
Additional Information
The code is generated with fpm
and the latest stdlib
version is fetched directly by fpm
. The code is compiled with gfortran 13.2.0
installed with conda
.
I just re-ran the code, and here is what I observe
Iteration k 37 1.0316717587493331E-013
Iteration k 38 6.1138174828153277E-014
Iteration k 39 200.59206843713523
Iteration k 40 NaN
Iteration k 41 8.7331139962376985E+111
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f821db2751f in ???
#1 0x7f821db8a449 in ???
#2 0x561f4d0df436 in ???
#3 0x561f4d0d1d86 in ???
#4 0x561f4d0d046e in ???
#5 0x7f821db0ed8f in ???
#6 0x7f821db0ee3f in ???
#7 0x561f4d0d04ad in ???
#8 0xffffffffffffffff in ???
Segmentation fault (core dumped)
<ERROR> Execution for object " MWE_stdlib_lstsq " returned exit code 139
<ERROR> *cmd_run*:stopping due to failed executions
STOP 139
Although the error is kind of random (sometimes 139, sometimes 134), it is always the same thing:
- Iteration 38 is fine.
- Iteration 39 runs but the result is incorrect.
- Iterations 40 to 42 return either a crazy large number or a
NaN
before the code eventually crashes.
The same applies if I declare all arrays as allocatable
and allocate them with the appropriate size at each iteration (followed by deallocation). Note finally that, if I use instead the least-squares I wrote in my package (which uses the gels
interface from stdlib_linalg
), everything runs fine. I believe that this problem comes the fix applied here #818.
I'm closing this issue for the moment. While in the train, I played around with the code and found that it is very subtle to reproduce. I'm not quite sure what is happening. I'll re-open some time later once I've investigated a bit more (I don't want to make you lose time on this for the moment).
Sorry for the back and forth but I've been able to isolate the smallest MWE and its quite reproducible. Here is the code
program main
use stdlib_kinds, only: dp
use stdlib_linalg, only: solve_lstsq
implicit none
! Dimension of the problem.
integer, parameter :: n = 42
! Data for the least-squares problem.
complex(dp) :: A(n+1, n), b(n+1), x_true(n), x_lstsq(n)
! Internal variables.
real(dp), allocatable :: tmp(:, :, :), tmp_vec(:, :)
! Zero-out data.
A = 0.0_dp ; b = 0.0_dp ; x_true = 0.0_dp ; x_lstsq = 0.0_dp
allocate(tmp(n+1, n, 2)) ; tmp = 0.0_dp
allocate(tmp_vec(n, 2)) ; tmp_vec = 0.0_dp
! Generate a random complex least-squares problem of size (n+1, n).
call random_number(tmp) ; call random_number(tmp_vec)
A = cmplx(tmp(:, :, 1), tmp(:, :, 2), kind=dp)
x_true = cmplx(tmp_vec(:, 1), tmp_vec(:, 2), kind=dp)
b = matmul(A, x_true)
! Solve the lstsq problem.
call solve_lstsq(A, b, x_lstsq)
! Dellocate arrays.
deallocate(tmp, tmp_vec)
end program main
This code crashes with the following stacktrace
Project is up to date
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f5fe0a0651f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x562c6d67ef43 in __stdlib_linalg_lapack_z_MOD_stdlib_zlarft
at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:11335
#2 0x562c6d57c5ca in __stdlib_linalg_lapack_z_MOD_stdlib_zunmqr
at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:31493
#3 0x562c6d420914 in __stdlib_linalg_lapack_z_MOD_stdlib_zunmbr
at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:64122
#4 0x562c6d41d275 in __stdlib_linalg_lapack_z_MOD_stdlib_zgelsd
at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:64721
#5 0x562c6cc5ba64 in __stdlib_linalg_MOD_stdlib_linalg_z_solve_lstsq_one
at build/dependencies/stdlib/src/stdlib_linalg_least_squares.f90:1129
#6 0x562c6cc39a9a in MAIN__
at app/main.f90:25
#7 0x562c6cc39b8e in main
at app/main.f90:2
Segmentation fault (core dumped)
<ERROR> Execution for object " MWE_stdlib_lstsq " returned exit code 139
<ERROR> *cmd_run*:stopping due to failed executions
STOP 139
The odd thing is that the following code runs smoothly
program main
use stdlib_kinds, only: dp
use stdlib_linalg, only: solve_lstsq
implicit none
! Dimension of the problem.
integer, parameter :: n = 42
! Data for the least-squares problem.
complex(dp) :: A(n+1, n), b(n+1), x_true(n), x_lstsq(n)
! Internal variables.
real(dp), allocatable :: tmp(:, :, :), tmp_vec(:, :)
! Zero-out data.
A = 0.0_dp ; b = 0.0_dp ; x_true = 0.0_dp ; x_lstsq = 0.0_dp
allocate(tmp(n+1, n, 2)) ; tmp = 0.0_dp
allocate(tmp_vec(n, 2)) ; tmp_vec = 0.0_dp
! Generate a random complex least-squares problem of size (n+1, n).
call random_number(tmp) ; call random_number(tmp_vec)
A = cmplx(tmp(:, :, 1), tmp(:, :, 2), kind=dp)
x_true = cmplx(tmp_vec(:, 1), tmp_vec(:, 2), kind=dp)
b = matmul(A, x_true)
! Dellocate arrays.
deallocate(tmp, tmp_vec)
! Solve the lstsq problem.
call solve_lstsq(A, b, x_lstsq)
end program main
The only difference is that I deallocate the tmp
and tmp_vec
arrays I used to generate my random complex arrays before actually calling solve_lstsq
. I have no idea what's going on.
I tested your MWE ... weirdly, if you use call solve_lstsq(A, b, x_lstsq, overwrite_a=.true.)
the segfault goes away, seems like an issue with the internal allocation. https://godbolt.org/z/xvj9EqeE6
@loiseaujc @jalvesz, sorry - I just noticed that you've filed this bug. If you'd like my intervention, please cc me as I don't receive notifications from stdlib issues otherwise. I've just tested it on macOS, gcc 13.2.0 (ARM) and the Godbolt example runs just fine in all configurations (overwrite_a = .true.
/.false.
, solve_lstsq
/lstsq
). So, I think this will need a more thorough look