fortran-lang/stdlib

Bug in the complex least-squares solver

loiseaujc opened this issue · 6 comments

Description

Consider the following code

program main
  use iso_fortran_env, only: output_unit
  use stdlib_kinds, only: dp
  use stdlib_linalg, only: solve_lstsq
  implicit none

  ! Dimension of the problem.
  integer, parameter :: m = 100, n = m-1
  integer :: k
  complex(dp) :: A(m, n), b(m), x_true(n), x_lstsq(n)

  do k = 1, n
      ! Zero-out data.
      A = 0.0_dp ; b = 0.0_dp ; x_true = 0.0_dp ; x_lstsq = 0.0_dp

      ! Generate a random least-squares problem of size (k+1, k).
      call random_number(A(:k+1, :k)%re) ; call random_number(A(:k+1, :k)%im)
      call random_number(x_true(:k)%re) ; call random_number(x_true(:k)%im)
      b(:k+1) = matmul(A(:k+1, :k), x_true(:k))

      ! Solve the lstsq problem.
      call solve_lstsq(A(:k+1, :k), b(:k+1), x_lstsq(:k))

      ! Check correctness of the solution.
      write(output_unit, *) "Iteration k", k, norm2(abs(x_true - x_lstsq))
  end do

end program main

Although it looks quite contrived, this is adapted from a naïve implementation of a GMRES solver, where A actually is upper Hessenberg. It corresponds to the increasingly large least-squares problems that need to be solved at each GMRES iteration. In this MWE, A is generated as a random matrix just to make sure the error I'll report does not depend on the structure of the matrix.

Initially, everything runs just fine. However, the code crashes at k = 39 with the following error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f147f8eb51f in ???
#1  0x7f147f94b838 in ???
#2  0x7f147f94e452 in ???
#3  0x559663cefe05 in __stdlib_linalg_MOD_stdlib_linalg_z_solve_lstsq_one
	at build/dependencies/stdlib/src/stdlib_linalg_least_squares.f90:1142
#4  0x559663cc6b47 in MAIN__
	at app/main.f90:115
#5  0x559663ccd799 in main
	at app/main.f90:3
Segmentation fault (core dumped)
<ERROR> Execution for object " MWE_stdlib_lstsq " returned exit code  139
<ERROR> *cmd_run*:stopping due to failed executions
STOP 139

Note that the reported error actually switches randomly between double free or corruption (out) (with STOP 134), SIGSEGV (with STOP 139), munmap_chunk(): invalid pointer (with STOP 134), and free(): invalid size (with STOP 134). If single precision is used instead of double precision, the code also crashes at iteration k = 39.

Expected Behaviour

The code should run smoothly. Note that if real variables are used instead of complex ones, no problem is encountered.

Version of stdlib

0.6.1

Platform and Architecture

Linux with Ubuntu 22.04

Additional Information

The code is generated with fpm and the latest stdlib version is fetched directly by fpm. The code is compiled with gfortran 13.2.0 installed with conda.

I just re-ran the code, and here is what I observe

 Iteration k          37   1.0316717587493331E-013
 Iteration k          38   6.1138174828153277E-014
 Iteration k          39   200.59206843713523     
 Iteration k          40                       NaN
 Iteration k          41   8.7331139962376985E+111

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f821db2751f in ???
#1  0x7f821db8a449 in ???
#2  0x561f4d0df436 in ???
#3  0x561f4d0d1d86 in ???
#4  0x561f4d0d046e in ???
#5  0x7f821db0ed8f in ???
#6  0x7f821db0ee3f in ???
#7  0x561f4d0d04ad in ???
#8  0xffffffffffffffff in ???
Segmentation fault (core dumped)
<ERROR> Execution for object " MWE_stdlib_lstsq " returned exit code  139
<ERROR> *cmd_run*:stopping due to failed executions
STOP 139

Although the error is kind of random (sometimes 139, sometimes 134), it is always the same thing:

  • Iteration 38 is fine.
  • Iteration 39 runs but the result is incorrect.
  • Iterations 40 to 42 return either a crazy large number or a NaN before the code eventually crashes.

The same applies if I declare all arrays as allocatable and allocate them with the appropriate size at each iteration (followed by deallocation). Note finally that, if I instead use the least-squares solver I wrote in my package (which uses the gels interface from stdlib_linalg), everything runs fine. I believe this problem comes from the fix applied here #818.
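For context, the gels-based fallback mentioned above can be sketched as follows. This is a minimal sketch only: it assumes stdlib's stdlib_linalg_lapack module exposes a generic gels with the standard LAPACK zgels signature (trans, m, n, nrhs, a, lda, b, ldb, work, lwork, info), and it reuses the random-problem setup from the MWE. The QR-based gels driver is a different code path from the SVD-based gelsd that solve_lstsq goes through, which may be why it does not trigger the crash.

```fortran
program gels_fallback
  use stdlib_kinds, only: dp
  ! Assumption: stdlib_linalg_lapack provides the generic LAPACK-style gels.
  use stdlib_linalg_lapack, only: gels
  implicit none

  integer, parameter :: n = 42, m = n + 1
  complex(dp) :: A(m, n), b(m, 1), x_true(n), x(n)
  complex(dp), allocatable :: work(:)
  integer :: info, lwork

  ! Random complex least-squares problem, as in the MWE.
  call random_number(A%re) ; call random_number(A%im)
  call random_number(x_true%re) ; call random_number(x_true%im)
  b(:, 1) = matmul(A, x_true)

  ! Workspace query (lwork = -1), then the actual solve. Note that gels
  ! overwrites A with its QR factorization and b with the solution.
  allocate(work(1))
  call gels('N', m, n, 1, A, m, b, m, work, -1, info)
  lwork = nint(real(work(1), dp))
  deallocate(work) ; allocate(work(lwork))
  call gels('N', m, n, 1, A, m, b, m, work, lwork, info)

  ! On exit, the least-squares solution sits in the first n rows of b.
  x = b(1:n, 1)

end program gels_fallback
```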

I'm closing this issue for the moment. While in the train, I played around with the code and found that it is very subtle to reproduce. I'm not quite sure what is happening. I'll re-open some time later once I've investigated a bit more (I don't want to make you lose time on this for the moment).

Sorry for the back and forth, but I've been able to isolate the smallest MWE and it's quite reproducible. Here is the code

program main
  use stdlib_kinds, only: dp
  use stdlib_linalg, only: solve_lstsq
  implicit none

  ! Dimension of the problem.
  integer, parameter :: n = 42
  ! Data for the least-squares problem.
  complex(dp) :: A(n+1, n), b(n+1), x_true(n), x_lstsq(n)
  ! Internal variables.
  real(dp), allocatable :: tmp(:, :, :), tmp_vec(:, :)

  ! Zero-out data.
  A = 0.0_dp ; b = 0.0_dp ; x_true = 0.0_dp ; x_lstsq = 0.0_dp
  allocate(tmp(n+1, n, 2))  ; tmp = 0.0_dp
  allocate(tmp_vec(n, 2)) ; tmp_vec = 0.0_dp

  ! Generate a random complex least-squares problem of size (n+1, n).
  call random_number(tmp) ; call random_number(tmp_vec)
  A = cmplx(tmp(:, :, 1), tmp(:, :, 2), kind=dp)
  x_true = cmplx(tmp_vec(:, 1), tmp_vec(:, 2), kind=dp)
  b = matmul(A, x_true)

  ! Solve the lstsq problem.
  call solve_lstsq(A, b, x_lstsq)

  ! Deallocate arrays.
  deallocate(tmp, tmp_vec)

end program main

This code crashes with the following stacktrace

Project is up to date

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f5fe0a0651f in ???
        at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1  0x562c6d67ef43 in __stdlib_linalg_lapack_z_MOD_stdlib_zlarft
        at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:11335
#2  0x562c6d57c5ca in __stdlib_linalg_lapack_z_MOD_stdlib_zunmqr
        at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:31493
#3  0x562c6d420914 in __stdlib_linalg_lapack_z_MOD_stdlib_zunmbr
        at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:64122
#4  0x562c6d41d275 in __stdlib_linalg_lapack_z_MOD_stdlib_zgelsd
        at build/dependencies/stdlib/src/stdlib_linalg_lapack_z.F90:64721
#5  0x562c6cc5ba64 in __stdlib_linalg_MOD_stdlib_linalg_z_solve_lstsq_one
        at build/dependencies/stdlib/src/stdlib_linalg_least_squares.f90:1129
#6  0x562c6cc39a9a in MAIN__
        at app/main.f90:25
#7  0x562c6cc39b8e in main
        at app/main.f90:2
Segmentation fault (core dumped)
<ERROR> Execution for object " MWE_stdlib_lstsq " returned exit code  139
<ERROR> *cmd_run*:stopping due to failed executions
STOP 139

The odd thing is that the following code runs smoothly

program main
  use stdlib_kinds, only: dp
  use stdlib_linalg, only: solve_lstsq
  implicit none

  ! Dimension of the problem.
  integer, parameter :: n = 42
  ! Data for the least-squares problem.
  complex(dp) :: A(n+1, n), b(n+1), x_true(n), x_lstsq(n)
  ! Internal variables.
  real(dp), allocatable :: tmp(:, :, :), tmp_vec(:, :)

  ! Zero-out data.
  A = 0.0_dp ; b = 0.0_dp ; x_true = 0.0_dp ; x_lstsq = 0.0_dp
  allocate(tmp(n+1, n, 2))  ; tmp = 0.0_dp
  allocate(tmp_vec(n, 2)) ; tmp_vec = 0.0_dp

  ! Generate a random complex least-squares problem of size (n+1, n).
  call random_number(tmp) ; call random_number(tmp_vec)
  A = cmplx(tmp(:, :, 1), tmp(:, :, 2), kind=dp)
  x_true = cmplx(tmp_vec(:, 1), tmp_vec(:, 2), kind=dp)
  b = matmul(A, x_true)

  ! Deallocate arrays.
  deallocate(tmp, tmp_vec)

  ! Solve the lstsq problem.
  call solve_lstsq(A, b, x_lstsq)

end program main

The only difference is that I deallocate the tmp and tmp_vec arrays I used to generate my random complex arrays before actually calling solve_lstsq. I have no idea what's going on.

I tested your MWE ... weirdly, if you use call solve_lstsq(A, b, x_lstsq, overwrite_a=.true.) the segfault goes away; it seems like an issue with the internal allocation. https://godbolt.org/z/xvj9EqeE6
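For reference, this workaround amounts to flipping a single optional argument in the MWE. A sketch of the changed call (assuming the rest of the MWE is unchanged; note that with overwrite_a=.true., solve_lstsq is allowed to work on A in place, so A is destroyed on return and must be copied beforehand if it is needed afterwards):

```fortran
! Workaround: let solve_lstsq reuse A's storage instead of allocating an
! internal copy, which appears to sidestep the crash. A is destroyed on exit.
call solve_lstsq(A, b, x_lstsq, overwrite_a=.true.)
```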

@loiseaujc @jalvesz, sorry - I just noticed that you've filed this bug. If you'd like my intervention, please cc me, as I don't receive notifications from stdlib issues otherwise. I've just tested it on macOS, gcc 13.2.0 (ARM), and the Godbolt example runs just fine in all configurations (overwrite_a = .true./.false., solve_lstsq/lstsq). So I think this will need a more thorough look.

@perazz I just reran the example changing overwrite_a = .false. and get a segfault in Godbolt (note that I also had to modify the size of x_lstsq to n+1 for it to run).

Maybe the version of stdlib in Compiler Explorer is not up to date?