ralna/spral

SSMFE C interface segfaults on Windows

jfowkes opened this issue · 15 comments

Moving over to meson has enabled us to test on Windows and this has exposed a segfault in the SSMFE C interface:

test:         ssmfet_c
start time:   14:05:58
duration:     0.05s
result:       (exit status 3221225725 or signal 3221225597 SIGinvalid)

Not entirely sure what these strangely large exit statuses mean. @amontoison?
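
For reference, the first number can be decoded by printing it in hexadecimal: 3221225725 is 0xC00000FD, which is the Windows NTSTATUS code STATUS_STACK_OVERFLOW. The second number is exactly 128 less than the first, which suggests (an assumption about the test runner, not confirmed in this thread) that it is the same value with a POSIX-style offset of 128 subtracted rather than a real signal number. A minimal C sketch of the decoding, with hypothetical helper names:

```c
#include <stdint.h>
#include <stdio.h>

/* NTSTATUS value that Windows reports for a stack overflow. */
#define NT_STATUS_STACK_OVERFLOW UINT32_C(0xC00000FD)

/* Hypothetical helper: returns 1 if the exit status is STATUS_STACK_OVERFLOW. */
int is_stack_overflow(uint32_t exit_status) {
  return exit_status == NT_STATUS_STACK_OVERFLOW;
}

/* Hypothetical helper: prints a status in decimal and hexadecimal. */
void print_status(uint32_t exit_status) {
  printf("%lu = 0x%08lX%s\n", (unsigned long)exit_status,
         (unsigned long)exit_status,
         is_stack_overflow(exit_status) ? " (STATUS_STACK_OVERFLOW)" : "");
}
```

Calling print_status(3221225725u) shows the hexadecimal form next to the decimal one and flags it as a stack overflow.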

@jfowkes I tried to find something with the highest warning level in #159 but I found nothing :(
Maybe you could try ralna/GALAHAD#108?

@amontoison good shout, will see if I can run some sanitisers...

@amontoison I'm getting:

FAILED: libspral.dll 
"gfortran" @libspral.dll.rsp
c:/programdata/chocolatey/lib/mingw/tools/install/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lasan: No such file or directory
c:/programdata/chocolatey/lib/mingw/tools/install/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lubsan: No such file or directory

I guess the sanitizers don't work on Windows?

I checked online and the sanitizers are not supported with GCC (MinGW) on Windows.
https://stackoverflow.com/questions/55018627/cannot-find-lasan-using-address-sanitizer-in-mingw-in-windows-mingw
Maybe we should split ssmfet_c into smaller unit tests to isolate the issue?

Good plan! I will have a go next week at splitting up the ssmfet_c tests to try and isolate the issue.

@amontoison here is the C main function for the SSMFE test:

int main(void) {

  int errors = 0;
  int err;

  fprintf(stdout, "testing ssmfe_core...\n");
  err = test_core();
  errors += err;
  fprintf(stdout, "%d errors\n", err);

  fprintf(stdout, "testing ssmfe_expert...\n");
  err = test_expert();
  errors += err;
  fprintf(stdout, "%d errors\n", err);

  fprintf(stdout, "testing ssmfe...\n");
  err = test_ssmfe();
  errors += err;
  fprintf(stdout, "%d errors\n", err);

  fprintf(stdout, "=============================\n");
  fprintf(stdout, "Total number of errors = %d\n", errors);

  return errors;
}

Why are we not seeing the first print line (testing ssmfe_core...) being printed in the logs on Windows? Is this because the test errors out before even getting to this line?

Can you comment out the first test, test_core()?
I suspect that test_core() crashes, so err is never assigned a value from it.

Indeed that appears to be the case, I've just flushed the print statements in main and I get:

----------------------------------- stdout -----------------------------------
testing ssmfe_core...

before it crashes. I will add some more flushes to test_core to try to isolate the issue.
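
Flushing matters here because stdout is typically fully buffered when it is redirected to a file or pipe, as it usually is under a test harness, so any text still in the buffer is lost when the process dies abnormally. A minimal sketch of a flushing progress helper (the name log_progress is hypothetical and not part of the SPRAL tests):

```c
#include <stdio.h>

/* Hypothetical helper: write a progress message and flush it immediately,
 * so it reaches the log even if the process crashes right afterwards.
 * Returns 0 on success and EOF on failure. */
int log_progress(const char *msg) {
  if (fputs(msg, stdout) == EOF) {
    return EOF;
  }
  return fflush(stdout);
}
```

An alternative is a single call to setvbuf(stdout, NULL, _IONBF, 0) at the top of main, before any output, which disables buffering for every later print.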

@amontoison okay I have tracked this issue down to the following VLA allocation in the test_core_z double complex test routine:

double complex X[n][n];        /* eigenvectors storage */

where n=400 so this tries to allocate a 400x400 double complex VLA. So it looks like we're getting a stack overflow, is the Windows stack just really tiny or something??

EDIT: according to my calculations the size of X is only 2.56 MB!
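
The arithmetic behind that figure can be checked with a few lines of C (the helper name is hypothetical). A double complex holds two doubles, 16 bytes on common platforms, so a 400x400 array needs 2,560,000 bytes, while the same array of double needs 1,280,000 bytes. Whether either fits depends on the stack size the test executable was linked with, which is not stated in this thread.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed for an n x n array of elements of the
 * given size, as in `double complex X[n][n]`. Assumes n*n*elem_size does
 * not overflow size_t, which holds easily for n = 400. */
size_t square_matrix_bytes(size_t n, size_t elem_size) {
  return n * n * elem_size;
}
```

For example, square_matrix_bytes(400, sizeof(double complex)) gives the size of X in test_core_z, and square_matrix_bytes(400, sizeof(double)) gives the size of X in test_core_d.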

I checked a little bit online and it seems VLAs are not necessarily supported by default: since C11 they are optional, and a compiler that does not support them defines the macro __STDC_NO_VLA__.

https://groups.google.com/g/comp.std.c/c/AoB6LFHcd88

So what you're saying is that on Windows MinGW defines __STDC_NO_VLA__? I find that hard to believe...

VLAs are not supported by MSVC so it could explain that gcc on Windows defines it.
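
Whether a given compiler defines __STDC_NO_VLA__ can be checked at compile time rather than guessed. A minimal sketch (the function name is hypothetical):

```c
/* Hypothetical helper: returns 1 if the compiler advertises VLA support,
 * 0 if it defines __STDC_NO_VLA__ (the C11 way of opting out of VLAs). */
int vla_supported(void) {
#ifdef __STDC_NO_VLA__
  return 0;
#else
  return 1;
#endif
}
```

Compiling this with the same gcc used for the Windows build and printing the result would settle the question for that toolchain.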

I don't think VLAs are the issue, since the test_core_d double test routine has:

double X[n][n];                /* eigenvectors storage */

and this passes the test on Windows (see the log in #162 )!!
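
X in test_core_d is half the size of X in test_core_z, so a stack limit lying between the two sizes would explain why only the complex test crashes. One way to test that hypothesis, independent of the VLA question, would be to move X off the stack. A minimal sketch, assuming a flat row-major layout is acceptable to the caller (the helper names are hypothetical and this is not the SPRAL test code):

```c
#include <complex.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a zero-initialised n x n double complex
 * matrix on the heap as one flat block. Returns NULL on failure or if
 * n*n would overflow size_t. The caller releases it with free(). */
double complex *alloc_square_z(size_t n) {
  if (n != 0 && n > SIZE_MAX / n) {
    return NULL;
  }
  return calloc(n * n, sizeof(double complex));
}

/* Hypothetical helper: flat index of element (i, j) in a row-major
 * n x n matrix, matching the memory layout of X[i][j] in X[n][n]. */
size_t flat_index(size_t i, size_t j, size_t n) {
  return i * n + j;
}
```

If the crash disappears with heap storage and n=400 unchanged, that would point to stack size rather than VLA support.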