Illegal instruction when built within singularity

Question

Closed this issue 2 months ago · 20 comments

Greetings again. I've been trying to get my program working on a supercomputer within singularity, and it experienced strange crashes that turned out to be related to flint. After a series of steps I ended up with a simple program that reproduces this behaviour. The strange part is that the failure is not reproduced when the program is run outside of singularity.

The program is simple:

#include <string>
#include <iostream>
#include <fstream>
#include <vector>

#include "flint/fmpz_mpoly.h"

int main(int argc, char** argv) {
    fmpz_mpoly_ctx_t ctx;
    std::vector<std::string> vars {"d23", "d45", "d15", "d12", "d34", "d"};
    size_t nvars = vars.size();
    fmpz_mpoly_ctx_init(ctx, nvars, ORD_LEX);
    fmpz_mpoly_t num;
    fmpz_mpoly_t denom;
    fmpz_mpoly_t gcd;
    fmpz_mpoly_init(num, ctx);
    fmpz_mpoly_init(denom, ctx);
    fmpz_mpoly_init(gcd, ctx);
    std::vector<const char*> vars_vector;
    vars_vector.resize(nvars);
    size_t i = 0;
    for (const auto& var : vars) {
        vars_vector[i] = var.c_str();
        ++i;
    }
    std::string i1,i2;
    std::ifstream ifs1("inp1.m");
    ifs1 >> i1;
    ifs1.close();
    std::ifstream ifs2("inp2.m");
    ifs2 >> i2;
    ifs2.close();
    fmpz_mpoly_set_str_pretty(num, i1.c_str(), &vars_vector[0], ctx);
    fmpz_mpoly_set_str_pretty(denom, i2.c_str(), &vars_vector[0], ctx);
    fmpz_mpoly_gcd(gcd, num, denom, ctx);
    char* res = fmpz_mpoly_get_str_pretty(gcd, &vars_vector[0], ctx);
    std::cout << std::string(res) << std::endl;
    fflush(stdout);
    return 0;
}

The required files are attached (or, if not, I will need a way to upload them).
At the gcd calculation it produces "Illegal instruction".

But outside of singularity the answer is 1.

I tried versions of flint starting from Jan 2024 till v3.1.3-p1.

I can provide the singularity instructions or test more versions. Any idea how to approach it?

inp2.m.gz
inp1.m.gz

Answer 1 · 2024-08-24T10:30:54.000Z

Output under gdb

(gdb) run
Starting program: /opt/fire/FIRE7/extra/fuel/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
0x00007ffff755b8c8 in flint_mpn_mul_9_4 () at build/mpn_extras/broadwell/mul_hard_pic.s:2082
2082 build/mpn_extras/broadwell/mul_hard_pic.s: No such file or directory.

Answer 2 · 2024-08-24T10:36:36.000Z

Forgot to add that the program calls gcd thousands or millions of times without problems... this was the first pair of expressions for which I found a crash.

Answer 3 · 2024-08-24T12:12:35.000Z

Probably this is not related to you but to the old singularity version...

Answer 4 · 2024-08-24T15:36:13.000Z

Couple of questions:

1. I have never had a problem with GDB for these functions. What compiler and version of GDB are you using?
2. Are you using an x86 processor?
3. Does build/mpn_extras/broadwell/mul_hard_pic.s exist?
4. Do you know what it says the illegal instruction is? On a somewhat up-to-date FLINT version, this line corresponds to a macro. If your compiler does not support macros, then I suppose we have to eliminate these assembler macros. However, the program should not even compile here. Are you only getting errors with GDB, or do you get errors without GDB as well?
5. Can you confirm that you are compiling the shared library version?

Answer 5 · 2024-08-24T15:52:18.000Z

2 - yes for both hosts. I have an AMD Ryzen on the machine where I build it and an Intel Core where I run it.
3 - amazingly, no. I first thought I had overlooked it, but it does not exist now. Did it exist in any of your versions and was later removed?
4 - it just prints "Illegal instruction" and the process dies. This happens with v3.1.3-p1 too (I started from the January version I had been using before).
5 - The build chain is

        cd extra/flint && ./bootstrap.sh
        cd extra/flint && automake --force-missing --add-missing || echo "This error is not important"
        cd extra/flint && rm -rf autom4te.cache/
        export CXX=$(CCPLUS) && export CC=$(CC) && cd extra/flint && ./configure --with-mpfr=`pwd`/../../usr/ --with-gmp-include=`pwd`/../../usr/include --with-gmp-lib=`pwd`/../../usr/include --prefix=`pwd`/../../usr
        $(MAKE) -C extra/flint
        $(MAKE) -C extra/flint install

To elaborate, I build it in singularity with

singularity build --sandbox fire/ docker://nvcr.io/nvidia/base/ubuntu:22.04_20240212 
singularity shell --writable fire/

Then I install some packages inside the container:

  apt-get install git g++ cmake
  apt-get install ping
  apt-get install iputils-ping
  apt-get install vim
  apt-get install wget
  apt-get install curl
  apt-get install xz
  apt-get install xzip
  apt-get install xunzip
  apt-get install xz-utils
  apt-get install m4
  apt-get install autoreconf
  apt-get install autoconf
  apt-get install libtool

then

singularity build fire.simg fire/

The image is transferred to another computer and launched with
singularity shell ./fire.simg
and then I run this simple code there.

Answer 6 · 2024-08-24T16:06:49.000Z

Still need compiler and GDB version.

2 - yes for both hosts. I have an AMD Ryzen on the machine where I build it and an Intel Core where I run it.
3 - amazingly, no. I first thought I had overlooked it, but it does not exist now. Did it exist in any of your versions and was later removed?
4 - it just prints "Illegal instruction" and the process dies. This happens with v3.1.3-p1 too (I started from the January version I had been using before).

You should be able to generate that file via make build/mpn_extras/broadwell/mul_hard_pic.s. It may have been removed because it is an intermediate file, but I believe I changed this in the main branch so that it is never removed.

I need this because I need to know exactly which instruction it considers illegal.

I would suspect that the cross-compilation without specifying host architecture is causing the problems.

Answer 7 · 2024-08-24T16:08:55.000Z

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

I will try to create the file and rerun and report then (it takes time to build and transfer the image)

Answer 8 · 2024-08-24T16:12:15.000Z

Please ensure that you specify the host architecture when compiling.

Answer 9 · 2024-08-24T16:14:26.000Z

It looks like it really might be the difference between the hosts. Surprisingly, the cluster admins did not warn me about it; they just told me to build the container and transfer it (a user cannot build it on the cluster, since root is needed).

Answer 10 · 2024-08-24T16:15:17.000Z

0x00007ffff755b8c8 in flint_mpn_mul_9_4 () at build/mpn_extras/broadwell/mul_hard_pic.s:2082
warning: Source file is more recent than executable.
2082		m4	%rdi, 0, %rcx, 0, %r8, %rax, %r9, %r10, %r11, %rbx, %rbp

Answer 11 · 2024-08-24T16:16:37.000Z

Please ensure that you specify the host architecture when compiling.

I have not done that yet; I need to look for options suitable for the cluster first.

Answer 12 · 2024-08-24T17:52:48.000Z

Sorry, could you point me to the documentation that lists target hosts? I assume this is the --host= option.

In particular, I need to build for

vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
stepping : 2
microcode : 0x46
cpu MHz : 1234.796
cache size : 35840 KB
physical id : 1
siblings : 28
core id : 14
cpu cores : 14
apicid : 61
initial apicid : 61
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
bogomips : 5193.83
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Answer 13 · 2024-08-24T18:18:49.000Z

On the main branch, run cd config && ./config/config.guess on the system you want to get the system triplet for. Note that FLINT 3.1.X does not support this style of specifying --host; therefore you either need to use the main branch or simply avoid using the -march flag in CFLAGS.

Answer 14 · 2024-08-24T18:26:58.000Z

Not that the guess helps much.
On my laptop it is x86_64-pc-linux-gnu, on the server it is x86_64-unknown-linux-gnu.
But I am not using any -march flags in CFLAGS anyway... although I think flint might add -march=native.
Does switching to the main branch and providing --host=x86_64-unknown-linux-gnu make any sense?

Answer 15 · 2024-08-24T18:29:38.000Z

Just to confirm: You got that result on an up-to-date main branch?

Answer 16 · 2024-08-24T18:31:18.000Z

Oops, forgot to pull.
It's haswell-pc-linux-gnu then.

Answer 17 · 2024-08-24T18:31:54.000Z

And zen-pc-linux-gnu on the laptop

Answer 18 · 2024-08-24T18:34:11.000Z

Not weird at all then, different instruction sets supported. In particular, the ADX instruction set heavily utilized for multiple precision arithmetic is not supported by Haswell.

I will close this since I do not think the issue comes from FLINT. Please reopen if you believe the issue comes from us.
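
For anyone hitting the same symptom, here is a minimal sketch of how one might check on the target machine whether a given extension such as ADX is advertised. It assumes a Linux host where /proc/cpuinfo has a "flags" line like the one posted earlier in this thread; the helper names has_cpu_flag and read_cpu_flags are made up for this illustration and are not part of FLINT.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `flag` appears as a whole
// whitespace-separated token in `flags_line` (the text after "flags :"
// in /proc/cpuinfo). Whole-token matching keeps "bmi" from matching "bmi2".
bool has_cpu_flag(const std::string& flags_line, const std::string& flag) {
    assert(!flag.empty());  // an empty flag name is a caller error
    std::istringstream tokens(flags_line);
    std::string token;
    while (tokens >> token) {
        if (token == flag) return true;
    }
    return false;
}

// Hypothetical helper: returns the value of the first "flags" entry in a
// cpuinfo-style file, or an empty string if the file or entry is missing.
// The default path is an assumption that only holds on Linux.
std::string read_cpu_flags(const std::string& path = "/proc/cpuinfo") {
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "flags") == 0) {
            const auto colon = line.find(':');
            if (colon != std::string::npos) return line.substr(colon + 1);
        }
    }
    return "";
}
```

Calling has_cpu_flag(read_cpu_flags(), "adx") on both the build machine and the cluster node would show the mismatch before anything is built. The flags line posted above for the Xeon E5-2697 v3 lists bmi2 and avx2 but not adx.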

Answer 19 · 2024-08-24T18:37:05.000Z

I see, thank you for support!

Answer 20 · 2024-08-24T19:59:23.000Z

Indeed export CPPFLAGS=-march=haswell helped.
Thank you once more!
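
As a rough way to see what a given -march setting permits, the sketch below reports a few x86 extensions based on the macros that GCC and Clang predefine when an extension is enabled (__ADX__, __BMI2__, __AVX2__). The function name compiled_isa_summary is made up for this illustration. Note the assumption: this only reflects the flags used to compile this one file, not how the FLINT library itself was configured or which assembly routines it selected.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: summarises which extensions the compiler was allowed
// to use for this translation unit, based on GCC/Clang predefined macros.
std::string compiled_isa_summary() {
    std::string out;
#ifdef __AVX2__
    out += "AVX2=yes ";
#else
    out += "AVX2=no ";
#endif
#ifdef __BMI2__
    out += "BMI2=yes ";
#else
    out += "BMI2=no ";
#endif
#ifdef __ADX__
    out += "ADX=yes";
#else
    out += "ADX=no";
#endif
    assert(!out.empty());  // every branch above appends something
    return out;
}
```

Printing the result from a small test program built with the same flags as the rest of the project (for example with -march=haswell versus -march=native on the Ryzen build machine) should make the difference in the ADX entry visible.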