illegal instruction on system without avx2
atom-andrew opened this issue · 11 comments
I'm trying to use python-blosc2 (2.6.2) on a system without AVX2, but I'm getting an illegal instruction error when AVX2 instructions are executed. The failing instruction disassembles as follows:
=> 0x00007f4773378934 <+212>: vinserti128 $0x1,%xmm1,%ymm0,%ymm0
As you can see below (using gcc 10, which appears to be the major version used to compile the binary), avx2 is not reported as available on the system.
> cat builtin.c
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();
    printf("%d\n", __builtin_cpu_supports("sse2"));
    printf("%d\n", __builtin_cpu_supports("avx"));
    printf("%d\n", __builtin_cpu_supports("avx2"));
    printf("%d\n", __builtin_cpu_supports("avx512bw"));
    return 0;
}
> /usr/bin/gcc-10 ./builtin.c; ./a.out
16
512
0
0
The c-blosc2 library I'm using is from the wheel and appears to have been built with gcc 10.2.1:
strings -a lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so | grep GCC
GCC: (GNU) 10.2.1 20210130 (Red Hat 10.2.1-11)
My expectation is that the library would not use shuffle_avx2
based on the runtime flags, but for some reason python-blosc2 is trying to use it anyway. Is there any other reason that we might still try to use it despite the values from __builtin_cpu_supports? Thanks.
Which codec are you using inside blosc2? Can you send some code to reproduce the issue? There is a chance that this comes from the internal zlib-ng. For the development team, part of the problem is that finding boxes without AVX2 for testing is increasingly difficult for us.
This issue happened while decompressing a value that was compressed using "zstd" with blosc1. FWIW, I can't seem to reproduce the issue when compressing and decompressing entirely within blosc2, although I don't have the same data in my experiments. I'm pretty sure the failure happens in the blosc2 library in unshuffle,
because it is also in the stack trace. Here is the bottom of the stack trace:
#0 __pthread_kill_implementation (no_tid=0, signo=4, threadid=139944524277312) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=4, threadid=139944524277312) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139944524277312, signo=signo@entry=4) at ./nptl/pthread_kill.c:89
#3 0x00007f477b03a476 in __GI_raise (sig=4) at ../sysdeps/posix/raise.c:26
#4 <signal handler called>
#5 0x00007f4773378934 in unshuffle () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#6 0x00007f477334e6c3 in pipeline_backward () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#7 0x00007f47733535a7 in blosc_d () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#8 0x00007f4773353d25 in do_job () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#9 0x00007f477335547f in blosc_run_decompression_with_context () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#10 0x00007f4773355a54 in blosc2_decompress () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#11 0x00007f47733474c9 in __pyx_pf_6blosc2_10blosc2_ext_8decompress.constprop.0 () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#12 0x00007f4773347bc7 in __pyx_pw_6blosc2_10blosc2_ext_9decompress () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
I've been able to reproduce this using a simpler example on my AVX2-free VM.
- Install Python 3.11.
- Install both blosc and blosc2.
- Run this:
import blosc
import blosc2
x = blosc.compress(b"hello" * 1000, cname="zstd")
y = blosc2.decompress(x) # yields an illegal instruction
The bottom of the stack trace looks like this:
(gdb)
#0 0x00007fc2f9df95d4 in unshuffle () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#1 0x00007fc2f9dcf363 in pipeline_backward () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#2 0x00007fc2f9dd4247 in blosc_d () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#3 0x00007fc2f9dd49c5 in do_job () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#4 0x00007fc2f9dd611f in blosc_run_decompression_with_context () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#5 0x00007fc2f9dd66f4 in blosc2_decompress () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#6 0x00007fc2f9d932d2 in __pyx_pf_6blosc2_10blosc2_ext_8decompress.constprop.0 () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#7 0x00007fc2f9d93907 in __pyx_pw_6blosc2_10blosc2_ext_9decompress () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#8 0x00007fc3065e7528 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fc3004fbe00, tstate=0x7fc306a176b8 <_PyRuntime+166328>) at ./Include/internal/pycore_call.h:92
The disassembly of the function containing the illegal instruction looks like this:
(gdb) disassemble 0x00007fc2f9df95d4
Dump of assembler code for function unshuffle:
0x00007fc2f9df9500 <+0>: mov 0x186da2(%rip),%eax # 0x7fc2f9f802a8 <implementation_initialized>
0x00007fc2f9df9506 <+6>: test %eax,%eax
0x00007fc2f9df9508 <+8>: je 0x7fc2f9df9510 <unshuffle+16>
0x00007fc2f9df950a <+10>: jmp *0x186d80(%rip) # 0x7fc2f9f80290 <host_implementation+16>
0x00007fc2f9df9510 <+16>: push %rbp
0x00007fc2f9df9511 <+17>: lea 0x186dd8(%rip),%rax # 0x7fc2f9f802f0 <__cpu_model>
0x00007fc2f9df9518 <+24>: lea 0xd7041(%rip),%r10 # 0x7fc2f9ed0560 <bshuf_untrans_bit_elem_AVX>
0x00007fc2f9df951f <+31>: lea 0xd5e0a(%rip),%r9 # 0x7fc2f9ecf330 <bshuf_trans_bit_elem_AVX>
0x00007fc2f9df9526 <+38>: lea 0xd55e3(%rip),%r8 # 0x7fc2f9eceb10 <shuffle_avx2>
0x00007fc2f9df952d <+45>: lea 0x1468d5(%rip),%r11 # 0x7fc2f9f3fe09
0x00007fc2f9df9534 <+52>: mov %rsp,%rbp
0x00007fc2f9df9537 <+55>: push %r12
0x00007fc2f9df9539 <+57>: push %rbx
0x00007fc2f9df953a <+58>: mov 0xc(%rax),%ebx
0x00007fc2f9df953d <+61>: lea 0xd595c(%rip),%rax # 0x7fc2f9eceea0 <unshuffle_avx2>
0x00007fc2f9df9544 <+68>: mov %ebx,%r12d
0x00007fc2f9df9547 <+71>: and $0x10,%r12d
0x00007fc2f9df954b <+75>: and $0x4,%bh
0x00007fc2f9df954e <+78>: jne 0x7fc2f9df95ad <unshuffle+173>
0x00007fc2f9df9550 <+80>: test %r12d,%r12d
0x00007fc2f9df9553 <+83>: lea 0xd0cc6(%rip),%rax # 0x7fc2f9eca220 <bshuf_untrans_bit_elem_scal>
0x00007fc2f9df955a <+90>: lea 0xd392f(%rip),%r10 # 0x7fc2f9ecce90 <bshuf_untrans_bit_elem_SSE>
0x00007fc2f9df9561 <+97>: cmove %rax,%r10
0x00007fc2f9df9565 <+101>: lea 0xd31f4(%rip),%r9 # 0x7fc2f9ecc760 <bshuf_trans_bit_elem_SSE>
0x00007fc2f9df956c <+108>: lea 0xd088d(%rip),%rax # 0x7fc2f9ec9e00 <bshuf_trans_bit_elem_scal>
0x00007fc2f9df9573 <+115>: lea 0xd2446(%rip),%r8 # 0x7fc2f9ecb9c0 <unshuffle_sse2>
0x00007fc2f9df957a <+122>: cmove %rax,%r9
0x00007fc2f9df957e <+126>: lea 0xd034b(%rip),%rax # 0x7fc2f9ec98d0 <unshuffle_generic>
0x00007fc2f9df9585 <+133>: lea 0xd02c4(%rip),%r11 # 0x7fc2f9ec9850 <shuffle_generic>
0x00007fc2f9df958c <+140>: cmovne %r8,%rax
0x00007fc2f9df9590 <+144>: lea 0xd1f39(%rip),%r8 # 0x7fc2f9ecb4d0 <shuffle_sse2>
0x00007fc2f9df9597 <+151>: lea 0x13d46a(%rip),%rbx # 0x7fc2f9f36a08
0x00007fc2f9df959e <+158>: cmove %r11,%r8
0x00007fc2f9df95a2 <+162>: lea 0x14685b(%rip),%r11 # 0x7fc2f9f3fe04
0x00007fc2f9df95a9 <+169>: cmove %rbx,%r11
0x00007fc2f9df95ad <+173>: vmovq %r9,%xmm2
0x00007fc2f9df95b2 <+178>: vmovq %r8,%xmm3
0x00007fc2f9df95b7 <+183>: mov %r11,0x186cc2(%rip) # 0x7fc2f9f80280 <host_implementation>
0x00007fc2f9df95be <+190>: vpinsrq $0x1,%r10,%xmm2,%xmm1
0x00007fc2f9df95c4 <+196>: vpinsrq $0x1,%rax,%xmm3,%xmm0
0x00007fc2f9df95ca <+202>: movl $0x1,0x186cd4(%rip) # 0x7fc2f9f802a8 <implementation_initialized>
=> 0x00007fc2f9df95d4 <+212>: vinserti128 $0x1,%xmm1,%ymm0,%ymm0
0x00007fc2f9df95da <+218>: vmovdqu %xmm0,0x186ca6(%rip) # 0x7fc2f9f80288 <host_implementation+8>
0x00007fc2f9df95e2 <+226>: vextracti128 $0x1,%ymm0,0x186cac(%rip) # 0x7fc2f9f80298 <host_implementation+24>
0x00007fc2f9df95ec <+236>: vzeroupper
0x00007fc2f9df95ef <+239>: pop %rbx
0x00007fc2f9df95f0 <+240>: pop %r12
0x00007fc2f9df95f2 <+242>: pop %rbp
0x00007fc2f9df95f3 <+243>: jmp *%rax
End of assembler dump.
It's worth noting that if I compress the data using blosc2 and the ZSTD codec, I don't have an issue.
I've had a chance to investigate this more. It's been a while since I've used gdb, so it took me a moment, but it's probably obvious to you that the issue is an AVX2 instruction in the generic unshuffle
code; we never even get as far as calling unshuffle_avx2.
I thought this might be because the gcc-built wheel targets the instruction set of the build host, but curiously enough I built the wheel locally and got the same failure.
That's interesting. Can you double-check that the issue is in the generic unshuffle
by putting e.g. a printf
somewhere in https://github.com/Blosc/c-blosc2/blob/main/blosc/shuffle-generic.h#L62? Chances are that modern compilers optimize this function using AVX2 instructions. Also, can you check whether this also happens if you pass a chunk compressed with blosc2?
At Blosc/c-blosc2#436, it looks like we started compiling shuffle.c using -mavx2 and -msse2. I expect this is the source of the issue.
@FrancescAlted I know I didn't get around to following up with the printf proof of where the offending code is generated, but I think the link I provided and a glance at the build logs show that the compiler is targeting a potentially unavailable instruction set when building this file. Please let me know if this is something you might look into; otherwise I'll need to find a workaround. Thanks.
We are discussing this right now in PR Blosc/c-blosc2#622. If you can help (with code or testing), you are more than welcome.
Now that Blosc/c-blosc2#622 has been merged, this should be fixed. Please feel free to reopen if necessary.
@FrancescAlted Is a 2.x python-blosc2 release planned with these changes to c-blosc2? Thanks.
Python-Blosc2 2.7.1 is out. It should fix this issue.