Blosc/python-blosc2

illegal instruction on system without avx2

atom-andrew opened this issue · 11 comments

I'm trying to use python-blosc2 (2.6.2) on a system without AVX2, but I'm getting an illegal instruction error when an AVX2 instruction is executed. The failing instruction disassembles as follows:

=> 0x00007f4773378934 <+212>:	vinserti128 $0x1,%xmm1,%ymm0,%ymm0

As you can see below (using gcc 10, which appears to be the major version that was used to compile the binary), AVX2 is not reported as available on the system.

> cat builtin.c
#include <stdio.h>

int main(void) {
  __builtin_cpu_init();
  printf("%d\n", __builtin_cpu_supports ("sse2"));
  printf("%d\n", __builtin_cpu_supports ("avx"));
  printf("%d\n", __builtin_cpu_supports ("avx2"));
  printf("%d\n", __builtin_cpu_supports ("avx512bw"));
}
> /usr/bin/gcc-10 ./builtin.c; ./a.out
16
512
0
0

The c-blosc2 library I'm using is from the wheel and appears to have been built with gcc 10.2.1:

strings -a  lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so | grep GCC
GCC: (GNU) 10.2.1 20210130 (Red Hat 10.2.1-11)

My expectation is that the library would not use shuffle_avx2 based on the runtime CPU flags, but for some reason python-blosc2 is trying to use it anyway. Is there any other reason we might still try to use it despite the values reported by __builtin_cpu_supports? Thanks.
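For reference, here is a minimal sketch of the kind of runtime dispatch I had in mind. The kernel names are borrowed from the disassembly below, but the signatures and bodies are purely illustrative, not the actual c-blosc2 code:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in kernels; the real AVX2/SSE2/generic unshuffle routines live in
   c-blosc2 and have different signatures. */
static void unshuffle_avx2(const uint8_t *src, uint8_t *dst, size_t n)    { memcpy(dst, src, n); }
static void unshuffle_sse2(const uint8_t *src, uint8_t *dst, size_t n)    { memcpy(dst, src, n); }
static void unshuffle_generic(const uint8_t *src, uint8_t *dst, size_t n) { memcpy(dst, src, n); }

/* Dispatch on what the CPU reports at runtime, not on what the compiler was
   allowed to emit at build time. */
void unshuffle_dispatch(const uint8_t *src, uint8_t *dst, size_t n) {
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2"))
    unshuffle_avx2(src, dst, n);
  else if (__builtin_cpu_supports("sse2"))
    unshuffle_sse2(src, dst, n);
  else
    unshuffle_generic(src, dst, n);
}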

Which codec are you using inside blosc2? Can you send some code to reproduce the issue? There is a chance that this comes from the internal zlib-ng. One difficulty for the development team is that finding boxes without AVX2 for testing is increasingly hard for us.

This issue happened while decompressing a value that was compressed with "zstd" using blosc1. FWIW, I can't seem to trigger it when compressing and decompressing entirely within blosc2, although I don't have the same data in my experiments. I'm pretty sure the failure happens in the blosc2 library in unshuffle, because that function is also in the stack trace. Here is the bottom of the stack trace:

#0  __pthread_kill_implementation (no_tid=0, signo=4, threadid=139944524277312) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=4, threadid=139944524277312) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139944524277312, signo=signo@entry=4) at ./nptl/pthread_kill.c:89
#3  0x00007f477b03a476 in __GI_raise (sig=4) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  0x00007f4773378934 in unshuffle () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#6  0x00007f477334e6c3 in pipeline_backward () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#7  0x00007f47733535a7 in blosc_d () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#8  0x00007f4773353d25 in do_job () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#9  0x00007f477335547f in blosc_run_decompression_with_context () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#10 0x00007f4773355a54 in blosc2_decompress () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#11 0x00007f47733474c9 in __pyx_pf_6blosc2_10blosc2_ext_8decompress.constprop.0 () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so
#12 0x00007f4773347bc7 in __pyx_pw_6blosc2_10blosc2_ext_9decompress () from /home/andrew/miniconda3/envs/raven-env/lib/python3.12/site-packages/blosc2/blosc2_ext.cpython-312-x86_64-linux-gnu.so

I've been able to reproduce this with a simpler example on my AVX2-free VM.

  1. Install Python 3.11.
  2. Install both blosc and blosc2.
  3. Run this:
import blosc
import blosc2

x = blosc.compress(b"hello" * 1000, cname="zstd")
y = blosc2.decompress(x)  # yields an illegal instruction

The bottom of the stack trace looks like this:

(gdb)
#0  0x00007fc2f9df95d4 in unshuffle () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#1  0x00007fc2f9dcf363 in pipeline_backward () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#2  0x00007fc2f9dd4247 in blosc_d () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#3  0x00007fc2f9dd49c5 in do_job () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#4  0x00007fc2f9dd611f in blosc_run_decompression_with_context () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#5  0x00007fc2f9dd66f4 in blosc2_decompress () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#6  0x00007fc2f9d932d2 in __pyx_pf_6blosc2_10blosc2_ext_8decompress.constprop.0 () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#7  0x00007fc2f9d93907 in __pyx_pw_6blosc2_10blosc2_ext_9decompress () from /home/andrew/.pyenv/versions/3.11.9/envs/blosc-test/lib/python3.11/site-packages/blosc2/blosc2_ext.cpython-311-x86_64-linux-gnu.so
#8  0x00007fc3065e7528 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fc3004fbe00, tstate=0x7fc306a176b8 <_PyRuntime+166328>) at ./Include/internal/pycore_call.h:92

The disassembly of the illegal instruction looks like this:

(gdb) disassemble 0x00007fc2f9df95d4
Dump of assembler code for function unshuffle:
   0x00007fc2f9df9500 <+0>:	mov    0x186da2(%rip),%eax        # 0x7fc2f9f802a8 <implementation_initialized>
   0x00007fc2f9df9506 <+6>:	test   %eax,%eax
   0x00007fc2f9df9508 <+8>:	je     0x7fc2f9df9510 <unshuffle+16>
   0x00007fc2f9df950a <+10>:	jmp    *0x186d80(%rip)        # 0x7fc2f9f80290 <host_implementation+16>
   0x00007fc2f9df9510 <+16>:	push   %rbp
   0x00007fc2f9df9511 <+17>:	lea    0x186dd8(%rip),%rax        # 0x7fc2f9f802f0 <__cpu_model>
   0x00007fc2f9df9518 <+24>:	lea    0xd7041(%rip),%r10        # 0x7fc2f9ed0560 <bshuf_untrans_bit_elem_AVX>
   0x00007fc2f9df951f <+31>:	lea    0xd5e0a(%rip),%r9        # 0x7fc2f9ecf330 <bshuf_trans_bit_elem_AVX>
   0x00007fc2f9df9526 <+38>:	lea    0xd55e3(%rip),%r8        # 0x7fc2f9eceb10 <shuffle_avx2>
   0x00007fc2f9df952d <+45>:	lea    0x1468d5(%rip),%r11        # 0x7fc2f9f3fe09
   0x00007fc2f9df9534 <+52>:	mov    %rsp,%rbp
   0x00007fc2f9df9537 <+55>:	push   %r12
   0x00007fc2f9df9539 <+57>:	push   %rbx
   0x00007fc2f9df953a <+58>:	mov    0xc(%rax),%ebx
   0x00007fc2f9df953d <+61>:	lea    0xd595c(%rip),%rax        # 0x7fc2f9eceea0 <unshuffle_avx2>
   0x00007fc2f9df9544 <+68>:	mov    %ebx,%r12d
   0x00007fc2f9df9547 <+71>:	and    $0x10,%r12d
   0x00007fc2f9df954b <+75>:	and    $0x4,%bh
   0x00007fc2f9df954e <+78>:	jne    0x7fc2f9df95ad <unshuffle+173>
   0x00007fc2f9df9550 <+80>:	test   %r12d,%r12d
   0x00007fc2f9df9553 <+83>:	lea    0xd0cc6(%rip),%rax        # 0x7fc2f9eca220 <bshuf_untrans_bit_elem_scal>
   0x00007fc2f9df955a <+90>:	lea    0xd392f(%rip),%r10        # 0x7fc2f9ecce90 <bshuf_untrans_bit_elem_SSE>
   0x00007fc2f9df9561 <+97>:	cmove  %rax,%r10
   0x00007fc2f9df9565 <+101>:	lea    0xd31f4(%rip),%r9        # 0x7fc2f9ecc760 <bshuf_trans_bit_elem_SSE>
   0x00007fc2f9df956c <+108>:	lea    0xd088d(%rip),%rax        # 0x7fc2f9ec9e00 <bshuf_trans_bit_elem_scal>
   0x00007fc2f9df9573 <+115>:	lea    0xd2446(%rip),%r8        # 0x7fc2f9ecb9c0 <unshuffle_sse2>
   0x00007fc2f9df957a <+122>:	cmove  %rax,%r9
   0x00007fc2f9df957e <+126>:	lea    0xd034b(%rip),%rax        # 0x7fc2f9ec98d0 <unshuffle_generic>
   0x00007fc2f9df9585 <+133>:	lea    0xd02c4(%rip),%r11        # 0x7fc2f9ec9850 <shuffle_generic>
   0x00007fc2f9df958c <+140>:	cmovne %r8,%rax
   0x00007fc2f9df9590 <+144>:	lea    0xd1f39(%rip),%r8        # 0x7fc2f9ecb4d0 <shuffle_sse2>
   0x00007fc2f9df9597 <+151>:	lea    0x13d46a(%rip),%rbx        # 0x7fc2f9f36a08
   0x00007fc2f9df959e <+158>:	cmove  %r11,%r8
   0x00007fc2f9df95a2 <+162>:	lea    0x14685b(%rip),%r11        # 0x7fc2f9f3fe04
   0x00007fc2f9df95a9 <+169>:	cmove  %rbx,%r11
   0x00007fc2f9df95ad <+173>:	vmovq  %r9,%xmm2
   0x00007fc2f9df95b2 <+178>:	vmovq  %r8,%xmm3
   0x00007fc2f9df95b7 <+183>:	mov    %r11,0x186cc2(%rip)        # 0x7fc2f9f80280 <host_implementation>
   0x00007fc2f9df95be <+190>:	vpinsrq $0x1,%r10,%xmm2,%xmm1
   0x00007fc2f9df95c4 <+196>:	vpinsrq $0x1,%rax,%xmm3,%xmm0
   0x00007fc2f9df95ca <+202>:	movl   $0x1,0x186cd4(%rip)        # 0x7fc2f9f802a8 <implementation_initialized>
=> 0x00007fc2f9df95d4 <+212>:	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
   0x00007fc2f9df95da <+218>:	vmovdqu %xmm0,0x186ca6(%rip)        # 0x7fc2f9f80288 <host_implementation+8>
   0x00007fc2f9df95e2 <+226>:	vextracti128 $0x1,%ymm0,0x186cac(%rip)        # 0x7fc2f9f80298 <host_implementation+24>
   0x00007fc2f9df95ec <+236>:	vzeroupper
   0x00007fc2f9df95ef <+239>:	pop    %rbx
   0x00007fc2f9df95f0 <+240>:	pop    %r12
   0x00007fc2f9df95f2 <+242>:	pop    %rbp
   0x00007fc2f9df95f3 <+243>:	jmp    *%rax
End of assembler dump.

It's worth noting that if I compress the data using blosc2 and the ZSTD codec, I don't have an issue.

I've had a chance to investigate this more. It's been a while since I've used gdb, so it took me a moment, but it's probably obvious to you that the issue is that there is an AVX2 instruction in the general unshuffle code, and we probably aren't even calling unshuffle_avx2. I thought this might be because the wheel built by gcc was targeting the instruction set available on the build host, but curiously enough I built the wheel locally and got the same failure.
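To illustrate what I think is happening (this is only a sketch, not the actual c-blosc2 source): when a whole translation unit is built with -mavx2, GCC is free to emit AVX2 anywhere in it, including in the dispatcher that merely stores the selected function pointers. Something like the following, compiled at a high optimization level (e.g. -O3 -mavx2), can end up with the stores merged into 256-bit vector moves like the vpinsrq/vinserti128/vmovdqu sequence in the disassembly above:

typedef void (*kernel_fn)(void);

/* Table of selected kernels; the name mirrors the host_implementation symbol
   in the disassembly, but the layout here is made up. */
struct impl_table {
  kernel_fn shuffle;
  kernel_fn unshuffle;
  kernel_fn trans_bit_elem;
  kernel_fn untrans_bit_elem;
};

static struct impl_table host_impl;

/* Four adjacent pointer stores: with -mavx2, GCC may merge them into 256-bit
   vector stores, so the selection code itself executes AVX2 instructions even
   when it ends up picking the SSE2 or generic kernels. */
void pick_impl(kernel_fn shuf, kernel_fn unshuf, kernel_fn trans, kernel_fn untrans) {
  host_impl.shuffle = shuf;
  host_impl.unshuffle = unshuf;
  host_impl.trans_bit_elem = trans;
  host_impl.untrans_bit_elem = untrans;
}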

That's interesting. Can you double-check that the issue is in the generic unshuffle by putting e.g. a printf somewhere in https://github.com/Blosc/c-blosc2/blob/main/blosc/shuffle-generic.h#L62 ? Chances are that modern compilers are optimizing this function with AVX2. Also, can you check whether this also happens if you pass a chunk compressed with blosc2?

In Blosc/c-blosc2#436, it looks like we started compiling shuffle.c with -mavx2 and -msse2. I expect this is the source of the issue.
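For comparison only, and not necessarily what the eventual fix does: instead of passing -mavx2 for the whole translation unit, GCC can confine AVX2 code generation to individual functions with __attribute__((target("avx2"))), leaving everything else at the baseline instruction set. A toy sketch:

#include <immintrin.h>

/* Only this function is compiled with AVX2 enabled; the rest of the file keeps
   the baseline instruction set, so the dispatcher and generic fallbacks stay
   safe on non-AVX2 CPUs. The function itself is just a toy example. */
__attribute__((target("avx2")))
void add8_i32_avx2(int *dst, const int *a, const int *b) {
  __m256i va = _mm256_loadu_si256((const __m256i *)a);
  __m256i vb = _mm256_loadu_si256((const __m256i *)b);
  _mm256_storeu_si256((__m256i *)dst, _mm256_add_epi32(va, vb));
}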

@FrancescAlted I know I didn't get around to following up with the printf proof of where the offending code is generated, but I think the link I provided and a glance at the build logs show that the compiler is targeting a potentially unavailable instruction set when building this file. Please let me know if this is something you might look into; otherwise I'll need to find a workaround. Thanks.

We are discussing this right now in PR Blosc/c-blosc2#622. If you can help (with code or testing), you are more than welcome.

With Blosc/c-blosc2#622 merged, this should be fixed. Please feel free to reopen this issue if necessary.

@FrancescAlted Is a 2.x python-blosc2 release planned with these changes to c-blosc2? Thanks.

Python-Blosc2 2.7.1 is out. It should fix this issue.