grrrr/flext

Bus error

i-n-g-o opened this issue · 13 comments

Hello.

I ran into problems using flext when compiling externals (into a library) without using FLEXT_USE_CMEM on a Bela (running a Xenomai Linux).
I did not dig deep, but the result is a Bus Error.

On a desktop linux (archlinux) it does not show this behaviour - the externals load and work fine.

Any ideas where this may come from?

i was just going to report the same thing for Debian/armhf.

Debian/armhf targets processors like the RaspberryPi 😞 (although i tested on an OdroidXU4)

here's a backtrace:

(gdb) run
Starting program: /usr/bin/pd -nogui -nrt -nosound -nomidi -lib simple1
warning: Error disabling address space randomization: Success
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
[New Thread 0xb5ea43d0 (LWP 13648)]

Thread 1 "pd" received signal SIGBUS, Bus error.
0xb69a17a4 in lockfree::CAS2<lockfree::atomic_ptr<lockfree::stack_node>, lockfree::stack_node*, unsigned int> (new2=1, new1=<optimized out>, old2=0, old1=0x0, addr=0xb69c3c04 <ThrRegistry::pending>) at lockfree/cas.hpp:129
129	lockfree/cas.hpp: No such file or directory.
(gdb) bt
#0  0xb69a17a4 in lockfree::CAS2<lockfree::atomic_ptr<lockfree::stack_node>, lockfree::stack_node*, unsigned int> (new2=1, new1=<optimized out>, old2=0, old1=0x0, addr=0xb69c3c04 <ThrRegistry::pending>) at lockfree/cas.hpp:129
#1  lockfree::atomic_ptr<lockfree::stack_node>::CAS (newptr=<optimized out>, oldval=..., this=0xb69c3c04 <ThrRegistry::pending>) at lockfree/atomic_ptr.hpp:89
#2  lockfree::intrusive_stack<LifoCell>::push (node=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at lockfree/stack.hpp:76
#3  Lifo::Push (cell=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flcontainers.h:29
#4  TypedLifo<thr_entry>::Push (c=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flcontainers.h:39
#5  PooledLifo<thr_entry, 1, 10>::Push (c=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flcontainers.h:78
#6  ThrFinder<PooledLifo<thr_entry, 1, 10> >::Push (e=<optimized out>, this=0xb69c3c04 <ThrRegistry::pending>) at flthr.cpp:90
#7  flext_shared::LaunchThread (meth=meth@entry=0xb69a8189 <flext_base_shared::QWorker(flext_shared::thr_params*)>, p=0x0) at flthr.cpp:278
#8  0xb69a8238 in flext_base_shared::StartQueue () at flqueue.cpp:533
#9  0xb699e73a in flext_base_shared::AddMessageMethods (c=0x602638, dsp=<optimized out>, dspin=<optimized out>) at flext.cpp:158
#10 0xb699e7b2 in flext_base_shared::Setup (id=0x6026a0) at flext.cpp:189
#11 0xb699f59e in flext_obj_shared::obj_add (lib=<optimized out>, dsp=<optimized out>, noi=<optimized out>, attr=<optimized out>, idname=0xb6b390f0 "simple1", names=0xb6b390f0 "simple1", 
    setupfun=0xb6b39025 <simple1::__setup__(flext_class*)>, newfun=0xb6b38e9d <simple1::__init__(int, _atom*)>, freefun=0xb6b38f8d <simple1::__free__(flext_hdr*)>, argtp1=0) at fllib.cpp:383
#12 0xb6b38f74 in simple1_setup () from ./simple1.pd_linux
#13 0x0053ed62 in ?? ()

"Bus Errors" on arm typically indicate unaligned memory access...

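To make the alignment point concrete: the address gdb reports for ThrRegistry::pending, 0xb69c3c04, is a multiple of 4 but not of 8. The sketch below checks exactly that. It assumes (nothing in the backtrace confirms it) that the 64-bit compare-and-swap behind CAS2 needs an 8-byte-aligned operand on 32-bit ARM. `is_aligned` and `tagged_ptr32` are made-up names for illustration, not flext code.

```cpp
#include <cstddef>
#include <cstdint>

// True if `addr` is a multiple of `alignment` (which must be a power of two).
constexpr bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    return (addr & (alignment - 1)) == 0;
}

// Hypothetical stand-in for a pointer+counter pair on a 32-bit target.
// alignas(8) asks the compiler to place every instance on an 8-byte boundary.
struct alignas(8) tagged_ptr32 {
    std::uint32_t ptr;  // 32-bit pointer value, as on armhf
    std::uint32_t tag;  // modification counter
};

static_assert(sizeof(tagged_ptr32) == 8, "pair must fit into one 64-bit CAS");
static_assert(alignof(tagged_ptr32) == 8, "pair must be 8-byte aligned");

// The address reported for ThrRegistry::pending in the backtrace.
constexpr std::uintptr_t reported_addr = 0xb69c3c04u;
static_assert(is_aligned(reported_addr, 4), "aligned for 32-bit access");
static_assert(!is_aligned(reported_addr, 8), "not aligned for 64-bit access");
```

If the assumption holds, an object that is only 4-byte aligned would be enough to explain the SIGBUS, and an explicit alignas(8) on the affected type would be one thing to try.
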
hmm, it seems this is related to my own build-system hacks.

at least, building both flext and externals with the "normal" flext-build system (flext/build pd gcc) appears to work fine.

i stand corrected again.

building both flext and externals with the "normal" flext-build system will not result in a "Bus Error", but instead the external will hang Pd and consume 100% of a CPU (so I guess it just entered some endless loop).

so there is an issue with flext itself :-(

grrrr commented

Thank you, i will have a look. That might be the time to bring in boost::atomic instead of hard to maintain self-made code.

grrrr commented

Do you get the ../../source/lockfree/cas.hpp:217:9: warning: #warning blocking CAS2 emulation [-Wcpp] warning on compilation?
That is what i have on armhf with gcc (Raspbian 10.2.1-6+rpi1) 10.2.1 20210110 and it's definitely bad. On the other hand, i don't see problems on loading.
The crash seems to originate from the __sync_bool_compare_and_swap_8 intrinsic though which points to another source of the problem.

the logs for building libflext can be accessed on https://buildd.debian.org/status/package.php?p=pd-flext and the logs for the test (which builds an external and links it with libflext) can be accessed on https://ci.debian.net/packages/p/pd-flext/

the actual test can be found on https://salsa.debian.org/multimedia-team/pd/pd-flext/-/tree/master/debian/tests (but it's really just compiling tutorial/3_attr1 and then running a simple test-patch on it).

grrrr commented

Hi, thank you. I will test the Debian source package myself.
From the logs, i find it a little strange that the testbed kernel announces itself as arm64, testing armhf architecture.

that's because the tests are obviously run on an arm64 CPU (which can execute armhf instructions, similar to an x86_64 CPU which can also run i386 binaries).

however, I also conducted tests on the OdroidXU4, which is a 32bit arm CPU, with the same results.

funnily enough, it seems that the same test succeeds when run on a "Raspberry Pi 4" (using Raspbian/buster in armhf (32bit) mode)

also i just noticed that the OP said:

without using FLEXT_USE_CMEM

(which probably somehow implies that it does work when building with FLEXT_USE_CMEM).

i would like to stress, that i am building with -DFLEXT_USE_CMEM.

and here's some output of valgrind:

==664373== Process terminating with default action of signal 7 (SIGBUS)
==664373==  Invalid address alignment at address 0x52F15BC
==664373==    at 0x52CFD5C: UnknownInlinedFun (cas.hpp:129)
==664373==    by 0x52CFD5C: UnknownInlinedFun (atomic_ptr.hpp:89)
==664373==    by 0x52CFD5C: UnknownInlinedFun (stack.hpp:76)
==664373==    by 0x52CFD5C: UnknownInlinedFun (flcontainers.h:29)
==664373==    by 0x52CFD5C: Push (flcontainers.h:39)
==664373==    by 0x52CFD5C: Push (flcontainers.h:78)
==664373==    by 0x52CFD5C: Push (flthr.cpp:90)
==664373==    by 0x52CFD5C: flext_shared::LaunchThread(void (*)(flext_shared::thr_params*), flext_shared::thr_params*) (flthr.cpp:278)
==664373==    by 0x52D70FF: flext_base_shared::StartQueue() (flqueue.cpp:533)
==664373==    by 0x52CC905: flext_base_shared::AddMessageMethods(_class*, bool, bool) (flext.cpp:158)
==664373==    by 0x52CC97D: flext_base_shared::Setup(flext_class*) (flext.cpp:189)
==664373==    by 0x52CD821: flext_obj_shared::obj_add(bool, bool, bool, bool, char const*, char const*, void (*)(flext_class*), flext_obj_shared* (*)(int, _atom*), void (*)(flext_hdr*), int, ...) (fllib.cpp:383)
==664373==    by 0x51113CB: attr1_setup (in /home/umlaeute/umlaeute-pd-flext/pd-flext-0.6.2/tutorial/3_attr1/attr1.pd_linux)
==664373==    by 0x17CE33: ??? (in /usr/bin/puredata)

so with the feedback from #50 (comment), my current workaround for this issue is to force the use of the blocking CAS/CAS2 emulation on the affected architectures (armhf & armel).

the patch can be found at the Debian pd-flext repository and is basically adding another (set of) define(s), namely USE_BLOCKING_CAS and USE_BLOCKING_CAS2. Setting these at build time will just skip to the blocking implementations. I might have missed some ifdef'ed implementation (e.g. the _MSC_VER block is obviously not skipped if USE_BLOCKING_CAS, which is mostly because i only really care about Debian, where we don't use microsoft compilers...)
The point of the patch is that forcing the blocking behaviour is purely opt-in.
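
A rough sketch of what such an opt-in switch could look like. This is not the Debian patch itself: only the macro names USE_BLOCKING_CAS2 and __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 come from this thread, while `word64`, `pack` and `cas2` are made-up names, and the pointer/counter pair is assumed to fit into one 64-bit word as on a 32-bit target.

```cpp
#include <cstdint>
#include <mutex>

using word64 = std::uint64_t;

// Pack a 32-bit pointer value and a 32-bit counter into one 64-bit word.
constexpr word64 pack(std::uint32_t ptr, std::uint32_t tag) {
    return (static_cast<word64>(tag) << 32) | ptr;
}

#if defined(USE_BLOCKING_CAS2) || !defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8)
// Blocking emulation: a single mutex serializes every CAS2.
// Correct as long as all accesses go through cas2(), but it is not
// lock-free and therefore not realtime-safe.
inline bool cas2(word64* addr, word64 expected, word64 desired) {
    static std::mutex guard;
    std::lock_guard<std::mutex> lock(guard);
    if (*addr != expected)
        return false;
    *addr = desired;
    return true;
}
constexpr bool cas2_is_blocking = true;
#else
// Lock-free path: GCC/Clang builtin operating on the whole 64-bit word.
inline bool cas2(word64* addr, word64 expected, word64 desired) {
    return __sync_bool_compare_and_swap(addr, expected, desired);
}
constexpr bool cas2_is_blocking = false;
#endif
```

Building with -DUSE_BLOCKING_CAS2 selects the mutex path even where the compiler advertises a native 8-byte CAS; without the define, nothing changes for existing builds.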

This is not optimal, or - to put it with @grrrr's words:

it's definitely bad

however, my reasoning is, that a "definitely bad", blocking, non-realtime safe workaround is much better than a crash at startup.
also, i see little harm on the armel architecture, which lacks a hard-float unit so i don't think anybody is actually using such a device to run Pd (not to mention flext).
Things are of course a bit different with armhf (which is the architecture of Raspberry Pi OS/32bit). But then:

  • with newer RPis, i guess/hope/urge that people will switch to 64bit (aarch64 aka arm64) where there seems to be no problem
  • with RPiOS, it seems to already fall back to the blocking default (i think the reason that you are seeing the fallback and I am seeing the crash is, that I am using an ordinary Debian installation, and you are using a Raspbian installation, which is known to have a lower baseline CPU (also targeting the RPi0 and RPi1), so __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 is probably not defined by your compiler...)
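
The guess about __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 can be checked directly, since the macro is predefined by the compiler for the selected target flags. A minimal probe follows; `have_native_cas8` and `cas8_description` are hypothetical helper names.

```cpp
// Reports at compile time whether the compiler advertises a native
// 8-byte __sync compare-and-swap for the current target flags.
constexpr bool have_native_cas8() {
#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
    return true;
#else
    return false;
#endif
}

// Human-readable form of the same information.
inline const char* cas8_description() {
    return have_native_cas8() ? "native 8-byte CAS"
                              : "no native 8-byte CAS (blocking fallback)";
}
```

The same information is available without compiling anything by dumping the predefined macros, e.g. `g++ -dM -E -x c++ /dev/null | grep SYNC_COMPARE_AND_SWAP`, run once on the Debian box and once on the Raspbian box.
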

grrrr commented

Thank you IOhannes, for the clarification. That makes a lot of sense to me.
I am sorry for being so slow in catching up with the issue.
My plan would be to see if boost::atomic could help with this. I would like to outsource the explicit handling of architectures, also for the future.
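
For illustration, a sketch of what outsourcing the architecture handling could look like. It uses std::atomic as a stand-in for boost::atomic, which offers a similar interface; this is a swap for the sake of a dependency-free example, not the planned implementation. The names `word64`, `pack`, `index_of`, `tag_of`, `set_head` and `head_is_lock_free` are hypothetical, and the pointer is simplified to a 32-bit index.

```cpp
#include <atomic>
#include <cstdint>

using word64 = std::uint64_t;

// Pack a 32-bit node index and a 32-bit ABA counter into one word.
constexpr word64 pack(std::uint32_t index, std::uint32_t tag) {
    return (static_cast<word64>(tag) << 32) | index;
}
constexpr std::uint32_t index_of(word64 w) { return static_cast<std::uint32_t>(w); }
constexpr std::uint32_t tag_of(word64 w) { return static_cast<std::uint32_t>(w >> 32); }

// Install a new head index and bump the counter in one atomic step.
inline void set_head(std::atomic<word64>& head, std::uint32_t new_index) {
    word64 old = head.load();
    // On failure compare_exchange_weak refreshes `old`, so the desired
    // value is recomputed from current data on every retry.
    while (!head.compare_exchange_weak(old, pack(new_index, tag_of(old) + 1))) {
    }
}

// Whether the library managed this without a lock on the current target.
inline bool head_is_lock_free(const std::atomic<word64>& head) {
    return head.is_lock_free();
}
```

The attraction is that alignment and the choice between a native instruction and a lock-based fallback become the library's job, and is_lock_free() reports which one was picked.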