SITL: crash on Apple M1 when the simulator is compiled for arm64 and --speedup is not specified at startup
Opened this issue · 19 comments
Bug report
The SITL simulator crashes on macOS after connecting to it if --speedup is not set to an explicit value (e.g,. 1) from the command line.
Steps to reproduce:
- Get a Mac with an Apple Silicon CPU.
- Compile the SITL on macOS for
arm64
and run it with./build/sitl/bin/arducopter --model copter
- Connect to its TCP port
- Notice that the SITL crashes with "illegal instruction"
Version
ArduCopter 4.1.2, after cherry-picking the commit that fixes the floating-point exceptions on Apple Silicon.
Platform
[X] All
[ ] AntennaTracker
[ ] Copter
[ ] Plane
[ ] Rover
[ ] Submarine
Airframe type
N/A.
Hardware type
SITL
Logs
❯ build/sitl/bin/arducopter --model copter
Suggested EK3_BCOEF_* = 16.288, EK3_MCOEF = 0.209
Starting sketch 'ArduCopter'
Starting SITL input
Using Irlock at port : 9005
bind port 5760 for 0
Serial port 0 on TCP port 5760
Waiting for connection ....
Connection on serial port 5760
bind port 5762 for 2
Serial port 2 on TCP port 5762
bind port 5763 for 3
Serial port 3 on TCP port 5763
MKFIFO failed with File exists
MKFIFO failed with File exists
Home: -35.363262 149.165237 alt=584.000000m hdg=353.000000
Smoothing reset at 0.001
Closed connection on serial port 0
Waiting for connection ....
Connection on serial port 5760
bus.bus=0 address=0x55
bus.bus=1 address=0x55
bus.bus=0 address=0x38
bus.bus=0 address=0x39
bus.bus=1 address=0x38
bus.bus=1 address=0x39
bus.bus=2 address=0x38
bus.bus=2 address=0x39
bus.bus=3 address=0x38
bus.bus=3 address=0x39
bus.bus=0 address=0x38
bus.bus=0 address=0x39
validate_structures:490: Validating structures
zsh: illegal hardware instruction build/sitl/bin/arducopter --model copter
Additional info
lldb
pins down the crash to this part of the code:
Process 29663 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e390008)
frame #0: 0x0000000100166ad8 arducopter`AP_Logger_File::io_thread_alive(this=0x0000000100a04d00) const at AP_Logger_File.cpp:1026:20
1023 // SITL speedup options, so we allow for it here.
1024 SITL::SITL *sitl = AP::sitl();
1025 if (sitl != nullptr) {
-> 1026 timeout_ms *= sitl->speedup;
1027 }
1028 #endif
1029 return (AP_HAL::millis() - _io_timer_heartbeat) < timeout_ms;
Target 0: (arducopter) stopped.
Printing the value of sitl->speedup
reveals that it's -1, so we end up with a negative timeout:
(lldb) print sitl->speedup
(AP_Float) $0 = (_value = -1)
Maybe you have a bad value in eeprom, try launching with -w
to wipe.
I forgot to mention that the same thing happens even with a clean EEPROM. I guess the problem is that the default value of the speedup parameter in libraries/SITL/SITL.cpp
is -1, so if it is not overridden from the command line or from a default param file, it probably remains -1.
Edit: it seems like SIM_Aircraft.cpp
sets its own speedup parameter to 1.0 by default and it is responsible for updating the SITL speedup parameter in fill_fdm()
, but the code looks like this:
if (is_equal(last_speedup, -1.0f) && !is_equal(get_speedup(), 1.0f)) {
sitl->speedup = get_speedup();
}
which basically means that the SITL speedup gets updated only if the new value is not 1.0.
Key question: why is it different on MacOSX?
What it build with clang ?
What it build with clang ?
I've tried to replicate on Linux with both g++ and clang++ without luck.
I managed to confirm that the issue arises only when compiling for Apple M1 (i.e. arm64
). Forcing the SITL to be compiled for x86_64
produces an executable that does not crash.
Posting the Clang version number here for sake of posterity:
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin21.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
FWIW, stil->speedup
in AP_Logger_File.cpp
in the io_thread_alive()
function is still -1 even when compiling for x86_64
, so it still yields a negative timeout value in timeout_ms
. I believe that a negative timeout at this part of the code is incorrect; the intention was probably to use 1 if no timeout was specified in the command line.
@ntamas Would you like to put a PR in? This seems quite reasonable that at this line we could get some oddities here?
I've filed a PR but this is more like a stopgap measure than an actual fix. I believe that the real problem is somewhere in SIM_Aircraft::fill_fdm()
, which explicitly excludes updating stil->speedup
when the speedup is equal to 1 (the default), and there must be a reason for that.
Do you have MavProxy working on your M1 Mac? I haven't been able to get that to work so can't get to this point.
No, I did not bother setting that up. mavproxy
depends on wxPython
, which does not have Apple SIlicon wheels yet so I would have had to complie that from source. I just connect straight to the main TCP port of the simulated instance from Skybrush and use that to interact with the simulated instance.
Edit: I tried to compile wxPython manually but there are tons of additional dependencies that I needed to install from Homebrew and I was still getting errors so I gave up for the time being. I think that the pure command-line parts of MAVProxy would probably work on Apple M1 but it would need to declare wxPython and related components as optional dependencies to facilitate using MAVProxy without a GUI.
Edit 2: forgot to mention; to reproduce this error you don't need MAVProxy, just start the SITL directly and then connect to TCP port 5760 with telnet
or socat
; this is enough to get to the point that causes the crash.
I have tried this again with ArduCopter 4.2.0-rc2. sitl->speedup
is still -1 in AP_Logger_File
, and it still doesn't cause a crash when compiled for x86_64
. I cannot test this on arm64
, though, because the SITL does not compile cleanly for arm64
. I need to remove all instances of PACKED
from LogStructure.h
to make it compile, otherwise I get tons of these warnings and ld
bails out:
ld: warning: pointer not aligned at address 0x10036218A (__ZN6Copter13log_structureE + 2 from ArduCopter/Log.cpp.44.o)
ld: unaligned pointer(s) for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Removing PACKED
makes the code compile, but then the SITL shows warnings about the log structure being invalid, and the logging never starts so it never hits the problematic part in AP_Logger_File
.
Okay, so I figured out that it is enough to remove PACKED
from struct LogStructure
in LogStructure.h
(everything else can remain PACKED
), and then the code compiles on Apple M1 as native arm64
, and the crash is still there.
I'm getting this same crash on my M1Pro even with --speedup 1 - but on a different line:
thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e390008)
frame #0: 0x00000001001fd0e0 arduplane AP_ESC_Telem_SITL::update(this=0x0000600000010000) at AP_ESC_Telem_SITL.cpp:45:23
42
43 #if HAL_WITH_ESC_TELEM
44 for (uint8_t i = 0; i < sitl->state.num_motors; i++) {
-> 45 update_rpm(i, sitl->state.rpm[sitl->state.vtol_motor_start+i]);
46 }
47 #endif
I have discovered that the cause of the problem in AP_ESC_Telem_SITL.cpp is that update_rpm() was taking a uint16_t as a parameter whereas sitl->state.rpm[] contained floats.
@ntamas I notice that sitl->speedup is an AP_Float, while timeout_ms is a uint32_t. It seems to me that it might help to rewrite the calculation to make it clearer when the conversion from float to int should be happening:
e.g.
timeout_ms *= sitl->speedup;
perhaps could be written as:
timeout_ms = (uint32_t)(sitl->speedup * (float)timeout_ms)
At least - I think that's what is supposed to be happening.
Key question: why is it different on MacOSX?
From the research I've done, it looks like the compiler on MacOSX is building in stricter type checking on parameter passing on implicit integer/floating point conversion which is generating a floating point exception. I found this quote:
It seems that on the M1, MacOS does not send a SIGFPE, but rather a SIGILL, when floating point exceptions are unblocked.
I think SIGILL = EXC_BAD_INSTRUCTION - so this is actually a floating point exception. I think it's possibly caused by a float being passed when the method requires a 2 byte uint16_t. If the float is passed as a float it might be 4 bytes, but I think c compilers sometimes pass floats as doubles, so it might even be 8 bytes on the stack.
I have discovered that the cause of the problem in AP_ESC_Telem_SITL.cpp is that update_rpm() was taking a uint16_t as a parameter whereas sitl->state.rpm[] contained floats
Well, but the compiler should know about that and implicitly convert the float
to an uint16_t
when calling update_rpm()
, right?
I notice that sitl->speedup is an AP_Float, while timeout_ms is a uint32_t. It seems to me that it might help to rewrite the calculation to make it clearer when the conversion from float to int should be happening:
That would be better, but still, a negative speedup does not make sense at that point and I don't know what an implicit conversion of a negative float
to an uint32_t
would do at that point of the code, so it still has to be figured out why sitl->speedup
can remain negative if I don't specify the speedup explicitly at startup.
While compile build/sitl/bin/arducopter, I saw these warnings
[723/882] Compiling libraries/AP_AHRS/AP_AHRS_DCM.cpp ../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:71:23: warning: implicit conversion from 'unsigned long' to 'float' changes value from 4294967295 to 4294967296 [-Wimplicit-const-int-float-conversion] if(scaledvalue >= 4294967295ul) ~~ ^~~~~~~~~~~~ ../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:99:23: warning: implicit conversion from 'unsigned long' to 'float' changes value from 4294967295 to 4294967296 [-Wimplicit-const-int-float-conversion] if(scaledvalue >= 4294967295ul) ~~ ^~~~~~~~~~~~ ../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:128:27: warning: implicit conversion from 'long' to 'float' changes value from 2147483647 to 2147483648 [-Wimplicit-const-int-float-conversion] if(scaledvalue >= 2147483647l) ~~ ^~~~~~~~~~~ ../../libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:163:27: warning: implicit conversion from 'long' to 'float' changes value from 2147483647 to 2147483648 [-Wimplicit-const-int-float-conversion] if(scaledvalue >= 2147483647l) ~~ ^~~~~~~~~~~ 4 warnings generated.
libraries/AP_PiccoloCAN/piccolo_protocol/scaledencode.c:71
// Make sure number fits in the range
if(scaledvalue >= 4294967295ul)
number = 4294967295ul;
else if(scaledvalue <= 0)
number = 0;
else
number = (uint32_t)(scaledvalue + 0.5f); // account for fractional truncation
should it be
if(scaledvalue >= 4294967295.0)
or
if((scaledvalue - 4294967295.0) > 0.0000001)
I am encounter "Floating point exception" while try takeoff in SITL, but I can't find exactly reason, please see #20869 ,
I am build on macOS using clang 13, while build Rover for physical board like F4BY, no warning at all.
Alongside the existing issue of crashing when speedup is not provided, I've found that quadplane crashes at different speeds based on the command, even when --speedup
is specified. I'm using the latest Xcode:
Xcode 16.0
Build version 16A242d
Side note: It's not clear to me why sim_vehicle.py output both -S
and --speedup
in the same command - it doesn't matter if I leave that out.
Crashes (after 10 seconds of connecting)
./build/sitl/bin/arduplane -S -w --model quadplane --speedup 1 --sysid 1 --slave 0 --sim-address=127.0.0.1
crashes quicker (within 3 seconds, even without connecting)
./build/sitl/bin/arduplane -S -w --model quadplane --speedup 1 --sysid 3 --slave 0 --defaults ./Tools/autotest/default_params/quadplane.parm --sim-address=127.0.0.1
Note: Start SITL to connect
The crash only happens once the vehicle starts, so I connected with mavproxy. I just keep it running in the background.
mavproxy.py --master=tcp:127.0.0.1:5760 --baudrate 57600
Good news is MAVProxy seems to work on arm64 macs.
Workaround
Use macOS rosetta. First, install it: https://support.apple.com/en-gb/102527. Compilation is slower though, not as slow as rust 😜️.
Building with rosetta
first setup python (I use uv)
uv venv .venv
source .venv/bin/activate
uv pip install empy==3.3.4 pexpect future dronecan pyserial setuptools
build
arch -x86_64 ./waf clean
arch -x86_64 ./waf configure --board sitl
arch -x86_64 ./waf plane
Running with rosetta
arch -x86_64 ./build/sitl/bin/arduplane -S -w --model quadplane --speedup 1 --sysid 1 --slave 0 --sim-address=127.0.0.1
Rosetta is slow
I've noticed the build is faster on arm64 Linux (docker) than on x86_64 macOS. In that case, you could use my docker repo to build/run Ardupilot for Ubuntu or Debian https://github.com/ben-xD/ardupilot-sitl-docker