Test failures on ARMv7

Question

Test failures on ARMv7

troplin opened this issue 9 years ago · 27 comments

Starting a new Issue from the discussion at PR #25

On my WD EX4100 NAS-System test fail:

running 6 tests
test context::tests::number_generator ... ok
test context::tests::resume_ontop ... ok
test context::tests::type_sizes ... ok
fatal runtime error: Could not unwind stack, error = 9
Process didn't exit successfully: `/context-rs/target/debug/context-30e53ad2e638d392` (signal: 4, SIGILL: illegal instruction)

To learn more, run the command again with --verbose.

Processor: ARMv7 HF (Marvell Armada 38x)
System: Ubuntu 15.10 inside a Docker Container on a BusyBox based host
Rust Version: Unofficial rust-1.8.0-nightly-2016-02-19-917db28 from RustBuild

Answer 1 · 2016-02-21T11:47:19.000Z

I think I might have found the reason for this. Rust wraps extern "C" methods into a small "trampoline" method which (as far as I understand it) aborts the process using SIGILL if it's hit by an exception.

Currently there is only a single point in the code where this could happen: the stack_alignment test.
Which means that ARM devices also run on unaligned stacks... Great.
If my assumptions are correct then the trampoline method is also the reason for the unaligned stacks, @zonyitoo.
So thanks Rust, for f****ing up the C ABI...

Answer 2 · 2016-02-21T12:47:02.000Z

Also regarding the first issue (page size of 32 KB):
I've searched a bit on the internet and the CPU (Marvell Armada 38x) doesn't even seem to support that page size... Very suspicious. Where could this number come from?

Do you think I should contact the manufacturer or is it normal to mess with the page size configuration?

Answer 3 · 2016-02-21T12:58:24.000Z

@troplin , what @lhecker points out is a problem that the stack pointer is not aligned properly. In all platforms, some instructions require the access address are aligned, such as 16bytes, 8bytes. But we found that in the current implementation, when the execution goes into the entry function, the stack pointer is not aligned properly, which will make the whole program crash or produce some unexpected results.

Answer 4 · 2016-02-21T13:14:59.000Z

@zonyitoo , yes I understand.

I was referring to my first test run that I posted on #25 which gave me a different result:

[...]
test stack::tests::stack_size_too_small ... FAILED

failures:

---- stack::tests::stack_size_too_small stdout ----
    thread 'stack::tests::stack_size_too_small' panicked at 'assertion failed: `(left == right)` (left: `32768`, right: `8192`)', src/stack.rs:216
note: Run with `RUST_BACKTRACE=1` for a backtrace.
[...]

This was apparently caused by the page size of 32KB which is a bit unusual. @lhecker probably fixed this but it's still a bit suspicious.

Answer 5 · 2016-02-21T13:23:34.000Z

The page size is get from _SC_PAGESIZE. The code to get minimum stack size and maximum stack size is also in the same file. Maybe you could write a simple test program in C to test if the data is as @lhecker assumed.

Answer 6 · 2016-02-21T13:30:47.000Z

I guess it's correct:

root@9e54b0b4fb85:/context-rs# getconf PAGESIZE
32768

But why the system reports a page size that is not even supported by the CPU is beyond me.

Answer 7 · 2016-02-21T13:33:14.000Z

If the configure is not support by the CPU, so why the OS can run properly? Hmm.. strange.

Answer 8 · 2016-02-21T13:43:58.000Z

Hm, I just tried to setup an openssh server in a docker container and this is what I got:

Setting up dbus (1.10.0-1ubuntu1) ...
dbus-uuidgen: error while loading shared libraries: libsystemd.so.0: ELF load command alignment not page-aligned
dpkg: error processing package dbus (--configure):
 subprocess installed post-installation script returned error exit status 127

It seems that it also causes other problems. I'll try to contact WD...

Answer 9 · 2016-02-21T13:44:25.000Z

@troplin LoL.

Answer 10 · 2016-02-21T16:00:02.000Z

I fixed the SIGILL error just now though... It won't really help with this though I fear.
On the other hand I still have 2 jailbreaked iPhones (ARMv7s and ARMv8) on which I could test this. It's a bit hard to set this up though so I won't come around doing this until next week.

Answer 11 · 2016-02-21T16:44:33.000Z

Ok SIGILL is fixed but not stack alignment:

root@9e54b0b4fb85:/context-rs# cargo test
   Compiling context v1.0.0 (file:///context-rs)
     Running target/debug/context-30e53ad2e638d392

running 6 tests
test context::tests::number_generator ... ok
test context::tests::resume_ontop ... ok
test context::tests::type_sizes ... ok
test stack::tests::stack_size_too_large ... ok
test context::tests::stack_alignment ... FAILED
test stack::tests::stack_size_too_small ... ok

failures:

---- context::tests::stack_alignment stdout ----
    thread 'context::tests::stack_alignment' panicked at 'assertion failed: `(left == right)` (left: `8`, right: `0`)', src/context.rs:188
note: Run with `RUST_BACKTRACE=1` for a backtrace.


failures:
    context::tests::stack_alignment

test result: FAILED. 5 passed; 1 failed; 0 ignored; 0 measured

@lhecker Anything I can to to help you?

Answer 12 · 2016-02-22T14:37:24.000Z

@troplin Well I'm still trying to find out if this is an issue with Boost.Context or with Rust's ridiculously annoying trampoline method (I contacted the author of Boost.Context in the meantime - I hope he replies). I'm just not that good in giving answers in this case, because I lack a lot of knowledge about this kind of low level development.

I believe that you could try to change this line though to use a value of #52 or #68 instead, because AFAIK it determines the offset of the first stack frame relative to the base stack address. As you can see in the ASCII graph at the beginning of that file the actual context data is only 0x30 bytes large (or 48 in dec.).

Answer 13 · 2016-02-22T18:33:28.000Z

@lhecker Both offsets seem to work, at least according to cargo test. I don't know how extensive those test are though...

Anyway, my current setup using master seems to do the job for now. So, no hurry!

Answer 14 · 2016-02-23T01:14:50.000Z

@lhecker I was trying to make the jump_fcontext and ontop_fcontext to act just like a Rust method (by adding some extra instructions to make it look like a normal Rust fn method call).

I think it is worth to investigate Rust's calling convention, and then make all those callback functions to be a normal fn instead of extern "C" fn.

I am sure that std::panic::propagate cannot pass through the extern "C" fn, which is very bad.

Answer 15 · 2016-02-24T18:09:02.000Z

Hey @troplin I tested the 1.0 branch on armv7s and aarch64 just now and I think I fixed the rest of your issues too!

Answer 16 · 2016-02-24T20:56:41.000Z

I can confirm that test pass now.
Did you get an answer from the author of Boost.Context?

Answer 17 · 2016-02-24T21:02:12.000Z

Yup @troplin. I forwarded our findings to him already and he's in the process of fixing it already.

The only "important" platform (32/64-Bit of x86 & ARM) that's not working now is the i386 Version for Windows (see appveyor results) but I wrote him that already and will wait for him to test it.

cc: @zonyitoo

Answer 18 · 2016-02-24T21:35:29.000Z

So was it a bug in Boost.Context after all?
Or a combination of that and the Rust trampoline?

Answer 19 · 2016-02-24T21:48:38.000Z

Yeah it's a bug in Boost.Context after all…
The trampoline for extern Rust methods is honestly still a bit shitty though… 😅

Answer 20 · 2016-02-25T01:13:07.000Z

The trampoline would be inlined if you use --release build (hope so). 😅 But based on the output ASM, yeah, LLVM is quite clever!

Answer 21 · 2016-02-27T23:15:13.000Z

Okay an Update (@zonyitoo):

All (or at least most) platforms should be fixed in Boost.Context now. There is only one thing though:
The reason the alignment test failed is actually not really because of Boost.Context and my "ARM fix" was wrong. Instead my testing method assumed that LLVM would compile similiar code on ARM as it does on x86.
(LLVM aligns stack frames and thus each of the first local variable in a function to 16 Bytes to allow for SSE optimizations and as you might know SSE needs a fixed 16 Byte Alignment.)

This is not the case on ARM though, because ARMs ABI neither mandates 16 Byte Stack Frame Alignment (but a 8 Byte one) nor does it align local arrays to 16 Bytes, because the required alignment for NEON operations is variable (you can specify anything from "unaligned" up to a 128 Byte Alignment AFAIK). LLVM thus tries to be smart here and aligns local data as compact as possible.

I think we should mark the alignment test with #[cfg_attr(feature = "nightly", test)] and use the nightly #[repr(simd)] feature to create a SIMD struct on the stack in the context function. That way LLVM will align the struct to the specific SSE or NEON alignment. After that we can check the misalignment by computing the the address of the local data modulo the std::mem::align_of() of the SIMD struct and check if it's 0 (currently we always check for 16 Byte Alignment instead).

Answer 22 · 2016-02-28T07:47:17.000Z

@lhecker So you mean for all users on ARM, they have to manually create a dummy struct inside the context function to let LLVM to align? Or we only need to do that in our test?

Answer 23 · 2016-02-28T11:48:54.000Z

You should read it again, @zonyitoo. 😄
So yeah: It's only required for the test to see if the stack frame fulfills the required ABI alignment by checking it's address modulo it's align_of() (which should be 16 on x86 to work with SSE and some dynamic value (but probably 8) for ARM).

Answer 24 · 2016-02-28T11:50:35.000Z

@lhecker 😄 Beg for forgiveness.

Answer 25 · 2016-02-28T11:52:16.000Z

Basically: The assumption that every Arch aligns arrays etc. to 16 Bytes is (now obviously) wrong: Instead we really need to use a dynamic solution with #[repr(simd)] and align_of().

Answer 26 · 2016-03-01T13:57:39.000Z

This issue should now be resolved with the latest commit, which is why I'm closing it for now. 😊

Answer 27 · 2016-03-04T15:17:55.000Z

Thanks for all the work.