damien-lemoal/buildroot

Elf2flt relocation problem on newer kernels

Closed this issue · 6 comments

Hi, I have recently been tweaking kernels and toolchains for 32-bit RISC-V no-MMU systems. I ran into the same problem and I think I have come up with some workarounds. So far I have only tried 32-bit, but I guess they apply to 64-bit as well.

On the 5.9.4 kernel, relocation seems to be applied to the wrong addresses during load_flat_file in fs/binfmt_flat.c. This is quite stubborn; what worked for me was lowering datapos by 0x20 like this:

		datapos = ALIGN(realdatastart +
				MAX_SHARED_LIBS * sizeof(u32),
				FLAT_DATA_ALIGN) - 0x20; 

and, at the same time, changing the elf2flt.ld linker script like this, at line 50:

	. = . + 4; /* add this */
	. = ALIGN(0x20) ;
	_etext = . ;

Then busybox runs without any noticeable problems. I keep the kernel's executable file formats configuration like this, with everything unrelated disabled:

[*] Kernel support for scripts starting with #!     
[*] Kernel support for flat binaries                
[ ] Enable support for very old legacy flat binaries
[ ] Enable ZFLAT support                            
[ ] Enable shared FLAT support                      
[ ] Kernel support for MISC binaries                
[ ] Enable core dump support                                                                         

On the latest (5.17) kernel, it seems other changes have gone in, and busybox runs without any modification to the kernel source or the linker script, so using the latest kernel for this project seems a good choice (if 64-bit also works). If you are interested in 32-bit support (I verified it on both QEMU and a custom FPGA rv32ima softcore), I'll clean up my modified code and open a PR.

For elf2flt changes, I have one patch still under test here: https://github.com/damien-lemoal/elf2flt. I tested it only on 64-bit and it works, most of the time... There are still cases where userspace crashes on start (/sbin/init as busybox shell), so I am not yet 100% sure this change is complete. If you could have a look and try it on 32-bit RISC-V, that would help improve it.

Note that I did patch the kernel to update binfmt_flat loading as there was no way to support the gap between data and text that the loader adds by default. I think the patch went in 5.14 or 5.15, would need to check again.

I'll try the patch soon. By the way, are the crashes caused by out-of-range relocations? In all my tests there are hundreds of "reloc outside program 0xffffxxxx" errors during flat binary loading, and I have to skip this otherwise fatal check to get busybox working.

[    0.132538] binfmt_flat: reloc outside program 0xfffbc824 (0 - 0x51f80/0x48640)
[    0.132588] binfmt_flat: reloc outside program 0xfffbc824 (0 - 0x51f80/0x48640)
[    0.132983] binfmt_flat: reloc outside program 0xfffbc8f4 (0 - 0x51f80/0x48640)
[    0.133360] binfmt_flat: reloc outside program 0xfffbc8f4 (0 - 0x51f80/0x48640)
[    0.133658] binfmt_flat: reloc outside program 0xfffbc894 (0 - 0x51f80/0x48640)
....

According to this, it might be normal, so I haven't looked further into it yet.

Nope. Loading and relocation are always OK for me. The crashes are most of the time signal 11 segfaults. I initially suspected incorrect handling of a relocation type leading to an invalid address being generated, but the crashes are not 100% reproducible. They are rare-ish when things work, but can be very sensitive to any tiny change (and then I get 100% reproducible crashes). However, we recently discovered some bugs in the SoC PLIC initialization which could cause problems. Not entirely sure. I am using a Sipeed Maix Bit board for testing (Kendryte K210 SoC). We are now trying to stabilize boot (PLIC problems lead to problems with SD card detection etc). Once that is done, I want to revisit elf2flt and debug further if the crashes persist. You trying on 32-bit could help by giving hints :)

Interesting... Actually, just today when tweaking binfmt_flat.c (the DATA_START_OFFSET_WORDS macro; this was during debugging, so literally a slightly wrong loader plus a busybox with a larger gap between data and text), the sig 11 also occurred. In my case it's probably related to flat binary loading, because I'm building minimal rv32 systems, so I don't have a PLIC enabled in QEMU and there's only the CLINT.

Here's a panic dump, though I don't think it can help much.

[    0.129339] binfmt_flat: sp=805d5ffc
[    0.129795] binfmt_flat: start_thread(regs=0x(ptrval), entry=0x80580044, start_stack=0x805d5fa0)
[    0.130578] init[1]: unhandled signal 11 code 0x2 at 0x8059906c
[    0.130839] CPU: 0 PID: 1 Comm: init Not tainted 5.17.0-rc4 #20
[    0.131067] Hardware name: riscv-virtio,qemu (DT)
[    0.131232] epc : 8059906c ra : 80598fb0 sp : 805d5f00
[    0.131407]  gp : 805c8f60 tp : 00000000 t0 : 00000000
[    0.131557]  t1 : 00000000 t2 : 00000000 s0 : 805d5fa4
[    0.131702]  s1 : 805d5fbc a0 : 805d5f08 a1 : 00000000
[    0.131849]  a2 : 805d5f80 a3 : 805d5f68 a4 : 00000000
[    0.132014]  a5 : ffffffff a6 : 805d5f80 a7 : 805d5f80
[    0.132177]  s2 : 805c8b08 s3 : 00000002 s4 : 805843fc
[    0.132339]  s5 : 805d5f08 s6 : 0000000e s7 : 00000000
[    0.132499]  s8 : 00000000 s9 : 00000000 s10: 00000000
[    0.132657]  s11: 00000000 t3 : 00000000 t4 : 00000000
[    0.132832]  t5 : 00000000 t6 : 00000000
[    0.132963] status: 00000080 badaddr: ffffffff cause: 00000007
[    0.134374] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.134887] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Yep, when I get a crash, it looks like this. I have not done the address chasing to see where the code is. But given that the shell does not print anything at all, I suspect it is very early in process start. Disassembling this is a pain with a flat binary, so it is not easy to track down...