Kernel panic after the boot in Lenovo T440
duselguy opened this issue · 38 comments
That's interesting. Unfortunately, it happens only on real HW and doesn't happen on any of the machines I own, so it will be harder to debug.
Two questions:
- How much RAM does the machine have?
- Which screen resolution did you choose?
A few experiments to do:
- Try with commit 9adb423 and see if the same thing happens.
- In the interactive bootloader, type 'e' and then ENTER to edit the command line and write there: -cmd /initrd/usr/bin/dp
Hopefully, at least the debug panel will run. At that point, take screenshots of screens 1 to 3. You can navigate using regular digit keys (1,2,3, etc.). Scroll using PAGE_UP and PAGE_DOWN.
If that doesn't ring any bells for me, it will be necessary for me to prepare several instrumented builds and ask you to try them.
Last commit is fae15c7 .
- RAM 8GB
- Default 800*600*32
- After dp the first screen appears, but I can't navigate, because the keyboard is blocked:
I tried the same on another book (Amilo):
Navigate in dp is ok, but after quit:
The same happens in QEMU; I thought it would simply continue the boot.
Thanks for the test, Vladimir.
It’s sad that the keyboard doesn’t work. Have you tried just pressing the caps lock or the num lock and checking if the led turns on and off?
Also, can you please try with commit 9adb423?
About the second part, that’s not a bug. Unix kernels and Tilck must have init (pid 1) running all the time. If that process exits, panic is the default behavior. In this case, I made you run the debug panel instead of init (that’s a hack!) so, after you quit, you see that the kernel panics. Typically dp is run from the shell and you can exit from it without issues.
- The book has no LEDs for caps/num lock. Only an Fn LED (which makes no sense in our case).
- Tried with 9adb423 - the same result as above.
Regards,
Vladimir
P.S. You can prepare a specific version with your hooks to get more info about the problem.
Thanks Vladimir. Sure, I'll prepare an instrumented build to debug this.
Hello Vladimir (@duselguy),
I've instrumented the code to debug this issue. Unfortunately, I'm extremely busy at the moment, so I hope this will be enough to debug the issue.
Instructions:
- Delete your build directory (with rm -rf)
- Check out the debug_issue_98 branch
- Build as usual and flash Tilck's image onto a USB drive
- Boot Tilck on your Lenovo machine with the following options: -pk -bb
- If the bug is deterministically reproducible, you should immediately see a panic like: Paddr: 0x00aabbcc has invalid ref-count: 0
- Reboot the machine and boot with one more option: -pk -bb -debug_pa <the paddr you saw in the panic>
- Take plenty of screenshots. Try scrolling the console buffer with SHIFT + PAGE_UP. Thanks to the -pk option, it should work even after panic.
- Post the results here :-)
@duselguy Btw, since you mentioned that on this machine the keyboard doesn't work, please add also the following two options to debug this issue: -ps2_log -ps2_selftest and make screenshots.
Sorry, I didn't find in the doc how to add options for the boot (4. above in your scenario). Please advise.
Also, should the -ps2... options be used in a separate boot (without -pk -bb)?
Thanks.
OK, when you boot Tilck's flash drive, you first end up in Tilck's bootloader.
If you press 'e' and then ENTER there, you should be able to edit the kernel's command line.
There, you can add those options. (It's the same you did when you used -cmd /initrd/usr/bin/dp)
Yes, it makes sense to just use -pk -bb and then -pk -bb -debug_pa <paddr> for debugging the panic itself on the debug branch and then, as a separate task, to use -ps2_log -ps2_selftest -pk on the master branch to observe the PS/2 logs.
Ok, I missed or misinterpreted "and to edit kernel's cmdline." from the doc.
In any case the 1st scenario doesn't work as described (on debug_issue_98 with commit f44053a5e).
- I entered -pk -bb then boot
- Received kernel panic and Paddr: address 0x37f3e000 has invalid ref_count: 0
- I entered -pk -bb -debug_pa 0x37f3e000 then boot
- The same picture with kernel panic, etc.
- shift+PgUp doesn't work
P.S. If -pk doesn't work, does it make sense to try the 2nd scenario with master?
@duselguy What do you mean exactly by "doesn't work as described" ?
debug_issue_98 is just an instrumented build, not a fix. It is expected to panic exactly as before; it will just hopefully print some diagnostic data before panicking. Let me show you a screenshot of what happens when -debug_pa 0x22c000 is passed in the cmdline.
Clearly, there's no crash/panic but I can see the places where the ref-count changes for the given physical page. Can you please send the screenshot in your case? Idea: pick up a higher resolution in the boot menu (v), so that we'll see more contents on the screen.
The options -pk and -bb are pointless if the PS/2 keyboard doesn't work. You'll have to test them separately on the master branch with -ps2_log -ps2_selftest.
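The kind of per-page tracing that -debug_pa is described as enabling can be sketched roughly as follows. This is only an illustration of the idea, not Tilck's code: the table size, the `debug_pa` variable, and the helpers `page_ref_inc`, `page_ref_dec` and `trace_change` are all hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u
#define NUM_PAGES  1024u   /* tiny table covering 4 MB, for illustration only */

static uint16_t ref_counts[NUM_PAGES];
static uint64_t debug_pa = UINT64_MAX;  /* traced physical address (none by default) */
static unsigned traced_events;          /* number of logged ref-count changes */

/* Log a ref-count change, but only for the page containing debug_pa. */
static void trace_change(uint64_t paddr, const char *op, const char *where)
{
    assert(op != NULL && where != NULL);

    if ((paddr >> PAGE_SHIFT) != (debug_pa >> PAGE_SHIFT))
        return;

    traced_events++;
    printf("[debug_pa] %s 0x%llx: ref-count is now %u (%s)\n", op,
           (unsigned long long)paddr,
           (unsigned)ref_counts[paddr >> PAGE_SHIFT], where);
}

/* Increment the ref-count; returns -1 for pages outside the tracked area. */
static int page_ref_inc(uint64_t paddr, const char *where)
{
    uint64_t idx = paddr >> PAGE_SHIFT;

    if (idx >= NUM_PAGES)
        return -1;

    ref_counts[idx]++;
    trace_change(paddr, "inc", where);
    return 0;
}

/* Decrement the ref-count; a count of 0 is the "invalid ref-count" case. */
static int page_ref_dec(uint64_t paddr, const char *where)
{
    uint64_t idx = paddr >> PAGE_SHIFT;

    if (idx >= NUM_PAGES || ref_counts[idx] == 0)
        return -1;

    ref_counts[idx]--;
    trace_change(paddr, "dec", where);
    return 0;
}
```

With `debug_pa` set to 0x22c000, only changes to that page are printed, while a page outside the tracked table is never counted at all.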
@duselguy Thanks for tests and the clarification, Vladimir!
What I see is super weird and unexpected: it looks like that page has never been mapped, while it is in a page directory. Maybe the memory there is dirty somehow? I added many more checks to the debug_issue_98 branch.
Can you please re-run the test with the latest commit in the debug_issue_98 branch (63a8244) ?
Note: it's enough to boot with -debug_pa 0x37f3e000. The other options are pointless, if the keyboard does not work.
@duselguy Hello Vladimir. Thanks for the update!
The info in this screenshot made the difference. I believe I've discovered and fixed the issue in the debug_issue_98 branch.
What happened is that the ramdisk (RDSK in the mmap) was placed by the UEFI bootloader past the end of the physical memory usable by Tilck (896 MB), and the page with ref-count zero belongs to the ramdisk. Tilck does not update the ref-count for pages outside of that area and actually won't be able to access them through the linear mapping from the kernel. Even if the ref-count was kept correctly, this bug would lead to other incorrect behaviors later.
Extra details: the 896 MB limitation comes from the fact that on 32-bit systems, we cannot have more than 4 GB of virtual memory. Because of the way virtual memory is used on Linux, 3 GB are reserved for userspace, while 1 GB is reserved for the kernel. The first 896 MB are used to map the physical memory directly into the kernel address space; the remaining 128 MB are used for special purposes. In order to support more than 896 MB of physical memory, Linux has a feature called "high-mem", which substantially increases the complexity of the OS by mapping on-the-fly physical memory beyond that limit into the virtual space. Tilck does not have this feature and won't have it in the future simply because on small-scale systems (Tilck's target) there is << 896 MB of RAM. On desktop systems instead, it will be possible in the future to run in 64-bit mode, where this problem simply does not exist, so the current approach will work perfectly and it will be able to map all the physical memory in the virtual space.
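As a minimal sketch of the limit described above, the check below tells whether a physical range fits entirely inside the 896 MB linear mapping. The name LINEAR_MAPPING_SIZE appears in the bootloader snippet in this thread, but its value here is assumed from the 896 MB figure, and the helper `range_in_linear_mapping` is hypothetical, not Tilck's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 896 MB of physical memory mapped linearly. */
#define LINEAR_MAPPING_MB   896u
#define LINEAR_MAPPING_SIZE ((uint64_t)LINEAR_MAPPING_MB << 20)

/*
 * Hypothetical helper: returns true when the physical range
 * [paddr, paddr + size) lies entirely inside the linear mapping.
 */
static bool range_in_linear_mapping(uint64_t paddr, uint64_t size)
{
    assert(size > 0);

    /* Written to avoid overflow: paddr + size is never computed. */
    return paddr < LINEAR_MAPPING_SIZE &&
           size <= LINEAR_MAPPING_SIZE - paddr;
}
```

For instance, a single 4 KB page at 0x37f3e000 is still below the limit (896 MB is 0x38000000), but a 1 MB range starting at that address is not, which is the kind of situation a ramdisk placed close to the limit would run into.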
Conclusion: can you please try again with the latest update of the debug_issue_98 branch and confirm that the problem has been solved? Note: the keyboard probably still won't work, but at least Tilck will boot.
@duselguy Hello Vladimir,
This looks to me like exactly the same issue. The ramdisk starts close to the 896 MB limit. That shouldn't be possible anymore.
Sorry for the stupid question, but are you sure this is commit f64248da? I did a push --force on that branch.
Vlad
I checked twice: with the image and USB stick under QEMU: it is commit f64248d .
Regards,
Vladimir
P.S. I checked also the commit related changes in tilck code in my local repo:

Thank you for double checking. I'll prepare another update to dig deeper into the bootloader.
Is this correct for debug_issue_98 branch:
This branch is 1 commit ahead, 4 commits behind master.
Yes, the branch is not rebased on the top of master anymore, but don't worry about that. After we discover the problem, I'll create a dedicated commit on master for the fix.
So, I've updated the branch once more. The new commit is now 3330462
Now, after this screen:

you should see a screen like this:

Just the values will be different. Please post the screenshot of the second screen in your case.
You boot only using UEFI, right?
Thanks,
Vlad
Yes, [UEFI Only].
@duselguy OK, that means that the problem is not necessarily there. Maybe phys_mem_lim did wrap around because you have more than 4 GB of usable mem? I made some significant changes in commit 8df0dd4. Can you please test with that? Note: please make a screenshot of both the addresses in the bootloader and the panic message, as I added new stuff.
After checking the UEFI spec, I realized why my previous fix doesn't work. This code:
    paddr = LINEAR_MAPPING_SIZE;
    status = BS->AllocatePages(AllocateMaxAddress,
                               EfiLoaderData,
                               (ctx->rounded_tot_used_bytes / PAGE_SIZE) + 1,
                               &paddr);
does not need the fix to move paddr by ctx->rounded_tot_used_bytes, simply because, according to the UEFI 2.8 spec:
Allocation requests of Type AllocateMaxAddress allocate any available range of pages whose
uppermost address is less than or equal to the address pointed to by Memory on input.
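To make the quoted rule concrete, here is a small sketch of the constraint it places on where the pages may end up. `fits_below_max_address` is a hypothetical checker written for this illustration, not a UEFI function, and the 4 KB PAGE_SIZE is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size */

/*
 * Hypothetical checker (not a UEFI API): given the base address at which
 * `pages` pages were placed, tell whether the placement satisfies an
 * AllocateMaxAddress request, i.e. whether the uppermost byte address of
 * the range is <= max_addr. Overflow of base + length is not handled.
 */
static bool fits_below_max_address(uint64_t base, uint64_t pages,
                                   uint64_t max_addr)
{
    assert(pages > 0);

    uint64_t last = base + pages * PAGE_SIZE - 1;  /* uppermost byte address */
    return last <= max_addr;
}
```

Since the constraint applies to the end of the range and not to its start, the whole allocation already ends at or below the value passed in, so there is no need to subtract the allocation size from it beforehand.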
So, the problem can really be the phys_mem_lim wrapping around the 32-bit limit.
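The suspected wrap-around can be illustrated with a short sketch. How Tilck actually computes phys_mem_lim is not shown in this thread, so the two helpers below are purely hypothetical: one stores the end of a memory region in a 32-bit value and silently wraps, the other clamps instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: end address of a memory region, stored in 32 bits.
 * If base + len is 4 GB or more, only the low 32 bits survive.
 */
static uint32_t mem_limit_32(uint64_t base, uint64_t len)
{
    return (uint32_t)(base + len);   /* truncates to the low 32 bits */
}

/* Same computation, but clamped to UINT32_MAX instead of wrapping. */
static uint32_t mem_limit_32_clamped(uint64_t base, uint64_t len)
{
    assert(len <= UINT64_MAX - base);   /* the 64-bit sum must not overflow */

    uint64_t end = base + len;
    return end > UINT32_MAX ? UINT32_MAX : (uint32_t)end;
}
```

With a region that ends at 8 GB, the wrapping version yields 0, so every later comparison against the limit would be made against a nonsensical value; the clamped version stays at the top of the 32-bit range.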
@duselguy That's amazing!!! Thank you for helping me debug this issue! :-)
So many ACPI errors! Of course most of them are about the lack of a handler for the Embedded Controller region (address space). I'll work on that someday. I'm perfectly aware that this affects ACPI poweroff and maybe even reboot on some machines.
Now, the KB is maybe locked because it's a USB keyboard and ACPI disabled the PS/2 emulation. Tilck has no support for USB, so it has zero chances to work. But, you can boot with -noacpi and see how it goes. You can add -ps2_log to see some diagnostic info about the communication with the PS/2 controller (probably emulated here).
- I thought (based on the mnemonic SB only) that these ACPI errors were Sound Blaster related(?)
- I updated my previous comment with yet another screen (to not forget about this ACPI).
- Will try your suggestions with KB.
Thanks,
Vladimir
_SB in ACPI stands for System Bus. And yeah, it will require a fair amount of extra work to have full support for ACPI events on all the machines. Unfortunately, I don't have any time at the moment. I was just trying to fix the most critical bugs affecting things that I assume to work. If something doesn't work because there is no support for it, that's perfectly fine. Not a bug.
Anyway, Tilck's target architecture is ARM and maybe RISC-V in the future, so there's a reason why I'd be investing more in some things instead of others.
Agree with ARM/RISC-V target (!).
P.S. I'm working as a tester (not a unit tester) without looking into the code/changes in this project (sorry, I also don't have enough time and knowledge now).
P.P.S. Please, don't close the issue before I check your suggestions for the KB (I don't like technical debt).
Sure, no problem man. I'm not in a hurry to close this issue. I'm curious too to see if with -noacpi you'll have at least emulated PS/2 working. Also, I'm very against technical debt too!
It's sad to hear that the PS/2 keyboard doesn't work even with -noacpi. If you have the patience, try also with -ps2_log -ps2_selftest. It might show something. But I won't be surprised at this point if nothing works on that machine, even if on all of my machines (6) it works. You might also check in the BIOS settings if you could enable "PS/2 keyboard emulation".
- Nothing about PS/2 in BIOS
- Nothing new with -ps2_selftest
- Will check that at least the boot completes on the master branch (as I understand, you need to move some of the code changes from the debug_issue_98 branch there)
- Really/Fortunately I didn't plan to use this book with tilck -:)
Regards,
Vladimir
Thanks so much, man! 👍
NP, you are welcome!
Checked, fix is ok, thanks.

