svartalf/rust-battop

Stack overflow on battery removal

rkost opened this issue · 9 comments

rkost commented

Description

First world problem: When removing the (only) battery that is installed on my system, battop crashes:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1] 23948 abort (core dumped) battop

The terminal becomes unusable after battop crashed (random inserted characters when using the mouse, sigterm will not work).

System info:

  • Arch linux, kernel 5.1.3
  • Thinkpad P71
  • battop 0.2.3

Additional notes

  • battop works fine when not removing the battery. Temperature monitoring does not work (not available), but I guess this is something system-specific.
  • The reason I am swapping batteries is to be able to charge my second one

Let me know if there is any information that is missing.

Oh, snap :( Thank you for this huge bug report.

Unfortunately, the battery in my notebook is soldered to the motherboard, so I'm not sure how can I reproduce this issue without the hammer and a screwdriver help :|

Can you attach the core dump created? I believe in can be done with a coredumpctl dump battop -o battop.dump shell command. While it was produced from the release version (without debug symbols and etc), it still might be very helpful.

rkost commented

Haha, yeah. Most modern notebooks have soldered batteries these days. I probably wouldn't think of some maniac removing the battery during runtime either :D

I have compiled battop for the develop configuration and reproduced the error. Since github does not like the dump's file size I have uploaded it to my server. I hope this helps. Let me know if you need more information or want to test something.

Output of coredumpctl:

rkost@rkost-arch-p71  ~/repos/rust-battop   master  coredumpctl dump battop -o battop.dump 
           PID: 15024 (battop)
           UID: 1000 (rkost)
           GID: 100 (users)
        Signal: 6 (ABRT)
     Timestamp: Sun 2019-06-02 15:40:22 CEST (16s ago)
  Command Line: ./battop
    Executable: /home/rkost/repos/rust-battop/target/debug/battop
 Control Group: /user.slice/user-1000.slice/session-1.scope
          Unit: session-1.scope
         Slice: user-1000.slice
       Session: 1
     Owner UID: 1000 (rkost)
       Boot ID: 1ecfb713969249499bc3d20e0c424b50
    Machine ID: 9799e596c3c24aed8df7672be00906b3
      Hostname: rkost-arch-p71
       Storage: /var/lib/systemd/coredump/core.battop.1000.1ecfb713969249499bc3d20e0c424b50.15024.1559482822000000.lz4
       Message: Process 15024 (battop) of user 1000 dumped core.
                
                Stack trace of thread 15024:
                #0  0x00007fdae318d82f raise (libc.so.6)
                #1  0x00007fdae3178672 abort (libc.so.6)
                #2  0x000056176cc055b7 n/a (/home/rkost/repos/rust-battop/target/debug/battop)
More than one entry matches, ignoring rest.

Awesome response time btw ;)

Awesome response time btw ;)

Accidental luck, I suppose :)

Could you also attach the exact /home/rkost/repos/rust-battop/target/debug/battop binary, which had produced that dump? It seems that our compilers suite are different, because I was not able to get anything useful from the core dump yet. That's what I got at the moment with help of gdb and lldb:

(gdb) bt full
#0  0x00007fdae318d82f in ?? ()
No symbol table info available.
#1  0x0000000000000400 in ?? ()
No symbol table info available.
#2  0x000056176cc196bb in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.
(lldb) bt
* thread #1, name = 'battop', stop reason = signal SIGABRT
  * frame #0: 0x00007fdae318d82f
    frame #1: 0x00007fdae300c1a0
(lldb) f 0
frame #0: 0x00007fdae318d82f
error: read memory from 0x7fdae318d82f failed
(lldb) f 1
frame #1: 0x00007fdae300c1a0
->  0x7fdae300c1a0: addl   %eax, (%rax)
    0x7fdae300c1a2: addb   %al, (%rax)
    0x7fdae300c1a4: addb   %al, (%rax)
    0x7fdae300c1a6: addb   %al, (%rax)

These in ?? () lines are not helpful, and locally generated battop binary seems to have the different memory layout, so I have no idea yet what causes the panic.

I'll try to check the Thinkpad driver sources now, maybe there is some unobvious behavior in it.

So far I had not found anything interesting, except the thing that upower does an additional check before reading the /sys/class/power_supply/ files — it checks if it is a regular file, while battop does not care about it and this might be a problem if device driver decides to mess everything up during the battery unplugging.
It still does not explain the stack overflow and I doubt that this is a reason, since buffer for a file reading is located at heap, but it is worth to check it out.

May I also ask you to run that command before and after the battery swap? Obviously, if you are okay with that.

$ find -L /sys/class/power_supply -maxdepth 2 -printf "\n%p\n" -exec cat {} \;
rkost commented

Binary can be found here.

A backtrace from gdb can be found here (it is quite long as you would expect it ^^)

Maybe I should have mentioned, that battop crashes right after I remove the battery. I have not inserted a new one at this point. I have created two log outputs of your command:

Amazing, thank you very much! That backtrace is really helpful, here is a short explanation for this bug:

  1. battop is backed by a battery crate
  2. battery is creating a struct for each battery found, which represents the battery device and stores the path to a /sys/class/power_supply/* folder
  3. This struct is cached by battop and information in it is refreshed each second (except the path, which is "constant", since batteries are just plugged in usually ;) )
  4. When you are unplugging the battery, Linux removes the respective /sys/class/power_supply/* folder, but battop still has the cached path, so..
  5. When it tries to read the new information, there are no files exists and due to another well hidden bug at these lines it falls into a infinite recursion :|

This was an interesting investigation, I should say :) Thank you once again for your invaluable help!

I've created the issues to handle these bugs and will try to fix them and publish the new battop version tomorrow:

I'm sorry for a delay.

So, all related bugs were fixed (and it seems that they were caused by a stupid copy-n-paste error) and necessary changes were made in the issue-8 branch.

I'm attaching the "debug" version build from this branch, if you want to test it out: battop.gz, or, as usual, with a git, you can check out the issue-8 branch and build it manually.

Since the bug should be covered now, I'll wait till tomorrow and publish the new version then.

UPD: With a current behavior battop will close gracefully with an exit code > 0, which it still better than panicking and messing up the terminal :)

rkost commented

Nice! Can confirm that battop will just close with some exit code != 0. I guess an error message would be nice for the future but for now this is totally fine. Thanks for fixing!

You are absolutely right, this is not the best user experience, but I'm not sure yet how exactly should battop behave in this case. I had started working on this, see the svartalf/rust-battery#31 issue, so it will be better in the next versions.

Thank you for the bug report and the testing, there would be no chance I would fix that alone :)

Release process is finished already and AUR package is up to date now.