lightning_hsmd at 100% CPU on cln 24.08 mainnet
daywalker90 opened this issue · 9 comments
I've seen this multiple times now. lightning_hsmd sits at 100% CPU and i have to restart cln to get it back to normal. Strangely this is not happening with my testnet nodes running on the same machine and the same binaries.
Concur. But I also see both lightning_hsmd and lightning_gossi keep running at 100% after I shut down lightingd:
$ sudo systemctl stop lightningd
$ sudo systemctl status lightningd
○ lightningd.service - Core Lightning daemon
Loaded: loaded (/etc/systemd/system/lightningd.service; disabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2024-09-10 15:14:36 CEST; 5min ago
Process: 17183 ExecStart=/usr/bin/lightningd (code=killed, signal=TERM)
Main PID: 17183 (code=killed, signal=TERM)
Tasks: 2 (limit: 9438)
Memory: 101.4M
CPU: 19min 34.980s
CGroup: /system.slice/lightningd.service
├─17252 /usr/bin/../libexec/c-lightning/lightning_hsmd
└─17278 /usr/bin/../libexec/c-lightning/lightning_gossipd
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17224 (sql) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17226 (bookkeeper) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17230 (peerswap) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17252 (lightning_hsmd) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17278 (lightning_gossi) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 73560 (circular) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: Stopped Core Lightning daemon.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Consumed 8min 33.185s CPU time.
sep 10 15:14:36 vlad-VirtualBox lightningd[17215]: Lost connection to the RPC socket.
sep 10 15:14:36 vlad-VirtualBox lightningd[17220]: Lost connection to the RPC socket.
$ top
top - 15:18:08 up 3:06, 2 users, load average: 3,62, 2,47, 1,64
Tasks: 295 total, 5 running, 288 sleeping, 0 stopped, 2 zombie
%Cpu(s): 0,4 us, 0,5 sy, 0,0 ni, 99,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
MiB Mem : 7941,8 total, 3416,4 free, 827,8 used, 3697,6 buff/cache
MiB Swap: 2048,0 total, 1215,6 free, 832,4 used. 6816,5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17252 vlad 20 0 5620 3328 3328 R 100,0 0,0 3:41.55 lightning_hsmd
17278 vlad 20 0 20148 8072 3592 R 100,0 0,1 3:33.24 lightning_gossi
1 root 20 0 167692 12244 8148 S 0,0 0,2 0:13.00 systemd
2 root 20 0 0 0 0 S 0,0 0,0 0:00.05 kthreadd
3 root 20 0 0 0 0 S 0,0 0,0 0:00.00 pool_workqueue_release
4 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker/R-rcu_g
5 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker/R-rcu_p
6 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker/R-slub_
7 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker/R-netns
They prevent my Ubuntu 22 from shutting down!
Hmmmm, the best way to debug this is:
sudo perf record --call-graph=dwarf -p `pidof lightning_hsmd`
Then sudo perf report
will let you explore what is taking all the time.
Obviously this only works when hsmd is consuming lots of CPU.
Without knowing what it's doing, this is hard to debug! Presumably something is hammering it with requests: if we can figure out what the request is, that might give us a clue!
Thanks! Ok, one more request, what does sudo strace -p $(pidof lightning_hsmd) -o /tmp/trace
give? It will run forever, so just let it run for a few seconds then Control-C.
From this report, and another one, it's slamming the read/poll loop. I might have to add some debug for this. It doesn't happen for me :(
My instance is running on a rpi4, the 100% cpu bug happens after approximately half a day of ok running, every day it seems.
I run the strace, results are attached.
trace.txt
Noticed my lightning_hsmd
was running at 100% today.
Samples: 204K of event 'cycles:P', Event count (approx.): 192624081792
Children Self Command Shared Object Symbol
+ 99.80% 0.00% lightning_hsmd libc.so.6 [.] __libc_start_main@@GLIBC_2.34
+ 99.80% 0.00% lightning_hsmd libc.so.6 [.] __libc_start_call_main
+ 99.80% 0.00% lightning_hsmd lightning_hsmd [.] main
+ 99.80% 1.03% lightning_hsmd lightning_hsmd [.] io_loop
+ 95.34% 1.38% lightning_hsmd [kernel.kallsyms] [k] system_call_vectored_common
+ 93.92% 1.60% lightning_hsmd [kernel.kallsyms] [k] system_call_exception
+ 89.70% 0.29% lightning_hsmd lightning_hsmd [.] daemon_poll
+ 88.26% 0.50% lightning_hsmd libc.so.6 [.] __poll
+ 85.50% 0.12% lightning_hsmd [kernel.kallsyms] [k] sys_poll
+ 85.36% 6.22% lightning_hsmd [kernel.kallsyms] [k] do_sys_poll
+ 42.76% 13.25% lightning_hsmd [kernel.kallsyms] [k] sock_poll
+ 32.89% 0.00% lightning_hsmd [unknown] [k] 0000000000000000
+ 31.10% 14.36% lightning_hsmd [kernel.kallsyms] [k] unix_poll
+ 24.14% 0.36% lightning_hsmd [kernel.kallsyms] [k] poll_freewait
+ 16.69% 6.36% lightning_hsmd [kernel.kallsyms] [k] __pollwait
+ 15.77% 15.76% lightning_hsmd [kernel.kallsyms] [k] fput
+ 10.78% 10.78% lightning_hsmd [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 8.86% 0.12% lightning_hsmd lightning_hsmd [.] io_ready
+ 8.86% 0.18% lightning_hsmd lightning_hsmd [.] do_plan
+ 8.33% 0.25% lightning_hsmd lightning_hsmd [.] do_read_wire
+ 8.13% 0.21% lightning_hsmd lightning_hsmd [.] do_read_wire_header
+ 7.87% 0.46% lightning_hsmd libc.so.6 [.] read
+ 7.33% 0.39% lightning_hsmd [kernel.kallsyms] [k] remove_wait_queue
+ 6.51% 0.63% lightning_hsmd [kernel.kallsyms] [k] add_wait_queue
+ 5.33% 5.33% lightning_hsmd [kernel.kallsyms] [k] __fget_light
+ 5.20% 0.22% lightning_hsmd [kernel.kallsyms] [k] ksys_read
+ 5.16% 0.07% lightning_hsmd [kernel.kallsyms] [k] get_free_pages_noprof
+ 4.92% 0.05% lightning_hsmd [kernel.kallsyms] [k] alloc_pages_mpol_noprof
+ 4.73% 0.22% lightning_hsmd [kernel.kallsyms] [k] vfs_read
+ 4.39% 0.15% lightning_hsmd [kernel.kallsyms] [k] __alloc_pages_noprof
+ 4.13% 0.64% lightning_hsmd [kernel.kallsyms] [k] get_page_from_freelist
+ 3.15% 0.09% lightning_hsmd [kernel.kallsyms] [k] post_alloc_hook
+ 3.07% 3.06% lightning_hsmd [kernel.kallsyms] [k] kernel_init_pages
+ 2.83% 0.54% lightning_hsmd [kernel.kallsyms] [k] arch_local_irq_restore
+ 2.64% 0.22% lightning_hsmd [kernel.kallsyms] [k] sock_read_iter
OK, this did it! I found the case where we get a 0 read and ignore EOF. Thankyou, everyone, for your patience: I've added this to the v24.08.1 milestone, which will obviously need to be released soon.