ElementsProject/lightning

lightning_hsmd at 100% CPU on cln 24.08 mainnet

daywalker90 opened this issue · 9 comments

I've seen this multiple times now. lightning_hsmd sits at 100% CPU and i have to restart cln to get it back to normal. Strangely this is not happening with my testnet nodes running on the same machine and the same binaries.

Concur. But I also see both lightning_hsmd and lightning_gossi keep running at 100% after I shut down lightingd:

$ sudo systemctl stop lightningd
$ sudo systemctl status lightningd
○ lightningd.service - Core Lightning daemon
     Loaded: loaded (/etc/systemd/system/lightningd.service; disabled; vendor preset: enabled)
     Active: inactive (dead) since Tue 2024-09-10 15:14:36 CEST; 5min ago
    Process: 17183 ExecStart=/usr/bin/lightningd (code=killed, signal=TERM)
   Main PID: 17183 (code=killed, signal=TERM)
      Tasks: 2 (limit: 9438)
     Memory: 101.4M
        CPU: 19min 34.980s
     CGroup: /system.slice/lightningd.service
             ├─17252 /usr/bin/../libexec/c-lightning/lightning_hsmd
             └─17278 /usr/bin/../libexec/c-lightning/lightning_gossipd

sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17224 (sql) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17226 (bookkeeper) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17230 (peerswap) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17252 (lightning_hsmd) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17278 (lightning_gossi) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 73560 (circular) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: Stopped Core Lightning daemon.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Consumed 8min 33.185s CPU time.
sep 10 15:14:36 vlad-VirtualBox lightningd[17215]: Lost connection to the RPC socket.
sep 10 15:14:36 vlad-VirtualBox lightningd[17220]: Lost connection to the RPC socket.

$ top
top - 15:18:08 up  3:06,  2 users,  load average: 3,62, 2,47, 1,64
Tasks: 295 total,   5 running, 288 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0,4 us,  0,5 sy,  0,0 ni, 99,1 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :   7941,8 total,   3416,4 free,    827,8 used,   3697,6 buff/cache
MiB Swap:   2048,0 total,   1215,6 free,    832,4 used.   6816,5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  17252 vlad      20   0    5620   3328   3328 R 100,0   0,0   3:41.55 lightning_hsmd
  17278 vlad      20   0   20148   8072   3592 R 100,0   0,1   3:33.24 lightning_gossi
      1 root      20   0  167692  12244   8148 S   0,0   0,2   0:13.00 systemd
      2 root      20   0       0      0      0 S   0,0   0,0   0:00.05 kthreadd
      3 root      20   0       0      0      0 S   0,0   0,0   0:00.00 pool_workqueue_release
      4 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-rcu_g
      5 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-rcu_p
      6 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-slub_
      7 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-netns

They prevent my Ubuntu 22 from shutting down!

Hmmmm, the best way to debug this is:

sudo perf record --call-graph=dwarf -p `pidof lightning_hsmd`

Then sudo perf report will let you explore what is taking all the time.

Obviously this only works when hsmd is consuming lots of CPU.

Without knowing what it's doing, this is hard to debug! Presumably something is hammering it with requests: if we can figure out what the request is, that might give us a clue!

Same problem here, one cpu 100% load by lightning_hsmd and sometimes gossip too. I have to manually kill the stuck processes, it won't stop when I simply restart clightning process with systemctl.
I've run the perf commands, here is the result, I hope it helps:
Screenshot from 2024-09-11 08-23-56

Thanks! Ok, one more request, what does sudo strace -p $(pidof lightning_hsmd) -o /tmp/trace give? It will run forever, so just let it run for a few seconds then Control-C.

From this report, and another one, it's slamming the read/poll loop. I might have to add some debug for this. It doesn't happen for me :(

My instance is running on a rpi4, the 100% cpu bug happens after approximately half a day of ok running, every day it seems.
I run the strace, results are attached.
trace.txt

Another one with both commands:

image

trace.zip

Noticed my lightning_hsmd was running at 100% today.

Samples: 204K of event 'cycles:P', Event count (approx.): 192624081792                                 
  Children      Self  Command         Shared Object      Symbol                                        
+   99.80%     0.00%  lightning_hsmd  libc.so.6          [.] __libc_start_main@@GLIBC_2.34             
+   99.80%     0.00%  lightning_hsmd  libc.so.6          [.] __libc_start_call_main                    
+   99.80%     0.00%  lightning_hsmd  lightning_hsmd     [.] main                                      
+   99.80%     1.03%  lightning_hsmd  lightning_hsmd     [.] io_loop                                   
+   95.34%     1.38%  lightning_hsmd  [kernel.kallsyms]  [k] system_call_vectored_common               
+   93.92%     1.60%  lightning_hsmd  [kernel.kallsyms]  [k] system_call_exception                     
+   89.70%     0.29%  lightning_hsmd  lightning_hsmd     [.] daemon_poll                               
+   88.26%     0.50%  lightning_hsmd  libc.so.6          [.] __poll                                    
+   85.50%     0.12%  lightning_hsmd  [kernel.kallsyms]  [k] sys_poll                                  
+   85.36%     6.22%  lightning_hsmd  [kernel.kallsyms]  [k] do_sys_poll                               
+   42.76%    13.25%  lightning_hsmd  [kernel.kallsyms]  [k] sock_poll                                 
+   32.89%     0.00%  lightning_hsmd  [unknown]          [k] 0000000000000000                          
+   31.10%    14.36%  lightning_hsmd  [kernel.kallsyms]  [k] unix_poll                                 
+   24.14%     0.36%  lightning_hsmd  [kernel.kallsyms]  [k] poll_freewait                             
+   16.69%     6.36%  lightning_hsmd  [kernel.kallsyms]  [k] __pollwait                                
+   15.77%    15.76%  lightning_hsmd  [kernel.kallsyms]  [k] fput                                      
+   10.78%    10.78%  lightning_hsmd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave                    
+    8.86%     0.12%  lightning_hsmd  lightning_hsmd     [.] io_ready                                  
+    8.86%     0.18%  lightning_hsmd  lightning_hsmd     [.] do_plan                                   
+    8.33%     0.25%  lightning_hsmd  lightning_hsmd     [.] do_read_wire                              
+    8.13%     0.21%  lightning_hsmd  lightning_hsmd     [.] do_read_wire_header                       
+    7.87%     0.46%  lightning_hsmd  libc.so.6          [.] read                                      
+    7.33%     0.39%  lightning_hsmd  [kernel.kallsyms]  [k] remove_wait_queue                         
+    6.51%     0.63%  lightning_hsmd  [kernel.kallsyms]  [k] add_wait_queue                            
+    5.33%     5.33%  lightning_hsmd  [kernel.kallsyms]  [k] __fget_light                              
+    5.20%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] ksys_read                                 
+    5.16%     0.07%  lightning_hsmd  [kernel.kallsyms]  [k] get_free_pages_noprof                     
+    4.92%     0.05%  lightning_hsmd  [kernel.kallsyms]  [k] alloc_pages_mpol_noprof                   
+    4.73%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] vfs_read                                  
+    4.39%     0.15%  lightning_hsmd  [kernel.kallsyms]  [k] __alloc_pages_noprof                      
+    4.13%     0.64%  lightning_hsmd  [kernel.kallsyms]  [k] get_page_from_freelist                    
+    3.15%     0.09%  lightning_hsmd  [kernel.kallsyms]  [k] post_alloc_hook                           
+    3.07%     3.06%  lightning_hsmd  [kernel.kallsyms]  [k] kernel_init_pages                         
+    2.83%     0.54%  lightning_hsmd  [kernel.kallsyms]  [k] arch_local_irq_restore                    
+    2.64%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] sock_read_iter                            

OK, this did it! I found the case where we get a 0 read and ignore EOF. Thankyou, everyone, for your patience: I've added this to the v24.08.1 milestone, which will obviously need to be released soon.