kadena-io/chainweb-data

Increasing CPU and memory consumption over time

blazesal opened this issue · 4 comments

Concern

When running the chainweb-data application over a long period of time, its CPU and memory consumption steadily increase. I am worried that this will eventually degrade the system's performance.
Could I please get an explanation of the observed behavior?

(Attached graphs: chainweb-data_cpu_1d, chainweb-data_cpu-30d, chainweb-data_cpu)

The process was left running for days, and its CPU and memory consumption kept increasing. From the consumption graphs and logs I can tell that the service was restarted: the process was killed because the system ran out of memory.
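
For reference, the growth is easy to track outside the monitoring graphs with a simple sampling loop; a minimal sketch, assuming the process runs under the binary name chainweb-data as it does in the logs below:

# Log the resident set size (RSS, in kB) of chainweb-data once a minute.
# ps -C selects processes by command name; -o rss= prints RSS without a header.
while true; do
    ps -o rss= -C chainweb-data | awk -v ts="$(date -Is)" '{print ts, $1 " kB"}'
    sleep 60
done >> chainweb-data-rss.log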

Service logs (chainweb-data):

Jun 29 22:20:46 localhost systemd[1]: kadena-indexer.service: Main process exited, code=killed, status=9/KILL
Jun 29 22:20:46 localhost systemd[1]: kadena-indexer.service: Failed with result 'signal'.
Jun 29 22:20:46 localhost systemd[1]: kadena-indexer.service: Scheduled restart job, restart counter is at 1.

System logs:


Jun 29 22:17:06 localhost CRON[385345]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 29 22:20:46 localhost kernel: [1337443.329441] postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Jun 29 22:20:46 localhost kernel: [1337443.329445] CPU: 0 PID: 385364 Comm: postgres Not tainted 5.4.0-117-generic #132-Ubuntu
Jun 29 22:20:46 localhost kernel: [1337443.329446] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Jun 29 22:20:46 localhost kernel: [1337443.329447] Call Trace:
Jun 29 22:20:46 localhost kernel: [1337443.329454]  dump_stack+0x6d/0x8b
Jun 29 22:20:46 localhost kernel: [1337443.329457]  dump_header+0x4f/0x1eb
Jun 29 22:20:46 localhost kernel: [1337443.329459]  oom_kill_process.cold+0xb/0x10
Jun 29 22:20:46 localhost kernel: [1337443.329462]  out_of_memory+0x1cf/0x4d0
Jun 29 22:20:46 localhost kernel: [1337443.329465]  __alloc_pages_slowpath+0xd5e/0xe50
Jun 29 22:20:46 localhost kernel: [1337443.329468]  __alloc_pages_nodemask+0x2d0/0x320
Jun 29 22:20:46 localhost kernel: [1337443.329471]  alloc_pages_current+0x87/0xe0
Jun 29 22:20:46 localhost kernel: [1337443.329473]  __page_cache_alloc+0x72/0x90
Jun 29 22:20:46 localhost kernel: [1337443.329475]  pagecache_get_page+0xbf/0x300
Jun 29 22:20:46 localhost kernel: [1337443.329477]  filemap_fault+0x6b2/0xa50
Jun 29 22:20:46 localhost kernel: [1337443.329479]  ? unlock_page_memcg+0x12/0x20
Jun 29 22:20:46 localhost kernel: [1337443.329481]  ? page_add_file_rmap+0xff/0x1a0
Jun 29 22:20:46 localhost kernel: [1337443.329483]  ? filemap_map_pages+0x24c/0x380
Jun 29 22:20:46 localhost kernel: [1337443.329485]  ext4_filemap_fault+0x32/0x50
Jun 29 22:20:46 localhost kernel: [1337443.329488]  __do_fault+0x3c/0x170
Jun 29 22:20:46 localhost kernel: [1337443.329489]  do_fault+0x24b/0x640
Jun 29 22:20:46 localhost kernel: [1337443.329491]  __handle_mm_fault+0x4c5/0x7a0
Jun 29 22:20:46 localhost kernel: [1337443.329493]  handle_mm_fault+0xca/0x200
Jun 29 22:20:46 localhost kernel: [1337443.329496]  do_user_addr_fault+0x1f9/0x450
Jun 29 22:20:46 localhost kernel: [1337443.329497]  __do_page_fault+0x58/0x90
Jun 29 22:20:46 localhost kernel: [1337443.329500]  ? exit_to_usermode_loop+0xea/0x160
Jun 29 22:20:46 localhost kernel: [1337443.329501]  do_page_fault+0x2c/0xe0
Jun 29 22:20:46 localhost kernel: [1337443.329504]  do_async_page_fault+0x39/0x70
Jun 29 22:20:46 localhost kernel: [1337443.329506]  async_page_fault+0x34/0x40
Jun 29 22:20:46 localhost kernel: [1337443.329508] RIP: 0033:0x55681dc4b700
Jun 29 22:20:46 localhost kernel: [1337443.329513] Code: Bad RIP value.
Jun 29 22:20:46 localhost kernel: [1337443.329514] RSP: 002b:00007fff8daba0a8 EFLAGS: 00010246
Jun 29 22:20:46 localhost kernel: [1337443.329515] RAX: 000055681ff04140 RBX: 0000000000000000 RCX: 0000000000000000
Jun 29 22:20:46 localhost kernel: [1337443.329517] RDX: 000055681ff05ab8 RSI: 00007f5d5ab7a778 RDI: 00000000000004e1
Jun 29 22:20:46 localhost kernel: [1337443.329517] RBP: 00007fff8daba0f0 R08: 000055681ff04140 R09: 0000000000000000
Jun 29 22:20:46 localhost kernel: [1337443.329518] R10: 00007f5d63771728 R11: 000055681fe866b0 R12: 000055681ff03850
Jun 29 22:20:46 localhost kernel: [1337443.329519] R13: 00007f5d5ab7a458 R14: 00007f5d5abb5aa8 R15: 0000000000000002
Jun 29 22:20:46 localhost kernel: [1337443.329520] Mem-Info:
Jun 29 22:20:46 localhost kernel: [1337443.329524] active_anon:1718800 inactive_anon:253964 isolated_anon:0
Jun 29 22:20:46 localhost kernel: [1337443.329524]  active_file:115 inactive_file:124 isolated_file:0
Jun 29 22:20:46 localhost kernel: [1337443.329524]  unevictable:0 dirty:0 writeback:0 unstable:0
Jun 29 22:20:46 localhost kernel: [1337443.329524]  slab_reclaimable:9000 slab_unreclaimable:13126
Jun 29 22:20:46 localhost kernel: [1337443.329524]  mapped:35890 shmem:36019 pagetables:4851 bounce:0
Jun 29 22:20:46 localhost kernel: [1337443.329524]  free:26686 free_pcp:1301 free_cma:0
Jun 29 22:20:46 localhost kernel: [1337443.329527] Node 0 active_anon:6875200kB inactive_anon:1015856kB active_file:460kB inactive_file:496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:143560kB dirty:0kB writeback:0kB shmem:144076kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jun 29 22:20:46 localhost kernel: [1337443.329528] Node 0 DMA free:15908kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jun 29 22:20:46 localhost kernel: [1337443.329531] lowmem_reserve[]: 0 1887 7853 7853 7853
Jun 29 22:20:46 localhost kernel: [1337443.329533] Node 0 DMA32 free:40064kB min:16208kB low:20260kB high:24312kB active_anon:1669504kB inactive_anon:283864kB active_file:248kB inactive_file:120kB unevictable:0kB writepending:0kB present:2080616kB managed:2015080kB mlocked:0kB kernel_stack:64kB pagetables:4244kB bounce:0kB free_pcp:1824kB local_pcp:504kB free_cma:0kB
Jun 29 22:20:46 localhost kernel: [1337443.329537] lowmem_reserve[]: 0 0 5966 5966 5966
Jun 29 22:20:46 localhost kernel: [1337443.329539] Node 0 Normal free:51276kB min:51240kB low:64048kB high:76856kB active_anon:5205696kB inactive_anon:731992kB active_file:1016kB inactive_file:1284kB unevictable:0kB writepending:0kB present:6291456kB managed:6117180kB mlocked:0kB kernel_stack:2560kB pagetables:15160kB bounce:0kB free_pcp:3168kB local_pcp:988kB free_cma:0kB
Jun 29 22:20:46 localhost kernel: [1337443.329542] lowmem_reserve[]: 0 0 0 0 0
Jun 29 22:20:46 localhost kernel: [1337443.329544] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Jun 29 22:20:46 localhost kernel: [1337443.329551] Node 0 DMA32: 377*4kB (UME) 319*8kB (UE) 382*16kB (UME) 263*32kB (UME) 145*64kB (UME) 52*128kB (UME) 19*256kB (UME) 1*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 39900kB
Jun 29 22:20:46 localhost kernel: [1337443.329557] Node 0 Normal: 156*4kB (UMH) 634*8kB (UMEH) 1016*16kB (UMEH) 400*32kB (UMEH) 135*64kB (UME) 57*128kB (UMEH) 4*256kB (MH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 51712kB
Jun 29 22:20:46 localhost kernel: [1337443.329565] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 29 22:20:46 localhost kernel: [1337443.329566] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 29 22:20:46 localhost kernel: [1337443.329567] 36624 total pagecache pages
Jun 29 22:20:46 localhost kernel: [1337443.329568] 358 pages in swap cache
Jun 29 22:20:46 localhost kernel: [1337443.329569] Swap cache stats: add 215039, delete 214681, find 19419431/19422500
Jun 29 22:20:46 localhost kernel: [1337443.329570] Free swap  = 0kB
Jun 29 22:20:46 localhost kernel: [1337443.329570] Total swap = 524284kB
Jun 29 22:20:46 localhost kernel: [1337443.329571] 2097016 pages RAM
Jun 29 22:20:46 localhost kernel: [1337443.329572] 0 pages HighMem/MovableOnly
Jun 29 22:20:46 localhost kernel: [1337443.329572] 59974 pages reserved
Jun 29 22:20:46 localhost kernel: [1337443.329572] 0 pages cma reserved
Jun 29 22:20:46 localhost kernel: [1337443.329573] 0 pages hwpoisoned
Jun 29 22:20:46 localhost kernel: [1337443.329573] Tasks state (memory values in pages):
Jun 29 22:20:46 localhost kernel: [1337443.329574] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 29 22:20:46 localhost kernel: [1337443.329578] [    401]     0   401    30364      451   245760        0          -250 systemd-journal
Jun 29 22:20:46 localhost kernel: [1337443.329581] [    433]     0   433     5426      437    65536        0         -1000 systemd-udevd
Jun 29 22:20:46 localhost kernel: [1337443.329583] [    481]   100   481     4594      228    77824        0             0 systemd-network
Jun 29 22:20:46 localhost kernel: [1337443.329584] [    484]     0   484     2111      798    53248        0             0 haveged
Jun 29 22:20:46 localhost kernel: [1337443.329586] [    485]   101   485     5964     1005    90112        0             0 systemd-resolve
Jun 29 22:20:46 localhost kernel: [1337443.329588] [    486]   102   486    22547      211    77824        0             0 systemd-timesyn
Jun 29 22:20:46 localhost kernel: [1337443.329590] [    627]     0   627     1703       65    49152        0             0 cron
Jun 29 22:20:46 localhost kernel: [1337443.329591] [    628]   103   628     1922      225    53248        0          -900 dbus-daemon
Jun 29 22:20:46 localhost kernel: [1337443.329593] [    636]     0   636     7414     2008    98304       42             0 networkd-dispat
Jun 29 22:20:46 localhost kernel: [1337443.329595] [    637]     0   637    58178      212    90112        0             0 polkitd
Jun 29 22:20:46 localhost kernel: [1337443.329596] [    639]   104   639    56087      456    86016        0             0 rsyslogd
Jun 29 22:20:46 localhost kernel: [1337443.329598] [    647]     0   647     4198      236    69632        0             0 systemd-logind
Jun 29 22:20:46 localhost kernel: [1337443.329600] [    654]     0   654      948       43    49152        0             0 atd
Jun 29 22:20:46 localhost kernel: [1337443.329601] [    658]     0   658     3043      231    65536        0         -1000 sshd
Jun 29 22:20:46 localhost kernel: [1337443.329603] [    683]     0   683    78614      433   114688        0             0 ModemManager
Jun 29 22:20:46 localhost kernel: [1337443.329605] [    703]     0   703    26978     1886   110592       29             0 unattended-upgr
Jun 29 22:20:46 localhost kernel: [1337443.329607] [    719]     0   719     1400       30    45056        0             0 agetty
Jun 29 22:20:46 localhost kernel: [1337443.329609] [    723]     0   723     1457       28    45056        0             0 agetty
Jun 29 22:20:46 localhost kernel: [1337443.329610] [  15192]   113 15192    54401     2959   163840      310          -900 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329612] [  15194]   113 15194    54532    35150   417792      320             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329614] [  15195]   113 15195    54434    34835   417792      316             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329616] [  15196]   113 15196    54401     1293   143360      326             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329617] [  15197]   113 15197    54570     1062   217088      362             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329619] [  15198]   113 15198    17582      245   118784      320             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329621] [  15199]   113 15199    54509      385   147456      368             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329623] [  34058]  1000 34058 268822249  1925069 16777216   129916             0 chainweb-data
Jun 29 22:20:46 localhost kernel: [1337443.329625] [ 385311]   113 385311    54844     7442   385024      272             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329627] [ 385363]     0 385363     3043      209    65536        0             0 sshd
Jun 29 22:20:46 localhost kernel: [1337443.329629] [ 385364]   113 385364    54633     1020   163840      287             0 postgres
Jun 29 22:20:46 localhost kernel: [1337443.329630] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kadena-indexer.service,task=chainweb-data,pid=34058,uid=1000
Jun 29 22:20:46 localhost kernel: [1337443.329671] Out of memory: Killed process 34058 (chainweb-data) total-vm:1075288996kB, anon-rss:7700276kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:16384kB oom_score_adj:0
Jun 29 22:20:46 localhost kernel: [1337443.742795] oom_reaper: reaped process 34058 (chainweb-data), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
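
Until the leak itself is fixed, the damage can be contained at the service level by letting systemd cap the unit's memory, so the cgroup OOM killer terminates chainweb-data (and the existing Restart= policy brings it back) before the whole host starts swapping. A minimal sketch of a drop-in; the 6G value is an assumption and should be sized below the host's RAM, and MemoryMax= applies on cgroup v2 (on cgroup v1 the equivalent is MemoryLimit=):

# /etc/systemd/system/kadena-indexer.service.d/limits.conf
[Service]
MemoryMax=6G

Apply it with systemctl daemon-reload followed by systemctl restart kadena-indexer.service.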

(Screenshots taken 2022-06-30 at 16:58:16 and 16:57:55)
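
Since chainweb-data is a GHC-compiled binary, the heap can also be capped at the Haskell runtime level with the RTS -M option, which makes the process exit with a heap-overflow error instead of being reaped by the kernel OOM killer. This is only a sketch: it assumes the binary was built with -rtsopts enabled, the 6g figure is an arbitrary example, and <args> stands for whatever arguments the service normally passes:

# Cap the GHC heap at 6 GB; on overflow the process fails fast with
# "heap overflow" rather than triggering the kernel OOM killer.
chainweb-data <args> +RTS -M6g -RTS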


@blazesal thank you for opening this ticket. I want to let you know that we'll turn our focus to chainweb-data's performance right after releasing the new features that are about to be merged.

@blazesal Thanks again for opening this ticket. I believe this performance issue should be fixed in the recent 2.2.0 release. I would appreciate it if you could upgrade to 2.2.0 and check whether it indeed resolves the issue for you. Otherwise, I'll close this ticket in a week.

Closing this ticket, since I believe it was fixed by #148 as part of the 2.2.0 release.