Azure/WALinuxAgent

[BUG] OOMKill every hour

Pluggi opened this issue · 2 comments

Pluggi commented

Describe the bug: A clear and concise description of what the bug is.

The WALinuxAgent process is running every hour and getting killed everytime. There is plenty of RAM available:

# date -Ins; free -mh
2023-04-17T09:43:26,011104086+00:00
              total        used        free      shared  buff/cache   available
Mem:           125G        5.3G         50G         18G         70G        101G
Swap:            0B          0B          0B

Here are some logs:

# dmesg -T | grep 'Killed process'
[Mon Apr 17 04:16:45 2023] Memory cgroup out of memory: Killed process 179638 (python3) total-vm:81128kB, anon-rss:13824kB, file-rss:9284kB, shmem-rss:0kB, UID:0 pgtables:196kB oom_score_adj:0
[Mon Apr 17 05:16:50 2023] Memory cgroup out of memory: Killed process 305640 (python3) total-vm:80996kB, anon-rss:13576kB, file-rss:9340kB, shmem-rss:0kB, UID:0 pgtables:200kB oom_score_adj:0
[Mon Apr 17 06:16:55 2023] Memory cgroup out of memory: Killed process 432966 (python3) total-vm:80996kB, anon-rss:13584kB, file-rss:9132kB, shmem-rss:0kB, UID:0 pgtables:200kB oom_score_adj:0
[Mon Apr 17 07:13:56 2023] Memory cgroup out of memory: Killed process 926955 (postgres) total-vm:640668kB, anon-rss:2684kB, file-rss:6932kB, shmem-rss:475016kB, UID:26 pgtables:1200kB oom_score_adj:985
[Mon Apr 17 07:13:56 2023] Memory cgroup out of memory: Killed process 430471 (postgres) total-vm:690984kB, anon-rss:47020kB, file-rss:14620kB, shmem-rss:353184kB, UID:26 pgtables:1248kB oom_score_adj:985
[Mon Apr 17 07:17:00 2023] Memory cgroup out of memory: Killed process 560448 (python3) total-vm:80996kB, anon-rss:13588kB, file-rss:9392kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:0
[Mon Apr 17 07:40:15 2023] Memory cgroup out of memory: Killed process 553674 (postgres) total-vm:669692kB, anon-rss:30128kB, file-rss:14712kB, shmem-rss:285632kB, UID:26 pgtables:960kB oom_score_adj:985
[Mon Apr 17 08:17:05 2023] Memory cgroup out of memory: Killed process 687759 (python3) total-vm:80988kB, anon-rss:13608kB, file-rss:9216kB, shmem-rss:0kB, UID:0 pgtables:196kB oom_score_adj:0
[Mon Apr 17 09:17:10 2023] Memory cgroup out of memory: Killed process 807996 (python3) total-vm:80740kB, anon-rss:13612kB, file-rss:9180kB, shmem-rss:0kB, UID:0 pgtables:188kB oom_score_adj:0

# journalctl -e
Apr 17 09:17:04 aks-general1z2-58172555-vmss000006 kubelet[5790]: I0417 09:17:04.861151    5790 kubelet.go:2079] "SyncLoop UPDATE" source="api" pods=[namespace/pod-name]
Apr 17 09:17:06 aks-general1z2-58172555-vmss000006 python3[1626]: 2023-04-17T09:17:06.420655Z INFO CollectLogsHandler ExtHandler Starting log collection...
Apr 17 09:17:06 aks-general1z2-58172555-vmss000006 systemd[1]: Started /usr/bin/python3 -u bin/WALinuxAgent-2.9.0.4-py2.7.egg -collect-logs.
Apr 17 09:17:08 aks-general1z2-58172555-vmss000006 kubelet[5790]: I0417 09:17:08.218778    5790 kubelet.go:2079] "SyncLoop UPDATE" source="api" pods=[namespace/pod-name]
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: python3 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: CPU: 3 PID: 807996 Comm: python3 Not tainted 5.4.0-1091-azure #96~18.04.1-Ubuntu
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: Call Trace:
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  dump_stack+0x57/0x6d
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  dump_header+0x4f/0x200
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  oom_kill_process+0xe6/0x120
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  out_of_memory+0x117/0x540
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  mem_cgroup_out_of_memory+0xbb/0xd0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  try_charge+0x762/0x7c0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  ? __alloc_pages_nodemask+0x153/0x320
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  mem_cgroup_try_charge+0x75/0x190
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  mem_cgroup_try_charge_delay+0x22/0x50
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  __handle_mm_fault+0x943/0x1330
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  handle_mm_fault+0xb7/0x200
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  __do_page_fault+0x29c/0x4c0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  do_page_fault+0x35/0x110
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel:  page_fault+0x39/0x40
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: RIP: 0033:0x4b9997
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: Code: 20 89 72 14 83 ee 01 49 bd ab aa aa aa aa aa aa aa 48 8b 7c 24 30 48 89 f8 48 2b 44 24 20 48 8b 57 08 48 c1 f8 04 49 0f af c5 <c7> 42 24 ff ff 00 00 89 42 20 48 8d 82 00 10 00 00 48 89 47 08 89
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: RSP: 002b:00007fff07ece030 EFLAGS: 00010a07
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: RAX: 0000000000000022 RBX: 0000000000000001 RCX: 0000000000000027
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: RDX: 00007faac2b72000 RSI: 0000000000000015 RDI: 00000000019652d0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: RBP: 000000000000014f R08: 00007faac2b71000 R09: 00000000000000a0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: R10: 0000000000000001 R11: 0000000000000017 R12: 00007faac2b71378
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: R13: aaaaaaaaaaaaaaab R14: 00007faac2b71530 R15: 00000000019af970
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: memory: usage 30720kB, limit 30720kB, failcnt 223429
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: kmem: usage 17092kB, limit 9007199254740988kB, failcnt 0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: Memory cgroup stats for /azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice:
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: anon 14409728
                                                           file 135168
                                                           kernel_stack 110592
                                                           slab 16142336
                                                           sock 0
                                                           shmem 0
                                                           file_mapped 0
                                                           file_dirty 0
                                                           file_writeback 0
                                                           anon_thp 0
                                                           inactive_anon 0
                                                           active_anon 14868480
                                                           inactive_file 442368
                                                           active_file 110592
                                                           unevictable 0
                                                           slab_reclaimable 6770688
                                                           slab_unreclaimable 9371648
                                                           pgfault 31423161
                                                           pgmajfault 0
                                                           workingset_refault 125004
                                                           workingset_activate 6666
                                                           workingset_nodereclaim 0
                                                           pgrefill 221427
                                                           pgscan 580758
                                                           pgsteal 251185
                                                           pgactivate 236709
                                                           pgdeactivate 221426
                                                           pglazyfree 0
                                                           pglazyfreed 0
                                                           thp_fault_alloc 0
                                                           thp_collapse_alloc 0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: Tasks state (memory values in pages):
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: [ 807996]     0 807996    20185     5698   192512        0             0 python3
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task=python3,pid=807996,uid=0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kernel: Memory cgroup out of memory: Killed process 807996 (python3) total-vm:80740kB, anon-rss:13612kB, file-rss:9180kB, shmem-rss:0kB, UID:0 pgtables:188kB oom_score_adj:0
Apr 17 09:17:11 aks-general1z2-58172555-vmss000006 kubelet[5790]: I0417 09:17:11.034014    5790 manager.go:1248] failed getting container info for "/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice": unknown container "/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice"

Distro and WALinuxAgent details (please complete the following information):

waagent --version
WALinuxAgent-2.2.45 running on ubuntu 18.04
Python: 3.6.9
Goal state agent: 2.9.0.4

Additional context

Thanks for reporting, we already fix the issue #2757. It will roll out soon.

Closing - fix is being rolled out as part of release 2.9.1