Azure/WALinuxAgent

[Questions] Questions related to OOM issues

kmin1223 opened this issue · 1 comment

  1. Symptom: walinuxagent is being killed by the OOM killer every hour.

VM id : /subscriptions/15c23017-dd30-4b1d-9eb8-23240a12ffaa/resourceGroups/RG-krz-p-mgt/providers/Microsoft.Compute/virtualMachines/krz-p-mgt-zab02

Offer | UbuntuServer
SKU | 18.04-LTS
Exact Version | 18.04.201912180
VM Size | Standard_D16s_v3

  2. Syslog from one of the OOM kill events (a quick way to check the cgroup limit involved is sketched after item 3):

May 17 07:01:19 krz-p-mgt-zab02 python3[20236]: 2023-05-17T07:01:19.018010Z INFO CollectLogsHandler ExtHandler Starting log collection...

May 17 07:01:19 krz-p-mgt-zab02 systemd[1]: Started /usr/bin/python3 -u /usr/sbin/waagent -collect-logs.

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902598] python3 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902602] CPU: 0 PID: 7211 Comm: python3 Tainted: G        W         5.4.0-1063-azure #66~18.04.1-Ubuntu

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902603] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902603] Call Trace:

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902611]  dump_stack+0x57/0x6d

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902616]  dump_header+0x4f/0x200

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902618]  oom_kill_process+0xe6/0x120

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902620]  out_of_memory+0x117/0x540

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902623]  mem_cgroup_out_of_memory+0xbb/0xd0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902625]  try_charge+0x762/0x7c0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902628]  ? __alloc_pages_nodemask+0x153/0x320

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902631]  mem_cgroup_try_charge+0x75/0x190

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902632]  mem_cgroup_try_charge_delay+0x22/0x50

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902636]  __handle_mm_fault+0x943/0x1330

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902638]  handle_mm_fault+0xb7/0x200

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902641]  __do_page_fault+0x29c/0x4c0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902642]  do_page_fault+0x35/0x110

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902646]  page_fault+0x39/0x40

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902648] RIP: 0033:0x7f8d077a226d

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902650] Code: 31 f6 48 29 d8 48 8d 3c 19 49 39 d6 40 0f 95 c6 48 83 cb 01 48 83 c8 01 48 c1 e6 02 48 89 da 49 89 7e 60 48 09 f2 48 89 51 08 <48> 89 47 08 e9 8d fe ff ff 48 8b 4a 28 eb 04 48 8b 49 28 48 8b 51

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902651] RSP: 002b:00007ffde9e9c6e0 EFLAGS: 00010206

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902652] RAX: 0000000000013471 RBX: 0000000000000c11 RCX: 0000000002133f80

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902653] RDX: 0000000000000c11 RSI: 0000000000000000 RDI: 0000000002134b90

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902654] RBP: ffffffffffffffb0 R08: 0000000000000077 R09: 0000000000000000

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902654] R10: 0000000001e91010 R11: 0000000000000000 R12: 00000000000000bf

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902655] R13: 00007f8d07af8ca0 R14: 00007f8d07af8c40 R15: 0000000000000000

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902656] memory: usage 30720kB, limit 30720kB, failcnt 1861487

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902657] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902658] kmem: usage 17896kB, limit 9007199254740988kB, failcnt 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902658] Memory cgroup stats for /azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice:

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] anon 13312000

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] file 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] kernel_stack 73728

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] slab 18198528

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] sock 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] shmem 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] file_mapped 675840

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] file_dirty 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] file_writeback 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] anon_thp 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] inactive_anon 6156288

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] active_anon 6705152

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] inactive_file 98304

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] active_file 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] unevictable 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] slab_reclaimable 8925184

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] slab_unreclaimable 9273344

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] pgfault 22501017

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] pgmajfault 15246

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] workingset_refault 3152358

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] workingset_activate 74481

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] workingset_nodereclaim 0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] pgrefill 828604

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] pgscan 5576674

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] pgsteal 4487420

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902667] pgactivate 887304

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902668] Tasks state (memory values in pages):

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902669] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902671] [   7211]     0  7211    19957     5497   192512        0             0 python3

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902672] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task=python3,pid=7211,uid=0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.902679] Memory cgroup out of memory: Killed process 7211 (python3) total-vm:79828kB, anon-rss:12744kB, file-rss:9244kB, shmem-rss:0kB, UID:0 pgtables:188kB oom_score_adj:0

May 17 07:01:22 krz-p-mgt-zab02 kernel: [18823778.929672] oom_reaper: reaped process 7211 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

May 17 07:01:23 krz-p-mgt-zab02 python3[20236]: 2023-05-17T07:01:23.345968Z INFO CollectLogsHandler ExtHandler Log Collector exited with code -9

  3. From the waagent log, the agent version is 2.9.0.4:

2023-05-17T10:26:01.329859Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 49;HeartbeatId: F7F025A7-6EDF-4139-8C52-CF5C15B5BCA8;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]

From the waagent log, the Log Collector repeatedly exits with code -9 (SIGKILL), matching the OOM kills above:

2023-05-17T09:01:27.607628Z INFO CollectLogsHandler ExtHandler Starting log collection...
2023-05-17T09:01:32.046126Z INFO CollectLogsHandler ExtHandler Log Collector exited with code -9
2023-05-17T09:25:58.823618Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 47;HeartbeatId: F7F025A7-6EDF-4139-8C52-CF5C15B5BCA8;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
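The kill is scoped to the agent's log-collector cgroup (/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice), whose memory limit the kernel reports as 30720kB. For anyone hitting the same pattern, a quick way to confirm the limit, the agent version, and how often the collector is being killed; this is a sketch and assumes Ubuntu 18.04 with cgroup v1 and the default /var/log/waagent.log path, not commands from the original report:

# Memory limit in effect for the log-collector slice (30720kB == 31457280 bytes)
cat /sys/fs/cgroup/memory/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice/memory.limit_in_bytes

# Installed agent version
waagent --version

# How many times the log collector has been killed
grep -c "Log Collector exited with code -9" /var/log/waagent.log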

Questions

  1. It seems to be related to the bug below. Is that correct?
    #2805

  2. It looks like the fix is still in a pre-release stage. May I know when the new version will be released to KoreaCentral?
    https://github.com/Azure/WALinuxAgent/releases

  3. While it is in pre-release, are there any workarounds?

  4. Which version is free of this bug? If possible, I would downgrade waagent to a known-good version.

@kmin1223

1- Yes
2- It should reach the entire fleet within the next 3 - 4 weeks
3- You could disable the process that is being terminated by setting Logs.Collect to n in /etc/waagent.conf (see the example below)
4- Release 2.9.1 fixes this issue
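For reference, a minimal sketch of the workaround in answer 3. It assumes Ubuntu's walinuxagent service name; some distros use waagent.service instead:

# In /etc/waagent.conf, set (add the line if it is not already present):
Logs.Collect=n

# Restart the agent so the change takes effect (Ubuntu service name)
sudo systemctl restart walinuxagent

This stops the agent from launching the periodic log-collector process, so the memory-limited logcollector slice is no longer used until you upgrade to a fixed agent version and re-enable Logs.Collect.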