Force Terminating Containers Can Leave Rouge Processes

Question

Force Terminating Containers Can Leave Rouge Processes

deedubb opened this issue 5 months ago · 10 comments

Describe the bug

I am running 1.26.6 clusters/nodes in Azure Kubernetes Service. I run large asp.net applications which are being migrates to Linux. In the meantime we are running .net4.8 based webservices on Windows 2022 servers. We are running nodepools of 17-20 servers, with 30 pods on each server -- that is just to say we have a large sample base.

We have found a problem where configuration of IIS sites can hang on nodes; and sometimes when IIS is being asked to stop in a container it can cause the process to hang indefinitely. That in itself might be an annoyance as liveliness checks would mark the pod unhealthy. However, on our case it's like it's a blocking I/O or kernel call and no command is powerful enough (from the pod, kubernetes kubectl, or even taskkill / taskmgr / wmi terminate process). We typically see everything but the powershell/w3wp/svchost, csrss, and the containerd process exit on its own during a force delete.

Our way of dealing with this is now not running force - so we have a reference to the terminating pod. When a pod has been terminating for more than a few minutes we will start change control procedures to evict the underlying node - evicting and deleting it while provisioning a new node. This is our only option to deal with these rogue processes

I do not understand, nor able to find documentation, on how containers and http.sys work on windows. When I ask from the node "netsh http" ok information I get nothing back.

To Reproduce
Steps to reproduce the behavior:
Run a lot of windows containers with IIS webapps with a large number of SSL bindings, multiple sites per windows container (not micro services)

Expected behavior
Containers wouldn't hang at shutdown. Csrss would be able to terminate children like an RDP session logging off.

Configuration:

Edition: Windows Server 2022
Base Image being used: Windows Server Core, asp.net 4.8 with WCF
Container engine: AKS 1.26.6, containerd
Container Engine version: 1.26.6

Answer 1 · 2024-03-15T05:15:31.000Z

I am seeing similar or probably related issues in AKS.

I can reproduce by issuing iisreset /stop after sigterm is sent to the container, the entry point hangs indefinitely.

Also, I am seeing too often pods stuck in destroying status requiring decommissioning and spinning a new node replacement. Some of these are IIS other are MSQL.

I also noticed that overloaded nodes have terrible issues terminating pods, it's easy to reproduce by using very small nodes (i.e. the cheapest B series VM) and putting some decent load into the the node becomes unresponsive and hangs (no RDP to the node is possible).

I will be capturing kubelet logs when any of this happens again.

Answer 2 · 2024-03-15T05:26:10.000Z

We've stopped working on this issue because we are having a bigger problem with w3wp.exe having locked pages when the w3wp.exe terminates causing the node to blue screen

Hopefully you can have a solution for me after we get our blue screening problem fixes

Answer 3 · 2024-03-15T14:51:17.000Z

Have you escalated this to azure support? BSOD on an AKS node looks like something they should be looking into.

Answer 4 · 2024-03-16T03:05:01.000Z

Yeah, have you used azure support? Brutal. Asking me for all my yamls yada yada. I personally think it's because we "saved money" by using AMD

…

On Fri, Mar 15, 2024, 7:51 AM Davvid ***@***.***> wrote: Have you escalated this to azure support? BSOD on an AKS node looks like something they should be looking into. — Reply to this email directly, view it on GitHub <#469 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABAN4UENW2KMA5HZ74AB4N3YYMDHXAVCNFSM6AAAAABEBW2OXCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJZHAZTIMJVHA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

Answer 5 · 2024-03-18T18:54:50.000Z

Hey @deedubb, can you provide some repro steps where you can trigger this issue?

Answer 6 · 2024-03-26T18:43:21.000Z

+1 to @fady-azmy-msft's comment. We can't take a look without explicit repro steps.

Answer 7 · 2024-03-27T01:37:11.000Z

We just finished the upgrade to 1.28 in our test environment -- give me a few days to see if I can find repo steps

…

On Tue, Mar 26, 2024, 11:43 AM Nicole Trappe ***@***.***> wrote: +1 to @fady-azmy-msft <https://github.com/fady-azmy-msft>'s comment. We can't take a look without explicit repro steps. — Reply to this email directly, view it on GitHub <#469 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABAN4UDXRFJTOU4CDNFTUVLY2G6V7AVCNFSM6AAAAABEBW2OXCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRRGIYTINBXGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Answer 8 · 2024-03-27T17:26:32.000Z

Awesome, thanks @deedubb

Answer 9 · 2024-04-26T18:17:11.000Z

This issue has been open for 30 days with no updates.
no assignees, please provide an update or close this issue.

Answer 10 · 2024-04-27T02:08:20.000Z

We upgraded to 1.28, we've been busy dealing with all the instability after the upgrade - hns unable to deallocate for containers, IP address allocator going offline and refusing connections, crashing nodes due to protected memory still being referenced... I'm sure I'm missing a few.

I have a laundry list of feedback for running windows nodes in AKS -- tldr: don't -- however I do not have repo steps