microsoft/Windows-Containers

Force-Terminating Containers Can Leave Rogue Processes

deedubb opened this issue · 10 comments

Describe the bug

I am running 1.26.6 clusters/nodes in Azure Kubernetes Service. I run large ASP.NET applications that are being migrated to Linux; in the meantime we are running .NET 4.8-based web services on Windows Server 2022. We are running node pools of 17-20 servers with 30 pods on each server -- that is just to say we have a large sample base.

We have found a problem where configuration of IIS sites can hang on nodes, and sometimes when IIS is asked to stop in a container, the process hangs indefinitely. That in itself might be an annoyance, as liveness checks would mark the pod unhealthy. However, in our case it appears to be stuck in a blocking I/O or kernel call, and no command is powerful enough to kill it (from inside the pod, via kubectl, or even taskkill / taskmgr / WMI terminate process). During a force delete we typically see everything exit on its own except the powershell/w3wp/svchost processes, csrss, and the containerd process.
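
For anyone trying to confirm the same state, a minimal diagnostic sketch, run from the node, that lists the survivor processes and summarizes which kernel wait reasons their threads are parked in (the process names are just the ones from our case; adjust as needed):

```powershell
# Diagnostic sketch (assumes direct PowerShell access to the Windows node).
# Lists the processes that survive a force delete in our case and groups
# their waiting threads by wait reason.
Get-Process -Name w3wp, csrss, svchost, powershell -ErrorAction SilentlyContinue |
    ForEach-Object {
        $waits = $_.Threads |
            Where-Object { $_.ThreadState -eq 'Wait' } |
            Group-Object -Property WaitReason
        [pscustomobject]@{
            Name        = $_.Name
            Id          = $_.Id
            SessionId   = $_.SessionId   # container processes live in their own session
            WaitReasons = ($waits | ForEach-Object { "$($_.Name)=$($_.Count)" }) -join ', '
        }
    } | Format-Table -AutoSize
```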

Our way of dealing with this now is not to force delete, so we keep a reference to the terminating pod. When a pod has been terminating for more than a few minutes, we start change-control procedures to replace the underlying node: draining and deleting it while provisioning a new one (sketched below). This is our only option for dealing with these rogue processes.
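
A rough sketch of that procedure, assuming kubectl is pointed at the cluster; the 5-minute cutoff and the drain flags are our choices, not anything prescribed:

```powershell
# Find pods that have been Terminating for more than 5 minutes, then drain the
# node they sit on so it can be deleted and replaced.
$cutoff = (Get-Date).AddMinutes(-5)
$pods = kubectl get pods --all-namespaces -o json | ConvertFrom-Json
foreach ($pod in $pods.items | Where-Object { $_.metadata.deletionTimestamp }) {
    if ([datetime]$pod.metadata.deletionTimestamp -lt $cutoff) {
        $node = $pod.spec.nodeName
        Write-Host "Pod $($pod.metadata.name) stuck terminating on node $node"
        kubectl cordon $node
        kubectl drain $node --ignore-daemonsets --delete-emptydir-data
        # the node is then deleted and a replacement provisioned through our
        # change-control tooling (not shown)
    }
}
```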

I do not understand, and cannot find documentation on, how containers and http.sys interact on Windows. When I ask the node for information with `netsh http`, I get nothing back.
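
For reference, this is roughly what I am running; whether http.sys bindings created inside a container should be visible from the host at all is exactly what I cannot find documented (the pod name is a placeholder):

```powershell
# Inside the container: these show the SSL bindings and request queues the sites use.
kubectl exec <pod> -- netsh http show sslcert
kubectl exec <pod> -- netsh http show servicestate

# On the node itself: the same commands return nothing for the container's bindings.
netsh http show sslcert
netsh http show servicestate
```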

To Reproduce
Steps to reproduce the behavior:
Run a large number of Windows containers hosting IIS web apps with a large number of SSL bindings and multiple sites per container (not microservices); see the setup sketch below.
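
A minimal sketch of that setup for inside the container, assuming the ASP.NET 4.8 base image with the WebAdministration module available; the site names, counts, and ports are illustrative only:

```powershell
# Create many sites, each with an HTTP and a self-signed HTTPS binding, to
# approximate our "multiple sites per container" layout.
Import-Module WebAdministration

1..20 | ForEach-Object {
    $name     = "site$_"
    $httpPort = 8000 + $_
    $sslPort  = 9000 + $_

    New-Item "C:\inetpub\$name" -ItemType Directory -Force | Out-Null
    New-Website -Name $name -Port $httpPort -PhysicalPath "C:\inetpub\$name" | Out-Null

    # Self-signed cert plus an https binding registered with http.sys
    $cert = New-SelfSignedCertificate -DnsName "$name.local" `
                -CertStoreLocation Cert:\LocalMachine\My
    New-WebBinding -Name $name -Protocol https -Port $sslPort
    New-Item "IIS:\SslBindings\0.0.0.0!$sslPort" -Value $cert | Out-Null
}
```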

Expected behavior
Containers wouldn't hang at shutdown. csrss would be able to terminate its child processes, the way it does when an RDP session logs off.

Configuration:

  • Edition: Windows Server 2022
  • Base image being used: Windows Server Core, ASP.NET 4.8 with WCF
  • Container engine: containerd (on AKS 1.26.6)
  • Container engine version: the 1.26.6 above is the AKS/Kubernetes version; the containerd version was not captured

I am seeing similar or probably related issues in AKS.

I can reproduce this by issuing `iisreset /stop` after SIGTERM is sent to the container; the entrypoint then hangs indefinitely. A sketch of the sequence is below.
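
Roughly the sequence I use (hedged: the pod name is a placeholder, and exactly how the shutdown signal is delivered depends on the image's entrypoint):

```powershell
# Trigger termination so the shutdown signal is delivered, but don't block on it.
kubectl delete pod <pod> --wait=false

# While the container is terminating, stop IIS from inside it; this is the
# point at which the entrypoint wedges for us.
kubectl exec <pod> -- cmd /c "iisreset /stop"

# The pod then sits in Terminating indefinitely.
kubectl get pod <pod> --watch
```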

Also, I am seeing, far too often, pods stuck in Terminating status, requiring the node to be decommissioned and a replacement spun up. Some of these are IIS, others are MSSQL.

I also noticed that overloaded nodes have terrible trouble terminating pods. It's easy to reproduce by using very small nodes (i.e. the cheapest B-series VMs) and putting a decent load on them: the node becomes unresponsive and hangs (no RDP to the node is possible). A sketch of such a node pool is below.
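
For completeness, a hedged sketch of provisioning such an undersized Windows node pool with the Azure CLI; the resource group, cluster name, and VM size are placeholders, and any small B-series size should do:

```powershell
# Add a deliberately undersized Windows node pool to reproduce the hang under load.
az aks nodepool add `
    --resource-group <rg> `
    --cluster-name <cluster> `
    --name winb `
    --os-type Windows `
    --node-vm-size Standard_B2s `
    --node-count 2
# Then schedule the usual IIS workload onto it and apply sustained traffic.
```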

I will be capturing kubelet logs when any of this happens again.
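
The capture plan, for what it's worth; the C:\k paths are my assumption of the standard AKS Windows node layout, so adjust to wherever the node actually writes them:

```powershell
# From a session on the node (via SSH or a debug connection):
Get-ChildItem C:\k\*.log                           # see which component logs exist
Get-Content C:\k\kubelet.err.log -Tail 200 -Wait   # follow kubelet while a pod is stuck

# Plus the containerd side of the story:
crictl ps -a                                       # containers the runtime still thinks exist
```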

We've stopped working on this issue because we are dealing with a bigger problem: w3wp.exe holding locked pages when it terminates, which causes the node to blue screen.

Hopefully you can have a solution for me after we get our blue-screening problem fixed.

Have you escalated this to Azure support? A BSOD on an AKS node looks like something they should be looking into.

Hey @deedubb, can you provide some repro steps where you can trigger this issue?

+1 to @fady-azmy-msft's comment. We can't take a look without explicit repro steps.

Awesome, thanks @deedubb

This issue has been open for 30 days with no updates.
No assignees; please provide an update or close this issue.

We upgraded to 1.28, and we've been busy dealing with all the instability after the upgrade: HNS unable to deallocate for containers, the IP address allocator going offline and refusing connections, nodes crashing due to protected memory still being referenced... I'm sure I'm missing a few.

I have a laundry list of feedback for running Windows nodes in AKS -- tl;dr: don't -- however I do not have repro steps.