microsoft/SDN

hnsautomitigator2019 Memory Usage

chandlerkent opened this issue · 5 comments

We have been running the hnsautomitigator2019 DaemonSet from this repository (https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/hnsautomitigator/hnsautomitigator2019.yaml), and we are seeing significant memory usage by its pods in what looks like a clear memory leak. Here are graphs of the average memory usage across 4 different AKS clusters:

[4 graphs: average hnsautomitigator2019 pod memory usage, one per AKS cluster]

We see individual pods using upwards of 20 GB of memory:

[graph: individual pod memory usage exceeding 20 GB]

We believe this memory usage is causing performance problems for our workloads on the same nodes, and we have seen networking outages in other pods on those nodes correlate with crashes of the hnsautomitigator2019 pods.

I am opening the issue here for public discussion, but we have also opened a case with Azure Support, to which we attached a memory dump of one of the pods:

#2407310040012800

We attempted to remove hnsautomitigator2019 from our clusters, but we saw a sharp increase in scheduling issues, so we had to add it back.

Hi @princepereira, we have added Remove-Job -Job $job -Force to the PowerShell script (after line 106), and it has dramatically reduced our memory usage:

[graph: memory usage dropping sharply after the Remove-Job change]
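For context on why this one line matters: background jobs created with Start-Job persist in the PowerShell session, along with their buffered output, until they are explicitly removed, so a long-running polling loop that never calls Remove-Job accumulates job objects indefinitely. Here is a minimal sketch of the pattern, with a hypothetical probe in the script block (the real script's job body differs):

```powershell
while ($true) {
    # Each Start-Job spawns a background job whose object lives in the
    # session until it is explicitly removed.
    $job = Start-Job -ScriptBlock { Get-Service -Name hns }

    # Bounded wait so a hung HNS query cannot block the loop forever.
    Wait-Job -Job $job -Timeout 30 | Out-Null
    $result = Receive-Job -Job $job   # collect output before disposal

    # Without this line, the completed job object and its buffered
    # output stay in memory on every iteration: the leak reported above.
    Remove-Job -Job $job -Force

    Start-Sleep -Seconds 60
}
```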

I would love to get this into the "official" version in this repository as well.

Additionally, I was wondering why there is a break on line 113?

This means the pod runs to completion and then needs to restart. Is there something about this particular code path where we want the pod to restart?


Thanks @chandlerkent for taking a look at this. Remove-Job -Job $job -Force might make sense. I will check with the team and provide you with an official version.

Regarding the break at line 113:
The else block is meant to catch a deadlock for one of the HNS IDs. If a deadlock is detected, we can go ahead with the automitigate function and break out of the hnsids loop, hence the break command.

@princepereira thank you for the response.

In our experience, the break on line 113 breaks out of the while loop, which then causes the pod to restart. I do not think this was the intended use of the break, but it has been a nice side effect, because the pod's restart time tells us exactly when the HNS service was restarted. It was just odd to us that the pod runs to completion in this case but not in other cases where action is taken.

@chandlerkent,
This script is not an official deliverable from the HNS team, so we won't be making any PRs for it. The script is provided to mitigate the issue; the recommendation is to upgrade to 2022.
We won't have 2019 clusters to test the changes.
The second observation about the break is not right: the break at line 113 will only break out of the inner for loop.
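For reference, an unlabeled break in PowerShell exits only the innermost enclosing loop. A self-contained illustration of the structure being discussed, where Test-Deadlock, Invoke-AutoMitigate, and the sample $hnsIds are hypothetical stand-ins, not the script's actual functions or data:

```powershell
# Hypothetical stubs; only the loop/break structure matters here.
function Test-Deadlock($id)  { $id -eq 'deadlocked' }
function Invoke-AutoMitigate { Write-Host 'mitigating...' }

$hnsIds = @('ok-1', 'deadlocked', 'ok-2')

while ($true) {
    foreach ($hnsId in $hnsIds) {
        if (Test-Deadlock $hnsId) {
            Invoke-AutoMitigate
            break                 # exits only this inner foreach
        }
    }
    # The break lands here: the outer while keeps polling, so the break
    # alone does not end the script.
    Start-Sleep -Seconds 10
}
```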

@princepereira thanks for the feedback. I will close this issue.