bentoml/Yatai

Yatai memory leak

a-pichard opened this issue · 8 comments

Hi, I noticed that my Yatai pod (running in the yatai-system namespace) keeps getting evicted due to memory pressure on the node. I don't think running it on a bigger node would solve the issue; it looks like a memory leak to me.

Screenshot 2024-03-25 at 11 53 30

Running yatai 1.1.13

Is this happening to anyone else here?

I have the same problem. I asked in the Slack channel but couldn't get an answer.

I’m using the image quay.io/bentoml/yatai:1.1.13, but it keeps getting OOM killed in my cluster.

Screenshot 2024-03-29 at 18 40 44

Hi, can you provide your Yatai version? Is it also 1.1.13? @a-pichard

Yes, I am running Yatai 1.1.13.

It's been 3 weeks. Is there anything you can give us to help understand the cause and, possibly, how to fix it?

Sorry for the late reply. According to the release notes, I don't think the leak was introduced in 1.1.13; it probably comes from an older version, since 1.1.13 only includes a minor fix in the Helm chart.
If you can tell us a version that does not have the memory leak, that will help us find the root cause.

I downgraded to 1.1.11 and the memory leak still continues. I checked the processes running inside the container, and the only one was "/app/api-server serve -c /conf/config.yaml". It consistently reaches the 3Gi memory limit and gets OOM killed, without any significant increase in workload. The application configuration and Kubernetes setup are standard, with memory limits set as expected. Could you please help identify what might be causing this memory growth?
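To tell a steady leak apart from a one-off spike, it can help to sample the resident memory of the api-server process over time and look at the trend. The following is only a minimal sketch, not something taken from Yatai itself: it assumes Python is available somewhere the process's `/proc/<pid>/status` file is readable (for example on the node), and the function names `parse_vmrss_kib`, `growth_rate`, and `sample_rss` are made up for illustration.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def growth_rate(samples):
    """Least-squares slope of (seconds, KiB) samples, in KiB per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples have the same timestamp")
    return num / den


def sample_rss(pid, count, interval_s):
    """Read VmRSS for `pid` `count` times, `interval_s` seconds apart."""
    samples = []
    start = time.monotonic()
    for _ in range(count):
        text = Path(f"/proc/{pid}/status").read_text()
        samples.append((time.monotonic() - start, parse_vmrss_kib(text)))
        time.sleep(interval_s)
    return samples
```

A slope that stays clearly positive across a long window with a flat workload would be consistent with a leak, although resident memory alone cannot prove one. Usage would look like `growth_rate(sample_rss(pid, 60, 60.0))`, where `pid` is the process ID of `/app/api-server`.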

/app/api-server serve is actually the entrypoint of the Yatai backend.
cc @yetone

I have been seeing this too. It seems to have something to do with the version of yatai-deployment.

image

In the above graph, yatai-deployment was upgraded to 1.1.21 around 17:00, and downgraded back to 1.1.13 around 14:00. I have a 1GiB memory limit set. I will play with it some more and see if I can pin down the exact version that introduces the leak.

edit:
Yatai version is 1.1.13.
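Pinning down the first release that leaks can be done with a binary search over the candidate versions rather than trying each one in turn. The sketch below is illustrative only: `first_bad` is a made-up helper, `is_bad` stands for a manual check (deploy that version, watch memory for a while, report whether it leaks), and it assumes that once a version leaks, every later version does too.

```python
def first_bad(versions, is_bad):
    """Return the first version for which is_bad(version) is true.

    Assumes `versions` is ordered oldest to newest and that every version
    after the first bad one is also bad. Returns None if the newest
    version is not bad.
    """
    if not versions or not is_bad(versions[-1]):
        return None
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the first bad version is mid or earlier
        else:
            lo = mid + 1  # the first bad version is after mid
    return versions[lo]
```

With nine candidate releases this needs about four deployments after the initial check of the newest one, instead of up to nine.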

It looks like it is yatai-deployment 1.1.19 that introduces the leak.