Slow memory leak causing cosign subprocess termination
sbeck14 opened this issue · 12 comments
Describe the bug
We have Connaisseur 2.8.0 deployed in Kubernetes via a Helm install. After running normally for a while (anywhere from a few days to a week), we start seeing Connaisseur reject images and produce these errors in the logs:
INFO: COSIGN output of trust root 'default' for image'[redacted]': RETURNCODE: -9; STDOUT: ; STDERR:
INFO: {'message': 'Unexpected Cosign exception for image "[redacted]": .', 'context': {'trust_data_type': 'dev.cosignproject.cosign/signature', 'stderr': '', 'image': '[redacted]', 'trust_root': 'default', 'detection_mode': False}}
Inspecting the memory usage of the Connaisseur pods shows that the memory of the Connaisseur process slowly rises over time, indicating a memory leak.
Result of the ps command on a pod immediately after launching:
PID USER VSZ RSS COMMAND COMMAND
1 10001 60m 45m python python -m connaisseur
18 10001 1684 1072 sh /bin/sh
24 10001 1608 984 ps ps -o pid,user,vsz,rss,comm,args
Result of the ps command on a pod running for ~7 days:
PID USER VSZ RSS COMMAND COMMAND
1 10001 317m 302m python python -m connaisseur
4827 10001 1692 1060 sh /bin/sh
4835 10001 1616 984 ps ps -o pid,user,vsz,rss,comm,args
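Beyond spot checks with ps, here is a minimal sketch (assuming a Linux /proc filesystem and that Connaisseur runs as PID 1 in the container, as above) that could be used to log the RSS over time:

    import re
    import time

    def rss_kib(pid: int = 1) -> int:
        """Read the resident set size (VmRSS, in KiB) of a process from /proc."""
        with open(f"/proc/{pid}/status") as f:
            match = re.search(r"^VmRSS:\s+(\d+)\s+kB", f.read(), re.MULTILINE)
        return int(match.group(1)) if match else -1

    if __name__ == "__main__":
        # Print a timestamped RSS sample every 60 seconds; redirect to a file
        # to chart the slow growth over days.
        while True:
            print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} VmRSS={rss_kib(1)} KiB", flush=True)
            time.sleep(60)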
Once the memory usage approaches the memory limit on the pod, our theory is that Kubernetes begins killing the cosign sub-process immediately after it spawns because it causes the pod to exceed the memory limit. Afterwards, the main Connaisseur process reports the error and the memory usage continues growing until Kubernetes finally kills the main process, restarting the pod.
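For context on the RETURNCODE: -9 above: Python's subprocess module reports a child killed by a signal as the negative signal number, so -9 means the cosign process received SIGKILL (signal 9), which is the signal the kernel's OOM killer sends when the container hits its memory limit. A small illustration:

    import signal
    import subprocess

    # Start a long-running child and kill it the way the OOM killer would.
    proc = subprocess.Popen(["sleep", "60"])
    proc.send_signal(signal.SIGKILL)
    proc.wait()

    # Python reports death-by-signal as a negative return code: -9 here,
    # matching the "RETURNCODE: -9" in the Connaisseur log line above.
    print(proc.returncode)        # -9
    print(signal.SIGKILL.value)   # 9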
Expected behavior
Connaisseur process memory does not grow and stays about the same over time
Optional: To reproduce
I haven't been able to reliably reproduce this other than by waiting for the memory to grow over time. I tried restarting the deployment and hitting the API with an AdmissionReview 3000 times in a row, but the memory usage only increased by about 10 MB. I also tried this with different types of AdmissionReview (static verifier, cosign, and automatic child approval) with similar results.
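A sketch of such a replay loop, in case it helps anyone else reproduce: the URL (including the webhook path) and the AdmissionReview body are placeholders to adapt to your own cluster, assuming the Connaisseur service is reachable e.g. via kubectl port-forward:

    import json
    import ssl
    import urllib.request

    # Placeholder endpoint: adapt host, port and path to however you expose
    # the Connaisseur webhook (e.g. a port-forward to its service).
    WEBHOOK_URL = "https://localhost:8443/mutate"

    # Minimal AdmissionReview skeleton; names and the image reference are placeholders.
    ADMISSION_REVIEW = {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "request": {
            "uid": "00000000-0000-0000-0000-000000000000",
            "kind": {"group": "", "version": "v1", "kind": "Pod"},
            "resource": {"group": "", "version": "v1", "resource": "pods"},
            "operation": "CREATE",
            "object": {
                "apiVersion": "v1",
                "kind": "Pod",
                "metadata": {"name": "leak-test"},
                "spec": {"containers": [{"name": "app", "image": "registry.example.com/app:1.0"}]},
            },
        },
    }

    # The webhook serves a self-signed certificate, so skip verification for this test.
    CTX = ssl.create_default_context()
    CTX.check_hostname = False
    CTX.verify_mode = ssl.CERT_NONE

    def send_review() -> int:
        req = urllib.request.Request(
            WEBHOOK_URL,
            data=json.dumps(ADMISSION_REVIEW).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, context=CTX) as resp:
            return resp.status

    if __name__ == "__main__":
        for _ in range(3000):
            send_review()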
Optional: Versions (please complete the following information as relevant):
- Kubernetes Cluster: Kubernetes v1.21.9
- Container registry: Gitlab Enterprise
- Connaisseur: v2.8.0
Here are our policies & validators from our Helm values.yaml:
validators:
  - name: default
    type: static
    approve: true
  - name: gitlab
    type: cosign
    auth:
      secret_name: gitlab
    trust_roots:
      - name: default
        key: awskms:///[redacted]

policy:
  - pattern: "*:*"
  - pattern: "[redacted domain]/*:*"
    validator: gitlab
    with:
      trust_root: "*"
      threshold: 1
I'll try to create a setup to stress test Connaisseur and see whether I can replicate the problem. Other than that, with Connaisseur 3.0 we switched the parallelization from concurrent.futures back to asyncio, maybe that did the trick 🤷
I'll see what I can do.
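For anyone curious, the two parallelization patterns look roughly like the sketch below. This is an illustration only, not the actual Connaisseur code, and the cosign invocation is stripped down (real verification passes a key and further flags):

    import asyncio
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Pattern roughly corresponding to 2.x: each verification blocks a worker
    # thread on a synchronous subprocess call.
    def verify_blocking(image):
        return subprocess.run(
            ["cosign", "verify", image], capture_output=True, text=True
        )

    def verify_many_with_threads(images):
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(verify_blocking, images))

    # Pattern roughly corresponding to 3.0: the event loop awaits the
    # subprocess directly, with no thread pool involved.
    async def verify_async(image):
        proc = await asyncio.create_subprocess_exec(
            "cosign", "verify", image,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()
        return proc.returncode, stdout, stderr

    async def verify_many_async(images):
        return await asyncio.gather(*(verify_async(i) for i in images))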
This is great news! I’ll test out version 3.0 in our environment and report back. Thank you!
@phbelitz The issue does not appear to have been resolved with version 3.0; we are still experiencing the memory leak after the update.
Thought as much ... I'll try to look into it, but I can't make promises. On the plus side: we are thinking about a rewrite in Golang, where things will hopefully be a lot more efficient. This isn't set in stone yet, as a rewrite is quite costly, but I will keep you posted on any news :)
I'm experiencing the same memory leak issue in a production environment, currently on Connaisseur version 3.0.0. The leak is a critical concern, as it can lead to serious malfunctions and impact the stability of our production environment. I strongly urge prioritizing this issue, please!
Note that increasing the interval of the readiness and liveness probes has resulted in a slower memory leak. As these endpoints have no specific functionality (they simply return a 200 status), I suspect it could be an issue related to Flask, or maybe even to Cheroot, which wraps it.
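For illustration, a probe endpoint of that kind is essentially a no-op handler, roughly like the sketch below (not the actual Connaisseur source), which is why per-request memory growth points at the server layer rather than the handler itself:

    from flask import Flask

    app = Flask(__name__)

    # A probe handler like this allocates nothing of note per request, so memory
    # growth proportional to probe frequency suggests the WSGI server machinery,
    # not the application code, is retaining objects.
    @app.route("/health")
    @app.route("/ready")
    def health():
        return "", 200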
@jacobkoren1 would you be so kind and open a PR with your fixes?
Sure
Update: I changed Cheroot to Gunicorn and the memory leak was resolved.
In the considerations mentioned here (https://github.com/sse-secure-systems/connaisseur/blob/92affcfad4ce15a22328698d3af4a1b726217f65/docs/adr/ADR-7_wsgi-server.md#considered-options) I see that Gunicorn was an option that was considered but dropped due to poor performance compared with the other options. Do we plan to redo the tests and switch to Gunicorn then, @phbelitz?
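For reference, the swap at the serving layer is small. Roughly the following sketch, where the module path, bind address and worker count are placeholders rather than the values actually used in the PR:

    # Before (roughly): serving the Flask/WSGI app with Cheroot.
    from cheroot.wsgi import Server

    from myapp import app  # hypothetical WSGI application object

    server = Server(("0.0.0.0", 5000), app)
    try:
        server.start()
    except KeyboardInterrupt:
        server.stop()

    # After: let Gunicorn manage the worker processes instead, started from
    # the command line, e.g.:
    #   gunicorn --workers 4 --bind 0.0.0.0:5000 myapp:app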
We are still seeing the memory leak issue in production; for us it has NOT been fixed even with v3.1.0, which includes @jacobkoren1's fix (#1274).
We deployed v3.1.0 in production on Oct 11, but as shown below, the memory leak is still happening -