sse-secure-systems/connaisseur

Slow memory leak causing cosign subprocess termination

sbeck14 opened this issue · 12 comments

Describe the bug

We have Connaisseur 2.8.0 deployed in Kubernetes via a Helm install. After running normally for a while (anywhere from a few days to a week), Connaisseur starts rejecting images and producing these errors in the logs:

INFO: COSIGN output of trust root 'default' for image'[redacted]': RETURNCODE: -9; STDOUT: ; STDERR:
INFO: {'message': 'Unexpected Cosign exception for image "[redacted]": .', 'context': {'trust_data_type': 'dev.cosignproject.cosign/signature', 'stderr': '', 'image': '[redacted]', 'trust_root': 'default', 'detection_mode': False}}
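
For context, a negative return code from Python's subprocess handling means the child process was terminated by that signal; -9 corresponds to SIGKILL, which fits the out-of-memory theory described below. A minimal sketch (not Connaisseur code) showing how that return code arises:

import signal
import subprocess

# Start a long-running child, kill it with SIGKILL, and observe that the
# return code reported by subprocess is the negative signal number (-9),
# matching the RETURNCODE in the log output above.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()
print(proc.returncode)  # prints -9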

Inspecting the Connaisseur pods shows that the memory usage of the Connaisseur process slowly rises over time, indicating a memory leak.

Result of the ps command on a pod immediately after launching:

PID   USER     VSZ  RSS  COMMAND          COMMAND
    1 10001     60m  45m python           python -m connaisseur
   18 10001    1684 1072 sh               /bin/sh
   24 10001    1608  984 ps               ps -o pid,user,vsz,rss,comm,args

Result of the ps command on a pod running for ~7 days:


PID   USER     VSZ  RSS  COMMAND          COMMAND
    1 10001    317m 302m python           python -m connaisseur
 4827 10001    1692 1060 sh               /bin/sh
 4835 10001    1616  984 ps               ps -o pid,user,vsz,rss,comm,args

Our theory is that once memory usage approaches the pod's memory limit, the cosign sub-process gets killed immediately after it spawns because it pushes the pod over the limit. Afterwards, the main Connaisseur process reports the error above, and memory usage keeps growing until the main process itself is finally killed and the pod restarts.

Expected behavior

The Connaisseur process's memory usage does not grow and stays roughly constant over time.

Optional: To reproduce

I haven't been able to reliably reproduce this without just waiting for the memory to grow over time. I tried restarting the deployment and hitting the API with an AdmissionReview 3,000 times in a row, but the memory usage only increased by about 10 MB. I also tried this with different types of AdmissionReview (static verifier, cosign, and automatic child approval) with similar results.
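
For reference, a rough sketch of that kind of stress loop, assuming a port-forwarded webhook reachable at https://localhost:8443/mutate (the path, address, and AdmissionReview payload below are placeholders, not the exact setup used here):

import json
import ssl
import urllib.request

# Placeholder endpoint and payload: the real webhook path, service address
# and AdmissionReview contents depend on the cluster and Connaisseur setup.
URL = "https://localhost:8443/mutate"
REVIEW = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "00000000-0000-0000-0000-000000000000",
        "kind": {"group": "", "version": "v1", "kind": "Pod"},
        "resource": {"group": "", "version": "v1", "resource": "pods"},
        "operation": "CREATE",
        "object": {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {"name": "stress-test"},
            "spec": {"containers": [{"name": "c", "image": "registry.example.com/app:latest"}]},
        },
    },
}

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # the webhook uses a cluster-internal certificate

for i in range(3000):
    req = urllib.request.Request(
        URL,
        data=json.dumps(REVIEW).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        resp.read()
    if i % 500 == 0:
        # compare the pod's RSS (e.g. with the ps command above) as the loop runs
        print(f"sent {i} admission reviews")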

Optional: Versions (please complete the following information as relevant):

  • Kubernetes Cluster: Kubernetes v1.21.9
  • Container registry: Gitlab Enterprise
  • Connaisseur: v2.8.0

Here are our policies & validators from our helm values.yaml:

validators:
  - name: default
    type: static
    approve: true
  - name: gitlab
    type: cosign
    auth:
      secret_name: gitlab
    trust_roots:
    - name: default
      key: awskms:///[redacted]
policy:
- pattern: "*:*"
- pattern: "[redacted domain]/*:*"
  validator: gitlab
  with:
    trust_root: "*"
    threshold: 1

I'll try to create a setup to stress test Connaisseur and see whether I can replicate the problem. Other than that, with Connaisseur 3.0 we switched the parallelization from concurrent.futures back to asyncio, maybe that did the trick 🤷
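
For illustration, the asyncio-based approach looks roughly like the sketch below (not the actual Connaisseur code; the cosign arguments are placeholders): each cosign call runs as an asyncio subprocess on a single event loop instead of being dispatched through a concurrent.futures pool.

import asyncio

# Illustrative sketch only: invoke cosign as an asyncio subprocess and
# verify several images concurrently on one event loop, rather than
# dispatching blocking subprocess calls through a thread pool.
async def verify(image: str):
    proc = await asyncio.create_subprocess_exec(
        "cosign", "verify", image,  # placeholder arguments
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    return proc.returncode, stdout, stderr

async def verify_all(images):
    return await asyncio.gather(*(verify(i) for i in images))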

I'll see what I can do.

This is great news! I’ll test out version 3.0 in our environment and report back. Thank you!

@phbelitz The issue does not appear to have been resolved with version 3.0; we are still experiencing the memory leak after the update.

Thought as much ... I'll try to look into it, but can't make promises. On the plus side: we are thinking about a rewrite in golang, where things hopefully will be a lot more efficient. This isn't set in stone yet, as a rewrite is quite costly, but I will keep you posted on any news :)

I'm experiencing the same memory leak issue in a production environment. I'm currently using Connaisseur version 3.0.0. This memory leak is a critical concern, as it can lead to serious malfunctions and impact the stability of our production environment. I strongly urge you to prioritize this issue, please!

Note that increasing the interval of the readiness and liveness probes results in a slower memory leak. Since these endpoints have no specific functionality (they simply return a 200 status), I suspect the issue could be related to Flask, or maybe even to Cheroot, which wraps it.

Update - I changed cheroot to gunicorn and the memory leak was resolved.
I also changed the usage of asyncio.run() as mentioned here
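
For reference, the kind of asyncio.run() change meant here, sketched under the assumption that a synchronous request handler previously called asyncio.run() for every request (this is not the exact code from the fix): keep one long-lived event loop instead of creating and tearing down a loop per request.

import asyncio
import threading

# Sketch: run a single long-lived event loop in a background thread and
# submit coroutines to it from synchronous (WSGI) request handlers,
# instead of calling asyncio.run() inside every handler, which creates
# and destroys an event loop on each request.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def run_async(coro):
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()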

@jacobkoren1 would you be so kind as to open a PR with your fixes?

@jacobkoren1 would you be so kind as to open a PR with your fixes?
Sure

dbbhat commented

Update - I changed cheroot to gunicorn and the memory leak was resolved.

In the considerations mentioned here - https://github.com/sse-secure-systems/connaisseur/blob/92affcfad4ce15a22328698d3af4a1b726217f65/docs/adr/ADR-7_wsgi-server.md#considered-options - I see that gunicorn was considered as an option but dropped due to poor performance compared with the other options. Do we plan to redo the tests and switch to gunicorn then, @phbelitz?

We are still seeing the memory leak in production, and for us it has NOT been fixed even with v3.1.0, which includes @jacobkoren1's fix (#1274).

We deployed v3.1.0 in production on Oct 11, but as shown below, the memory leak is still happening:

[screenshot: pod memory usage continuing to grow after the v3.1.0 deployment]