Image-reflector-controller restarts due to OOM Killed
Andrea-Gallicchio opened this issue · 5 comments
Describe the bug
I run Flux on AWS EKS 1.21.5. I've noticed that since the last Flux update, the image-reflector-controller pod is sometimes restarted with an OOMKilled status, even though it has fairly generous CPU and memory requests/limits. The number of Helm Releases is between 30 and 40.
- CPU Request: 0.05
- CPU Limit: 0.1
- CPU Average Usage: 0.006
- Memory Request: 384 MB
- Memory Limit: 640 MB
- Memory Average Usage: 187 MB
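For reference, something like the following Kustomize patch in the flux-system kustomization.yaml would produce these requests/limits (a sketch with an illustrative file layout based on a standard flux bootstrap, not necessarily how my cluster is configured):

# flux-system/kustomization.yaml (sketch; values mirror the numbers above)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: image-reflector-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            cpu: 50m
            memory: 384Mi
          limits:
            cpu: 100m
            memory: 640Mi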
Steps to reproduce
N/A
Expected behavior
I expect image-reflector-controller not to be restarted with OOMKilled.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v0.31.3
Flux check
► checking prerequisites
✔ Kubernetes 1.21.12-eks-a64ea69 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.21.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.22.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.18.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.25.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.5
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.24.4
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta1
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta1
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta1
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta1
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
The image-reflector-controller has nothing to do with Helm. Can you please post the output of kubectl describe deployment for the controller that runs into OOM?
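For example (assuming the default flux-system namespace used by flux bootstrap):

kubectl -n flux-system describe deployment image-reflector-controller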
Name: image-reflector-controller
Namespace: flux-system
CreationTimestamp: Thu, 23 Dec 2021 11:29:24 +0100
Labels: app.kubernetes.io/instance=flux-system
app.kubernetes.io/part-of=flux
app.kubernetes.io/version=v0.30.2
control-plane=controller
kustomize.toolkit.fluxcd.io/name=flux-system
kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations: deployment.kubernetes.io/revision: 6
Selector: app=image-reflector-controller
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=image-reflector-controller
Annotations: prometheus.io/port: 8080
prometheus.io/scrape: true
Service Account: image-reflector-controller
Containers:
manager:
Image: ghcr.io/fluxcd/image-reflector-controller:v0.18.0
Ports: 8080/TCP, 9440/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--events-addr=http://notification-controller.flux-system.svc.cluster.local./
--watch-all-namespaces=true
--log-level=info
--log-encoding=json
--enable-leader-election
Limits:
cpu: 100m
memory: 640Mi
Requests:
cpu: 50m
memory: 384Mi
Liveness: http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
RUNTIME_NAMESPACE: (v1:metadata.namespace)
Mounts:
/data from data (rw)
/tmp from temp (rw)
Volumes:
temp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: image-reflector-controller-db97c765d (1/1 replicas created)
Events: <none>
@Andrea-Gallicchio can you confirm whether just before the OOM occurred there was anything abnormal in the logs?
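The logs of the previous (OOM-killed) container can usually be retrieved with something along these lines (pod name is a placeholder):

kubectl -n flux-system get pods -l app=image-reflector-controller
kubectl -n flux-system logs <image-reflector-controller-pod> --previous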
We regularly reproduce the problem. Right before the OOM kill there is nothing unusual in the logs, just the regular scanning for new tags:
2023-12-26T06:45:47+04:00 {"level":"info","ts":"2023-12-26T02:45:47.803Z","msg":"Latest image tag for 'public.ecr.aws/gravitational/teleport-distroless' resolved to 14.2.4","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"teleport","namespace":"flux-system"},"namespace":"flux-system","name":"teleport","reconcileID":"4f4771ff-7dd2-4b8e-9803-075f0a2460c4"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.332Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"7a100653-c4f0-45c6-aac6-a15f09f01de6"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.312Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"f0a9b715-e8ed-46fd-972a-e4852b2746a2"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.296Z","msg":"no new tags found, next scan in 5m0s","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"9c4644d4-bed0-4c0a-ab90-d74fa197f61c"}
The main problem is that after the OOM kill, the container can't recover and enters the CrashLoopBackOff state.
Here are the logs from the container starting up again after the OOM kill:
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.414Z","logger":"runtime","msg":"attempting to acquire leader lease flux-system/image-reflector-controller-leader-election...\n"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.413Z","msg":"starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.309Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.308Z","logger":"setup","msg":"starting manager"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.302Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Deleting empty file: /data/000004.vlog
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Set nextTxnTs to 1657
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Discard stats nextEmptySlot: 0
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: All 0 tables opened in 0s
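The OOMKilled reason reported for the last termination can be confirmed from the pod status, for example (a sketch, assuming the default labels and namespace):

kubectl -n flux-system get pod -l app=image-reflector-controller \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'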