Image-reflector-controller restarts due to OOM Killed
Andrea-Gallicchio opened this issue · 5 comments
Describe the bug
I run Flux on AWS EKS 1.21.5. I've noticed that since the last Flux update, the image-reflector-controller pod is sometimes restarted with an OOMKilled status, even though it has fairly generous CPU and memory requests/limits. The number of Helm Releases is between 30 and 40.
- CPU Request: 0.05
- CPU Limit: 0.1
- CPU Average Usage: 0.006
- Memory Request: 384 MB
- Memory Limit: 640 MB
- Memory Average Usage: 187 MB
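For reference, something like the following Kustomize patch in the flux-system kustomization.yaml would produce these requests/limits (a sketch with an illustrative file layout based on a standard flux bootstrap, not necessarily how my cluster is configured):

# flux-system/kustomization.yaml (sketch; values mirror the numbers above)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: image-reflector-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            cpu: 50m
            memory: 384Mi
          limits:
            cpu: 100m
            memory: 640Mi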
Steps to reproduce
N/A
Expected behavior
I expect image-reflector-controller not to be restarted with OOMKilled.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v0.31.3
Flux check
► checking prerequisites
✔ Kubernetes 1.21.12-eks-a64ea69 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.21.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.22.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.18.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.25.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.5
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.24.4
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta1
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta1
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta1
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta1
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
The image-reflector-controller has nothing to do with Helm. Can you please post the output of kubectl describe deployment for the controller that runs into OOM?
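For example (assuming the default flux-system namespace used by flux bootstrap):

kubectl -n flux-system describe deployment image-reflector-controller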
Name: image-reflector-controller
Namespace: flux-system
CreationTimestamp: Thu, 23 Dec 2021 11:29:24 +0100
Labels: app.kubernetes.io/instance=flux-system
app.kubernetes.io/part-of=flux
app.kubernetes.io/version=v0.30.2
control-plane=controller
kustomize.toolkit.fluxcd.io/name=flux-system
kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations: deployment.kubernetes.io/revision: 6
Selector: app=image-reflector-controller
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=image-reflector-controller
Annotations: prometheus.io/port: 8080
prometheus.io/scrape: true
Service Account: image-reflector-controller
Containers:
manager:
Image: ghcr.io/fluxcd/image-reflector-controller:v0.18.0
Ports: 8080/TCP, 9440/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--events-addr=http://notification-controller.flux-system.svc.cluster.local./
--watch-all-namespaces=true
--log-level=info
--log-encoding=json
--enable-leader-election
Limits:
cpu: 100m
memory: 640Mi
Requests:
cpu: 50m
memory: 384Mi
Liveness: http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
RUNTIME_NAMESPACE: (v1:metadata.namespace)
Mounts:
/data from data (rw)
/tmp from temp (rw)
Volumes:
temp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: image-reflector-controller-db97c765d (1/1 replicas created)
Events: <none>
@Andrea-Gallicchio can you confirm whether just before the OOM occurred there was anything abnormal in the logs?
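The logs of the previous (OOM-killed) container can usually be retrieved with something along these lines (pod name is a placeholder):

kubectl -n flux-system get pods -l app=image-reflector-controller
kubectl -n flux-system logs <image-reflector-controller-pod> --previous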
We regularly reproduce the problem. Right before the OOM kill there is nothing unusual in the logs, just the regular scanning for new tags:
2023-12-26T06:45:47+04:00 {"level":"info","ts":"2023-12-26T02:45:47.803Z","msg":"Latest image tag for 'public.ecr.aws/gravitational/teleport-distroless' resolved to 14.2.4","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"teleport","namespace":"flux-system"},"namespace":"flux-system","name":"teleport","reconcileID":"4f4771ff-7dd2-4b8e-9803-075f0a2460c4"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.332Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"7a100653-c4f0-45c6-aac6-a15f09f01de6"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.312Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"f0a9b715-e8ed-46fd-972a-e4852b2746a2"}
2023-12-26T06:45:41+04:00 {"level":"info","ts":"2023-12-26T02:45:41.296Z","msg":"no new tags found, next scan in 5m0s","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"9c4644d4-bed0-4c0a-ab90-d74fa197f61c"}
The main problem is that after the OOM kill, the container can't recover and enters the CrashLoopBackOff state.
Here are the logs from the container starting up again after the OOM kill:
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.414Z","logger":"runtime","msg":"attempting to acquire leader lease flux-system/image-reflector-controller-leader-election...\n"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.413Z","msg":"starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.309Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.308Z","logger":"setup","msg":"starting manager"}
2023-12-26T06:48:44+04:00 {"level":"info","ts":"2023-12-26T02:48:44.302Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Deleting empty file: /data/000004.vlog
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Set nextTxnTs to 1657
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: Discard stats nextEmptySlot: 0
2023-12-26T06:48:44+04:00 badger 2023/12/26 02:48:44 INFO: All 0 tables opened in 0s
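The OOMKilled reason reported for the last termination can be confirmed from the pod status, for example (a sketch, assuming the default labels and namespace):

kubectl -n flux-system get pod -l app=image-reflector-controller \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'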