Image-automation-controller crashes intermittently when interacting with a libgit2 source
mathijshoogland opened this issue · 7 comments
I'm running Flux on an AKS cluster on Azure (well, actually multiple clusters). The source controller is connected to an Azure DevOps repository using the libgit2 implementation (this is required by Azure DevOps). My controller versions are:
flux: v0.31.4
helm-controller: v0.22.2
image-automation-controller: v0.24.1
image-reflector-controller: v0.20.0
kustomize-controller: v0.27.0
notification-controller: v0.25.1
source-controller: v0.26.1
Across all my deployments, the image-automation-controller crashes every now and then. I'm a bit puzzled about why exactly this is happening, but I think it is a bug. This is the stack trace of the latest crash:
{"level":"info","ts":"2022-08-17T09:01:54.448Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-08-17T09:01:54.449Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-08-17T09:01:54.449Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":":8080"}
{"level":"info","ts":"2022-08-17T09:01:54.449Z","msg":"Starting server","kind":"health probe","addr":":9440"}
I0817 09:01:54.550892 1 leaderelection.go:248] attempting to acquire leader lease flux-system/image-automation-controller-leader-election...
I0817 09:02:30.954933 1 leaderelection.go:258] successfully acquired lease flux-system/image-automation-controller-leader-election
{"level":"info","ts":"2022-08-17T09:02:30.955Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImageUpdateAutomation"}
{"level":"info","ts":"2022-08-17T09:02:30.955Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-08-17T09:02:30.955Z","logger":"controller.imageupdateautomation","msg":"Starting EventSource","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","source":"kind source: *v1beta1.ImagePolicy"}
{"level":"info","ts":"2022-08-17T09:02:30.955Z","logger":"controller.imageupdateautomation","msg":"Starting Controller","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation"}
{"level":"info","ts":"2022-08-17T09:02:31.056Z","logger":"controller.imageupdateautomation","msg":"Starting workers","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","worker count":4}
{"level":"info","ts":"2022-08-17T09:02:38.937Z","logger":"controller.imageupdateautomation","msg":"pushed commit to origin","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"plugins-iua-ress","namespace":"flux-system","revision":"28250e159c66c1228751d64be4ac4731a4113c9e","branch":"main"}
encoding/json.(*Decoder).refill(0xc0004baa00)
encoding/json/stream.go:165 +0x17f
encoding/json.(*Decoder).readValue(0xc0004baa00)
encoding/json/stream.go:140 +0xbb
encoding/json.(*Decoder).Decode(0xc0004baa00, {0x397fc0, 0xc00000f770})
encoding/json/stream.go:63 +0x78
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc0005d0db0, {0xc000174c00, 0x400, 0x400})
k8s.io/apimachinery@v0.24.3/pkg/util/framer/framer.go:152 +0x19c
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc000092280, 0xc0005521d0?, {0x847ff0, 0xc0005ce280})
k8s.io/apimachinery@v0.24.3/pkg/runtime/serializer/streaming/streaming.go:77 +0xa7
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00031c7e0)
k8s.io/client-go@v0.24.3/rest/watch/decoder.go:49 +0x4f
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0005ce240)
k8s.io/apimachinery@v0.24.3/pkg/watch/streamwatcher.go:105 +0xe5
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
k8s.io/apimachinery@v0.24.3/pkg/watch/streamwatcher.go:76 +0x130
goroutine 1926 [chan receive, 5 minutes]:
k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1()
k8s.io/client-go@v0.24.3/tools/record/event.go:304 +0x73
created by k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher
k8s.io/client-go@v0.24.3/tools/record/event.go:302 +0x8c
goroutine 30622 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0002f5690, 0x0)
runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0xc00005df98?)
sync/cond.go:56 +0x8c
golang.org/x/crypto/ssh.(*mux).Wait(0xc0002b8d20)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/mux.go:110 +0x99
golang.org/x/crypto/ssh.NewClient.func1()
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/client.go:62 +0x2a
created by golang.org/x/crypto/ssh.NewClient
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/client.go:61 +0x176
goroutine 30617 [IO wait]:
internal/poll.runtime_pollWait(0x7f0d47856a40, 0x72)
runtime/netpoll.go:302 +0x89
internal/poll.(*pollDesc).wait(0xc002a35780?, 0xc001944000?, 0x0)
internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Read(0xc002a35780, {0xc001944000, 0x1000, 0x1000})
internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc002a35780, {0xc001944000?, 0x822b103af1523324?, 0x3fc6b213fdccbde3?})
net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc000010020, {0xc001944000?, 0x0?, 0xc00189be20?})
net/net.go:183 +0x45
bufio.(*Reader).Read(0xc002182900, {0xc002b39521, 0x5, 0x20?})
bufio/bufio.go:236 +0x1b4
io.ReadAtLeast({0x83cd00, 0xc002182900}, {0xc002b39521, 0x5, 0x5}, 0x5)
io/io.go:331 +0x9a
io.ReadFull(...)
io/io.go:350
golang.org/x/crypto/ssh.(*streamPacketCipher).readCipherPacket(0xc002b39500, 0x9, {0x83cd00, 0xc002182900})
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/cipher.go:153 +0x5b
golang.org/x/crypto/ssh.(*connectionState).readPacket(0xc001b559e0, 0x3?)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/transport.go:130 +0x42
golang.org/x/crypto/ssh.(*transport).readPacket(0xc001b559e0)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/transport.go:114 +0x2c
golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc001b1ac60, 0x0)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/handshake.go:374 +0x3d
golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc001b1ac60)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/handshake.go:197 +0x45
created by golang.org/x/crypto/ssh.newClientTransport
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/handshake.go:134 +0x1f6
goroutine 30624 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0002f5810, 0x0)
runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x2a3e600?)
sync/cond.go:56 +0x8c
golang.org/x/crypto/ssh.(*buffer).Read(0xc002bc18e0, {0xc0017e0000, 0x8000, 0x8000})
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/buffer.go:94 +0x20a
golang.org/x/crypto/ssh.(*channel).ReadExtended(0xc000b67c80, {0xc0017e0000?, 0x8000?, 0x2ff600?}, 0x10201?)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/channel.go:351 +0x99
golang.org/x/crypto/ssh.(*channel).Read(0xc002182438?, {0xc0017e0000?, 0xc000010270?, 0xc000686f40?})
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/channel.go:528 +0x26
io.copyBuffer({0x840280, 0xc000010270}, {0x7f0d475fa9f8, 0xc000b67c80}, {0x0, 0x0, 0x0})
io/io.go:426 +0x1b2
io.Copy(...)
io/io.go:385
github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Action.func3()
github.com/fluxcd/source-controller@v0.26.0/pkg/git/libgit2/managed/ssh.go:260 +0x18d
created by github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Action
github.com/fluxcd/source-controller@v0.26.0/pkg/git/libgit2/managed/ssh.go:229 +0xa25
goroutine 30621 [chan receive]:
golang.org/x/crypto/ssh.(*Client).handleChannelOpens(0xc0021d69b0, 0xc001b1ac60?)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/client.go:151 +0x4d
created by golang.org/x/crypto/ssh.NewClient
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/client.go:60 +0x139
goroutine 30623 [chan receive]:
golang.org/x/crypto/ssh.(*Session).wait(0x0?, 0x0?)
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/session.go:424 +0x6c
golang.org/x/crypto/ssh.newSession.func1()
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/session.go:594 +0x26
created by golang.org/x/crypto/ssh.newSession
golang.org/x/crypto@v0.0.0-20220722155217-630584e8d5aa/ssh/session.go:593 +0xf9
Stream closed EOF for flux-system/image-automation-controller-597f457545-vm9lf (manager)
I've also run the image-automation-controller with the log level set to debug and captured a stack trace (unfortunately not the entire thing); the relevant log output is below.
It seems that the controller fails because the SSH connection fails, but I'm not sure whether that alone would trigger the crash (see the sketch after the log excerpt).
"level":"error","ts":"2022-08-17T09:41:16.902Z","msg":"cannot fetch remote branch: ssh: handshake failed: read tcp 10.1.0.88:56620->40.74.28.9:22: read: connection reset by peer","name":"backend-services-iua-user-api","namespace":"flux-system","reconciler kind":"ImageUpdateAutomation","annotations":null,"error":"error","stacktrace":"github.com/fluxcd/pkg/runtime/events.(*Recorder).Eventf\n\tgithub.com/fluxcd/pkg/runtime@v0.16.2/events/recorder.go:113\ngithub.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).event\n\tgithub.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:923\ngithub.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile.func1\n\tgithub.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:173\ngithub.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile\n\tgithub.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"debug","ts":"2022-08-17T09:41:16.902Z","logger":"events","msg":"Warning","object":{"kind":"ImageUpdateAutomation","namespace":"flux-system","name":"backend-services-iua-user-api","uid":"3aedae84-1e1d-4765-b718-2daeb7298b11","apiVersion":"image.toolkit.fluxcd.io/v1beta1","resourceVersion":"126653846"},"reason":"error","message":"cannot fetch remote branch: ssh: handshake failed: read tcp 10.1.0.88:56620->40.74.28.9:22: read: connection reset by peer"}
{"level":"error","ts":"2022-08-17T09:41:16.932Z","logger":"controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"backend-services-iua-user-api","namespace":"flux-system","error":"cannot fetch remote branch: ssh: handshake failed: read tcp 10.1.0.88:56620->40.74.28.9:22: read: connection reset by peer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"debug","ts":"2022-08-17T09:41:17.683Z","logger":"controller.imageupdateautomation","msg":"fetching git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"backend-services-iua-user-api","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-apps"}}
{"level":"debug","ts":"2022-08-17T09:41:17.683Z","logger":"controller.imageupdateautomation","msg":"attempting to clone git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"backend-services-iua-user-api","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-apps"},"ref":{"branch":"main"},"working":"/tmp/flux-system-flux-apps2061718858"}
{"level":"debug","ts":"2022-08-17T09:41:19.051Z","logger":"controller.imageupdateautomation","msg":"updating with setters according to image policies","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"backend-services-iua-user-api","namespace":"flux-system","count":52,"manifests-path":"/tmp/flux-system-flux-apps2061718858/apps/backend-services/user-api/overlays/dev"}
{"level":"debug","ts":"2022-08-17T09:41:19.053Z","logger":"controller.imageupdateautomation","msg":"ran updates to working dir","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"backend-services-iua-user-api","namespace":"flux-system","working":"/tmp/flux-system-flux-apps2061718858"}
{"level":"debug","ts":"2022-08-17T09:41:19.062Z","logger":"controller.imageupdateautomation","msg":"no changes made in working directory; no commit","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"backend-services-iua-user-api","namespace":"flux-system"}
@mathijshoogland thanks for reporting this issue.
How often is it happening? Did this start after you upgraded to the latest version? And if so, what was the last version of the controller you were using?
I had the issue before upgrading Flux from v0.31.5 to v0.32.0 as well; it happens more on clusters that have many ImageUpdateAutomation resources. Today there were 20 restarts on a cluster with 75 IUAs (each with an interval of 1m0s).
I'm pretty confident it has something to do with the source controller failing to fetch every now and then, but I'm not sure how to investigate further.
What does the resource consumption (CPU/memory) look like in the clusters where you observe this?
Is there any correlation between peak resource consumption and crash times?
I've refactored my codebase, bringing the number of ImageUpdateAutomations down from roughly 70 to 5. This reduces both the restarts and the high resource usage. I still see an error in the logs every now and then, but in contrast to the old situation the controller does not crash:
{"level":"error","ts":"2022-08-17T09:41:16.902Z","msg":"cannot fetch remote branch: ssh: handshake failed: read tcp 10.1.0.88:56620->40.74.28.9:22: read: connection reset by peer","name":"backend-services-iua-user-api","namespace":"flux-system","reconciler kind":"ImageUpdateAutomation","annotations":null,"error":"error","stacktrace":"github.com/fluxcd/pkg/runtime/events.(*Recorder).Eventf\n\tgithub.com/fluxcd/pkg/runtime@v0.16.2/events/recorder.go:113\ngithub.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).event\n\tgithub.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:923\ngithub.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile.func1\n\tgithub.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:173\ngithub.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile\n\tgithub.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
Should I close this issue?
Based on what you said and the graph you shared, it looks like the controller was exceeding its default resource limits, which was potentially causing the crashes.
The new error seems to be network-related, which is orthogonal to the initially reported problem, so feel free to close the issue.