Dkron can't be safely used in k8s at the moment
ivan-kripakov-m10 opened this issue · 16 comments
hi!
Is your feature request related to a problem? Please describe.
At the moment Dkron cannot be used safely in k8s because Dkron servers cannot handle IP changes.
To reproduce, deploy Dkron with the current Helm chart, shut down the cluster, and redeploy it.
The nodes will try to reconnect to each other using the old IPs, and they never succeed.
Describe the solution you'd like
I think the consul-like approach can be used: hashicorp/consul#3403
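One way to read "consul-like approach" is to identify servers by a stable ID and look up their current address at connect time, instead of trusting the address stored when they first joined. The following is a minimal sketch of that idea only. `AddressBook`, `Update`, `Resolve`, and the server ID `dkron-server-0` are hypothetical names for illustration; this is not Dkron's or Consul's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// AddressBook maps a stable server ID to the most recently seen address.
// Hypothetical type, for illustration only.
type AddressBook struct {
	mu    sync.RWMutex
	addrs map[string]string
}

// NewAddressBook returns an empty address book.
func NewAddressBook() *AddressBook {
	return &AddressBook{addrs: make(map[string]string)}
}

// Update records the address a member announced when it (re)joined.
func (b *AddressBook) Update(id, addr string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.addrs[id] = addr
}

// Resolve returns the latest known address for id. If the ID has never
// been seen, it falls back to the address that was stored earlier.
func (b *AddressBook) Resolve(id, stored string) string {
	b.mu.RLock()
	defer b.mu.RUnlock()
	if addr, ok := b.addrs[id]; ok {
		return addr
	}
	return stored
}

func main() {
	book := NewAddressBook()
	book.Update("dkron-server-0", "10.0.2.35:6868")
	// The pod restarts and comes back under the same ID with a new IP.
	book.Update("dkron-server-0", "10.0.7.12:6868")
	// Peers connect to the current address, not the stale stored one.
	fmt.Println(book.Resolve("dkron-server-0", "10.0.2.35:6868"))
}
```

With a lookup like this, a redeployed server keeps its identity in the cluster even though its IP changed, which is the failure case described above.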
Additional context
I'm not sure this is the only problem with Dkron in k8s (there is a hypothesis that two TODOs, one and two, also need to be resolved, but I'm not sure; I will share updates if any appear).
If you know of any other problems, I would suggest making a series of improvements aimed at supporting Dkron in k8s.
I think many people would like to have this option (I have seen many issues that are related to it in one way or another).
There is also a significant change in the Dkron Helm chart:
distribworks/dkron-helm#7
I used it to test a Dkron 3.2.6 build with the commits from #1446.
@vcastellm are you going to merge it too?
We are of course looking forward to Dkron v4, but wouldn't it be a good idea to release a patch version of Dkron 3.2.x (with #1446) so that Dkron can be used in k8s now?
@ivan-kripakov-m10 it would be possible to release a patch version for v3 but I don't see any advantage of it. Can you elaborate on possible use cases of v3 vs v4?
Hey, can you try with v4.0.0-beta? This should be fixed by #1446.
@vcastellm not sure if I'm supposed to use any extra flags, but 4.0.0-beta3 does not fix my issue #1253 (which I believe is similar to this one).
After killing the server (to make it restart), the agents log the following:
## initial join, all good
time="2024-02-11T13:36:59Z" level=info msg="Adding LAN adding server" node=sfpi4 server=dkron1
time="2024-02-11T13:36:59Z" level=info msg="agent: Received event" event=member-update node=pi4
## server (dkron1) killed, and removed from list, never retried
time="2024-02-11T13:49:49Z" level=info msg="agent: Received event" event=member-update node=pi4
time="2024-02-11T13:49:49Z" level=info msg="agent: Received event" event=member-failed node=pi4
time="2024-02-11T13:49:49Z" level=info msg="removing server dkron1 (Addr: 10.0.2.35:6868) (DC: dc1)" node=pi4
Docker swarm compose (to illustrate configuration)
services:
  server:
    image: dkron/dkron:4.0.0-beta3
    command: agent
    environment:
      #DKRON_NODE_NAME: "{{.Node.Hostname}}"
      DKRON_NODE_NAME: dkron1
      DKRON_DATA_DIR: /ext/data
      DKRON_SERVER: 1
      DKRON_BIND_ADDR: tasks.server:8946
      DKRON_BOOTSTRAP_EXPECT: 1
    deploy:
      mode: replicated
      replicas: 1
  agents:
    image: dkron/dkron:4.0.0-beta3
    command: agent
    environment:
      DKRON_NODE_NAME: "{{.Node.Hostname}}"
      DKRON_RETRY_JOIN: tasks.server
      DKRON_BIND_ADDR: '{{`{{ GetInterfaceIP "eth0" }}:8946`}}'
      DKRON_TAG: 'arch={{.Node.Platform.Architecture}} server=false'
    deploy:
      mode: global
@vcastellm For me the main advantage is speed. From what I gather, v4 will bring numerous changes to both the user interface and the backend. Shipping #1446 in a patch release so that users can run Dkron in k8s seems simpler and quicker than the extensive v4 update.
If anybody else is able to set up Dkron in a k8s cluster, that would help too, since we would then have at least two independent confirmations that #1446 is a correct change.
I converted a Dkron test instance with 3 servers and 2 agents to version 4.0.0-beta4. After that I deleted various pods several times, restarted the servers' StatefulSet, and so on. In all cases, the new pods reconnected correctly to the Dkron cluster, IP changes were handled, and leader election worked.
Hi,
we tried dkron/dkron:4.0.0-beta4 on an AKS cluster with 3 server nodes.
Repeated restarts of the nodes always resulted in a working cluster with an elected leader.
So the issue finally seems to be solved!
Thanks to @ivan-kripakov-m10 for his work. I hope we can see this released in a stable version soon, and I also hope a patch will be available for version 3.
Is there a helm chart for V4?
@fabltd you can use the Helm chart from the main branch here: https://github.com/distribworks/dkron-helm
Does this install V4? Looking at the code, it seems to be V3.
You can change the Dkron version here: https://github.com/distribworks/dkron-helm/blob/c57a99f7cf75d1f49e6290a6351280c57ce21356/dkron/values.yaml#L9
hi @vcastellm!
It seems like the issue can be closed