distribworks/dkron

Dkron can't be safely used in k8s at the moment

ivan-kripakov-m10 opened this issue · 16 comments

hi!

Is your feature request related to a problem? Please describe.
At the moment dkron cannot be safely used in k8s because dkron servers cannot handle IP changes.
To reproduce, deploy dkron using the current Helm chart, shut down the cluster, and redeploy it.
The nodes will try to reconnect to each other using their old IPs, and this never succeeds.
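The reproduction steps above could be sketched roughly like this (the release name, StatefulSet name, and replica count are assumptions for illustration, not taken from the actual chart):

```shell
# Deploy dkron via the Helm chart (chart path and release name are assumptions)
helm install dkron ./dkron-helm

# Simulate a full shutdown by scaling the server StatefulSet to zero...
kubectl scale statefulset dkron-server --replicas=0

# ...then redeploy; the new pods come up with new IPs,
# but the existing peers keep retrying the old addresses and never rejoin
kubectl scale statefulset dkron-server --replicas=3
```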

Describe the solution you'd like
I think a Consul-like approach could be used: hashicorp/consul#3403

Additional context
I'm not sure this is the only problem with dkron in k8s (there is a hypothesis that you also need to resolve the TODOs - one and two - but I'm not sure; I will share updates if any appear).
If you know of any other problems, I would suggest making a series of improvements aimed at supporting dkron in k8s.
I think many people would welcome this (I have seen many issues that relate to it in one way or another).

Possibly fixed in #1446

fopina commented

this looks similar to #1253
is it also fixed by #1446 ?
Looking forward to updating to v4 and testing it!

Hey, can you try with v4.0.0-beta? This should be fixed by #1446.

Hey, I have already tested #1446 (as I wrote in my PR).
If anybody else is able to set up Dkron in a k8s cluster, that would be helpful, as we would then have at least two pieces of evidence that #1446 is the correct change.

There is also a significant change to the dkron k8s Helm chart:
distribworks/dkron-helm#7
I tested a dkron 3.2.6 build with the commits from #1446 using it.

@vcastellm are you going to merge it too?

We are certainly looking forward to Dkron v4, but wouldn't it be a good idea to release a patch version of Dkron 3.2.x (with #1446) so that Dkron can be used in k8s now?

@ivan-kripakov-m10 it would be possible to release a patch version for v3, but I don't see much advantage in it. Can you elaborate on possible use cases of v3 vs v4?

fopina commented

Hey, can you try with v4.0.0-beta? This should be fixed by #1446.

@vcastellm not sure if I'm supposed to use any extra flags, but 4.0.0-beta3 does not fix my issue #1253 (which I believe to be similar to this one).

After killing the server (to make it restart), the agents report logs like these:

## initial join, all good
time="2024-02-11T13:36:59Z" level=info msg="Adding LAN adding server" node=sfpi4 server=dkron1
time="2024-02-11T13:36:59Z" level=info msg="agent: Received event" event=member-update node=pi4

## server (dkron1) killed, and removed from list, never retried
time="2024-02-11T13:49:49Z" level=info msg="agent: Received event" event=member-update node=pi4
time="2024-02-11T13:49:49Z" level=info msg="agent: Received event" event=member-failed node=pi4
time="2024-02-11T13:49:49Z" level=info msg="removing server dkron1 (Addr: 10.0.2.35:6868) (DC: dc1)" node=pi4

Docker Swarm compose file (to illustrate the configuration):

services:
  server:
    image: dkron/dkron:4.0.0-beta3
    command: agent 
    environment:
      #DKRON_NODE_NAME: "{{.Node.Hostname}}"
      DKRON_NODE_NAME: dkron1
      DKRON_DATA_DIR: /ext/data
      DKRON_SERVER: 1
      DKRON_BIND_ADDR: tasks.server:8946
      DKRON_BOOTSTRAP_EXPECT: 1
    deploy:
      mode: replicated
      replicas: 1
  agents:
    image: dkron/dkron:4.0.0-beta3
    command: agent
    environment:
      DKRON_NODE_NAME: "{{.Node.Hostname}}"
      DKRON_RETRY_JOIN: tasks.server
      DKRON_BIND_ADDR: '{{`{{ GetInterfaceIP "eth0" }}:8946`}}'
      DKRON_TAG: 'arch={{.Node.Platform.Architecture}} server=false'
    deploy:
      mode: global

@vcastellm Speed is the primary concern for me. From what I gather, version 4 will bring numerous changes to both the user interface and the backend. Merging #1446 and rolling out a release so that users can run dkron in k8s seems a more straightforward and quicker task compared to the extensive v4 update.

If anybody else is able to set up Dkron in a k8s cluster, that would be helpful, as we would then have at least two pieces of evidence that #1446 is the correct change.

I converted a Dkron test instance with 3 servers and 2 agents to version 4.0.0-beta4. After that, I deleted various pods several times, restarted the servers' StatefulSet, and so on. In all cases, the new pods reconnected correctly to the Dkron cluster, IP changes were handled, and leader election worked.
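The verification steps described above might look like this (a sketch; the pod and StatefulSet names are assumptions, and the leader check assumes the agent's HTTP API is reachable on its default port):

```shell
# Delete an arbitrary server pod and let k8s recreate it with a new IP
kubectl delete pod dkron-server-1

# Restart the whole server StatefulSet and wait for it to settle
kubectl rollout restart statefulset dkron-server
kubectl rollout status statefulset dkron-server

# Check that the cluster still has an elected leader via the REST API
kubectl exec dkron-server-0 -- wget -qO- http://localhost:8080/v1/leader
```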

jaccky commented

Hi,
we tried dkron/dkron:4.0.0-beta4 on an AKS cluster with 3 server nodes.
Various restarts of the nodes always resulted in a working cluster with an elected leader.
So the issue finally seems to be solved!
Thanks to @ivan-kripakov-m10 for his work. I hope we will see this released in a stable version soon, and I also hope a patch will be available for version 3.

Is there a Helm chart for v4?

@fabltd you can use the Helm chart from the main branch here: https://github.com/distribworks/dkron-helm
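Installing the chart straight from the repository's main branch might look like this (a sketch; the chart's location inside the repo and the release name are assumptions):

```shell
# Clone the chart repo and install from the local checkout
git clone https://github.com/distribworks/dkron-helm
helm install dkron ./dkron-helm
```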

Does this install v4? Looking at the code, it's v3?

hi @vcastellm!
It seems like this issue can be closed.