dragonflyoss/Dragonfly2

Scheduler Restart Causes dfdaemon and seed-peer to Report resolve Errors

Closed this issue · 3 comments

Bug report:

When the scheduler restarts, the dfdaemon and seed-peer report errors that they cannot resolve the scheduler, and the scheduler likewise reports that it cannot resolve the seed-peer.

In the manager database, the record for the deleted scheduler remains in an active state, while the newly created scheduler stays in an inactive state.
(screenshot: scheduler records in the manager database)

Here are the relevant logs:

  1. scheduler log:
    scheduler.log

  2. dfdaemon log:
    dfdaemon.log

  3. seed-peer log:
    seed-peer.log

  4. manager:
    manager.log

Expected behavior:

All components (scheduler, dfdaemon, and seed-peer) should be able to resolve each other correctly and operate without errors after the scheduler restarts.

How to reproduce it:

  1. Deploy Dragonfly2 using Helm
  2. Restart the scheduler Pod

Environment:

  • Dragonfly version: 2.1.30
  • OS:
  • Kernel (e.g. uname -a):
  • Others:

@yantingqiu Don't use the same hostname.

@gaius-qi When deploying Dragonfly2 in a containerized environment using StatefulSet, the Pods will retain their original hostnames after restarting, which seems difficult to change. Do you have any suggestions?
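Since a StatefulSet Pod keeps its hostname across restarts, one direction (purely a sketch, not Dragonfly's actual API: the `instance_id` helper and its inputs are hypothetical) is to derive the instance identifier from more than the hostname, e.g. hostname plus Pod IP, so the restarted Pod registers as a distinct instance instead of colliding with the stale record:

```python
import hashlib

def instance_id(hostname: str, ip: str) -> str:
    """Hypothetical helper: derive an instance ID from hostname + IP, so a
    restarted Pod with the same hostname but a new IP is treated as a new
    instance rather than colliding with the stale database record."""
    return hashlib.sha256(f"{hostname}-{ip}".encode()).hexdigest()[:16]

# A restarted StatefulSet Pod keeps its hostname but usually gets a new IP:
old = instance_id("scheduler-0", "10.244.1.7")
new = instance_id("scheduler-0", "10.244.1.9")
assert old != new  # distinct IDs let the manager retire the old record
```

With such a scheme, the manager could mark any record whose ID no longer re-registers as inactive, which matches the behavior the reporter expects below.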

Additionally, the Pods that are killed should be marked as inactive, but the manager keeps updating their records in the database to active.