Scheduler Restart Causes dfdaemon and seed-peer to Report resolve Errors
Closed this issue · 3 comments
Bug report:
When the scheduler restarts, the dfdaemon and seed-peer will report errors saying they can't resolve the scheduler. At the same time, the scheduler will also report errors saying it can't resolve the seed-peer.
In the database, the deleted scheduler remains in an active state, while the newly created scheduler stays in an inactive state.
Here are the relevant logs:
- scheduler log: scheduler.log
- dfdaemon log: dfdaemon.log
- seed-peer log: seed-peer.log
- manager: manager.log
Expected behavior:
All components (scheduler, dfdaemon, and seed-peer) should be able to resolve each other correctly and operate without errors after the scheduler restarts.
How to reproduce it:
- Deploy Dragonfly2 using Helm
- Restart the scheduler Pod
Environment:
- Dragonfly version: 2.1.30
- OS:
- Kernel (e.g. uname -a):
- Others:
@yantingqiu Don't use the same hostname.
@gaius-qi When deploying Dragonfly2 in a containerized environment using StatefulSet, the Pods will retain their original hostnames after restarting, which seems difficult to change. Do you have any suggestions?
Additionally, the Pods that are killed should be marked as inactive, but the database continuously updates them to active.