dragonflyoss/Dragonfly2

Scheduler Restart Causes dfdaemon and seed-peer to Report resolve Errors

Closed this issue · 3 comments

Bug report:

When the scheduler restarts, the dfdaemon and seed-peer report errors that they cannot resolve the scheduler, and the scheduler likewise reports that it cannot resolve the seed-peer.

In the manager database, the record for the deleted scheduler remains in an active state, while the newly created scheduler stays in an inactive state.
(screenshot: scheduler records in the manager database)

Here are the relevant logs:

  1. scheduler log:
    scheduler.log

  2. dfdaemon log:
    dfdaemon.log

  3. seed-peer log:
    seed-peer.log

  4. manager:
    manager.log

Expected behavior:

All components (scheduler, dfdaemon, and seed-peer) should be able to resolve each other correctly and operate without errors after the scheduler restarts.

How to reproduce it:

  1. Deploy Dragonfly2 using Helm
  2. Restart the scheduler Pod

Environment:

  • Dragonfly version: 2.1.30
  • OS:
  • Kernel (e.g. uname -a):
  • Others:

@yantingqiu Don't use the same hostname.

@gaius-qi When deploying Dragonfly2 in a containerized environment using StatefulSet, the Pods will retain their original hostnames after restarting, which seems difficult to change. Do you have any suggestions?
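Since a StatefulSet Pod keeps its hostname across restarts, one direction (purely a sketch, not Dragonfly's actual API: the `instance_id` helper and its inputs are hypothetical) is to derive the instance identifier from more than the hostname, e.g. hostname plus Pod IP, so the restarted Pod registers as a distinct instance instead of colliding with the stale record:

```python
import hashlib

def instance_id(hostname: str, ip: str) -> str:
    """Hypothetical helper: derive an instance ID from hostname + IP, so a
    restarted Pod with the same hostname but a new IP is treated as a new
    instance rather than colliding with the stale database record."""
    return hashlib.sha256(f"{hostname}-{ip}".encode()).hexdigest()[:16]

# A restarted StatefulSet Pod keeps its hostname but usually gets a new IP:
old = instance_id("scheduler-0", "10.244.1.7")
new = instance_id("scheduler-0", "10.244.1.9")
assert old != new  # distinct IDs let the manager retire the old record
```

With such a scheme, the manager could mark any record whose ID no longer re-registers as inactive, which matches the behavior the reporter expects below.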

Additionally, the Pods that are killed should be marked as inactive, but the manager keeps updating their records in the database to active.