dragonflyoss/Dragonfly2

No caching for multinode jobs

XRFXLP opened this issue · 0 comments

Bug report:

I applied this job spec in a kind cluster with Dragonfly installed in it:

apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: ollama/ollama:latest

But it takes ~40 seconds for both replicas to start.
When I checked the dfdaemon logs for the actual latencies, I saw the following:
replica 1:

# cat /var/log/dragonfly/daemon/core.log | grep "peer task done"
{"level":"info","ts":"2024-05-06 05:46:59.534","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 4996ms","peer":"10.244.2.10-7191-cbd4d8d8-0e8c-45fe-878d-e11a2fd510cd","task":"5e7266a546ad1b38e004b25644e55a5690144d08df8292a92c930e5b6c812122","component":"PeerTask","trace":"379dab5612981684d2348a5261cdc537"}
{"level":"info","ts":"2024-05-06 05:47:06.312","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 6736ms","peer":"10.244.2.10-7191-19c84662-e4b3-4022-b49d-4e9ff8b07f4d","task":"c29e0cfd160afeb47c93114dac063be4e562bce56fd769fc9826c5c6afd6652a","component":"PeerTask","trace":"15c40aff97bb16922eeba97ca517e84e"}
{"level":"info","ts":"2024-05-06 05:47:23.859","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 24292ms","peer":"10.244.2.10-7191-ad06d2ac-a0ce-4719-8fcb-091fe4b83c5b","task":"9e8b31d5424d9e3f0d49c4c4452afde2d15e34510ffefebdbe13f34fcfffc257","component":"PeerTask","trace":"bae802ac1dd2293825a9fb1fde2bb61a"}
{"level":"info","ts":"2024-05-06 05:47:28.965","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 29407ms","peer":"10.244.2.10-7191-b8eceaa4-b056-4876-98d3-ecdddfe076cc","task":"afca0a7d91e6980a4586d32192337a8faad03ca9561d781710f24a3fe021225b","component":"PeerTask","trace":"e43ca5b3587512a31b41fc02021adb0c"}

replica 2:

# cat /var/log/dragonfly/daemon/core.log | grep "peer task done"
{"level":"info","ts":"2024-05-06 05:46:59.536","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 1679ms","peer":"10.244.1.11-8058-90625bf5-f148-41d5-bc97-1032fb3ab3b8","task":"5e7266a546ad1b38e004b25644e55a5690144d08df8292a92c930e5b6c812122","component":"PeerTask","trace":"131c4becdc24e81feb3e71f42b083b41"}
{"level":"info","ts":"2024-05-06 05:47:10.238","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 10661ms","peer":"10.244.1.11-8058-a60ab0b1-4bf0-4fed-9025-80b191368382","task":"9e8b31d5424d9e3f0d49c4c4452afde2d15e34510ffefebdbe13f34fcfffc257","component":"PeerTask","trace":"7b3ecbc4de8bb2a9f8ee73f991ef3680"}
{"level":"info","ts":"2024-05-06 05:47:14.017","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 14447ms","peer":"10.244.1.11-8058-05b9e45e-de87-4d16-9168-cfbdab42eb44","task":"c29e0cfd160afeb47c93114dac063be4e562bce56fd769fc9826c5c6afd6652a","component":"PeerTask","trace":"4b604edae7e87f5c5239b0c00b5a0097"}
{"level":"info","ts":"2024-05-06 05:47:21.952","caller":"peer/peertask_conductor.go:1383","msg":"peer task done, cost: 22392ms","peer":"10.244.1.11-8058-f0d35f62-6200-47bf-afd9-1a541e4286d0","task":"afca0a7d91e6980a4586d32192337a8faad03ca9561d781710f24a3fe021225b","component":"PeerTask","trace":"85b1d0fded48d275fee9c6947048c3cc"}

Logs:
corelog-1.log
corelog-2.log
scheduler-core.log
seed-peer.log

Otherwise caching is working: when I manually pulled the image onto a different node afterwards, the latency was significantly lower.

Expected behavior:

I expected that there would not be any duplicate back-to-source downloads.

How to reproduce it:

  1. Create a Kubernetes cluster and install Dragonfly in it, following the Dragonfly installation instructions
  2. Install the Kubeflow training operator:
     kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
  3. Apply the TFJob spec above: kubectl apply -f job-spec.yaml
  4. Check the image pull time in kubectl describe pods
  5. Optionally, check the latencies by exec'ing into the dfdaemon pods: cat /var/log/dragonfly/daemon/core.log | grep "peer task done"

Environment:

  • Dragonfly version: chart: dragonfly-1.1.54
  • OS: Ubuntu-WSL
  • Kernel (e.g. uname -a): Linux NV-DNGSJL3 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux