intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
Python, NOASSERTION
Issues
Flash Checkpoint incomplete saving
#1378 opened by xuLn-0813 - 0
Re-implement the master using golang.
#1374 opened by workingloong - 2
llama2 test uses the wrong activation function
#1351 opened by Monekyzoon - 1
The controller manager restarts frequently
#1310 opened by sunjq1 - 1
Support HTTP for master-worker communication.
#1366 opened by BalaBalaYi - 1
dlrover-master was deleted
#1342 opened by zhangQiWorr - 5
xpu timer python package
#1159 opened by zxyyzx - 2
Question: How does DLRover integrate with Llama Factory?
#1244 opened by hetingyou - 2
What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover from GPU offline problems when TP and PP need to be reorganized?
#1243 opened by dotsonliu - 2
Can you create a dlrover arm64 image for Ascend NPU?
#1248 opened by xmarker - 7
client.connect(path) error when saving checkpoint
#1337 opened by atomrun39 - 9
Adapting dlrover to a new accelerator type and implementing a script similar to Nvidia_gpu.py
#1338 opened by lulu-0126 - 2
add xpu monitor for dlrover
#1290 opened by majieyue - 1
About XPU_TIMER.
#1344 opened by BalaBalaYi - 3
While using megatron distributed flash-checkpoint to recover, an error occurs in load_checkpoint
#1233 opened by deepcoldfish - 9
megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallel
#1146 opened by Lzhang-hub - 1
Can DLRover be applied to diffusion transformer training? And combined with deepspeed?
#1314 opened by TomSuen - 1
Enhance/Replace k8s python client.
#1291 opened by BalaBalaYi - 0
Add balance loss in atorch moe example
#1300 opened by skydoorkai - 3
easydl/elasticjob-controller:master image pull error
#1222 opened by xywangbuaa - 0
make deploy: image pull failed
#1333 opened by Ind1x1 - 9
[Error] When using deepspeed to start a megatron training task, only rank 0 of the flash checkpoint saves the model
#1199 opened by liangxuZhang - 5
When performing multi-node, multi-GPU training with Megatron-LM, if the 'rank' is only passed in the startup script and not set in the environment variables, an exception may occur (storage type is disk)
#1208 opened by lkq51 - 2
[observability] OTEL Trace/Event for training rendezvous, gpu check, flash checkpoint, etc.
#1132 opened by liyzcj - 5
Incomplete save of ckpt files
#1135 opened by husky23333 - 3
Error encountered when using flash checkpoint
#1144 opened by chencjcj - 3
Worker pod stuck in Pending state causing TimeoutError and incorrect handling by master
#1175 opened by TheAriaYang - 1
Flash checkpoint does not support safetensors
#1263 opened by Alex-Ruan - 1
missing elastic_training_pb2
#1266 opened by NiushanDong - 3
DLRover - Flyte integration
#1275 opened by davidmirror-ops - 1
Does deepspeed zero3 also save ckpt only in rank 0?
#1256 opened by Alex-Ruan - 0
transformers version?
#1221 opened by Alex-Ruan - 1
Why can't the checkpoint be copied to shared memory asynchronously when using Flash Checkpoint?
#1187 opened by Reflect0 - 0
What's the difference between MegatronCheckpointEngine and MegatronDistCheckpointEngine?
#1195 opened by liangxuZhang - 1
Megatron-LM core_r0.6.0 TP=4: saving ckpt raises RuntimeError: Fail to set metadata!
#1147 opened by SwordFaith - 1
straggler-detection
#1138 opened by alex337 - 0
Error in llama2 demo with pytorch 2.3.0
#1129 opened by SwordFaith