intelligent-machine-learning/dlrover

DLRover: An Automatic Distributed Deep Learning System

Python · NOASSERTION

Issues

Flash Checkpoint incomplete saving
#1378 opened 23 days ago by xuLn-0813
3
The unit test case test_orphan_workers takes too long to execute.
#1381 opened a month ago by workingloong
0
Re-implement the master using golang.
#1374 opened a month ago by workingloong
0
llama2 test uses the wrong activation function
#1351 opened a month ago by Monekyzoon
2
AttributeError: module 'collections' has no attribute 'Sequence'
#1332 opened 2 months ago by linzhidao1010
1
The controller manager restarts frequently
#1310 opened 2 months ago by sunjq1
0
Will flashcheckpoint support fully parallel save in megatron core 0.7+?
#1363 opened a month ago by leondada
1
How about migrating tfplus and atorch to independent repositories?
#1358 opened a month ago by mingcheng
3
Support HTTP for master-worker communication.
#1366 opened a month ago by BalaBalaYi
0
dlrover-master was deleted
#1342 opened 2 months ago by zhangQiWorr
1
xpu timer python package
#1159 opened 7 months ago by zxyyzx
5
Question: How does DLRover integrate with Llama Factory?
#1244 opened 4 months ago by hetingyou
2
What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover from GPU offline problems when TP and PP need to be reorganized?
#1243 opened a month ago by dotsonliu
2
Can you create a dlrover arm64 image for Ascend NPU?
#1248 opened 4 months ago by xmarker
2
client.connect(path) error when saving checkpoint
#1337 opened 2 months ago by atomrun39
7
Adapting dlrover to a new accelerator type and implementing a script similar to Nvidia_gpu.py
#1338 opened 2 months ago by lulu-0126
9
add xpu monitor for dlrover
#1290 opened 3 months ago by majieyue
2
About XPU_TIMER.
#1344 opened 2 months ago by BalaBalaYi
1
Error occurs in load_checkpoint when using megatron distributed flash-checkpoint to recover
#1233 opened 5 months ago by deepcoldfish
3
megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallel
#1146 opened 7 months ago by Lzhang-hub
9
Can DLRover be applied to diffusion transformer training, and combined with deepspeed?
#1314 opened 2 months ago by TomSuen
1
How does dlrover make sure all the nodes in one job are under one switch?
#1298 opened 3 months ago by gangxie112
1
Enhance/Replace k8s python client.
#1291 opened 3 months ago by BalaBalaYi
1
Add balance loss in atorch moe example
#1300 opened 3 months ago by skydoorkai
0
easydl/elasticjob-controller:master image pull error
#1222 opened 2 months ago by xywangbuaa
3
make deploy: image pull failed
#1333 opened 2 months ago by Ind1x1
0
Why is model_optim_rng.pt saved in a separate directory?
#1225 opened 2 months ago by zhaoyang-star
9
Scaled-down allreduce pytorch job won't complete and reports an error
#1215 opened 2 months ago by cocodee
3
[Error] When using deepspeed to start a megatron training task, only rank 0 of the flash checkpoint saves the model
#1199 opened 2 months ago by liangxuZhang
4
When performing multi-node, multi-GPU training with Megatron-LM, if the 'rank' is only passed in the startup script and not set in the environment variables, an exception may occur (storage type is disk)
#1208 opened 2 months ago by lkq51
5
[observability] OTEL Trace/Event for training rendezvous, gpu check, flash checkpoint, etc.
#1132 opened 2 months ago by liyzcj
2
Incomplete save of ckpt files
#1135 opened 2 months ago by husky23333
5
example failed: examples/tensorflow/criteo_deeprec/manual_job.yaml
#1136 opened 2 months ago by jason-i-vv
3
Error encountered when using flash checkpoint
#1144 opened 2 months ago by chencjcj
3
How to use the elasticity and fault tolerance in a Volcano job.
#1172 opened 2 months ago by workingloong
3
Worker pod stuck in Pending state causing TimeoutError and incorrect handling by master
#1175 opened 2 months ago by TheAriaYang
3
Error encountered while using flash attention in TensorFlow
#1180 opened 3 months ago by monatis
1
Flash checkpoint does not support safetensors
#1263 opened 3 months ago by Alex-Ruan
2
missing elastic_training_pb2
#1266 opened 3 months ago by NiushanDong
1
Errors in dlrover after pip-installing the dlrover package
#1260 opened 3 months ago by Desperadoze
3
DLRover - Flyte integration
#1275 opened 3 months ago by davidmirror-ops
2
Does deepspeed zero3 also save ckpt only in rank 0?
#1256 opened 4 months ago by Alex-Ruan
1
Why is model_optim_rng.pt not saved when dlrover is enabled?
#1223 opened 5 months ago by zhaoyang-star
0
transformers version?
#1221 opened 5 months ago by Alex-Ruan
1
Why can't the checkpoint be copied to shared memory asynchronously when using Flash Checkpoint?
#1187 opened 6 months ago by Reflect0
1
What's the difference between MegatronCheckpointEngine and MegatronDistCheckpointEngine?
#1195 opened 6 months ago by liangxuZhang
0
Megatron-LM core_r0.6.0 TP=4: saving ckpt raises RuntimeError: Fail to set metadata!
#1147 opened 7 months ago by SwordFaith
1
possible typo in the example of [tf_elasticjob_on_k8s]
#1123 opened 7 months ago by lichadehehehe
1
straggler-detection
#1138 opened 7 months ago by alex337
5
Error in llama2 demo with pytorch 2.3.0
#1129 opened 8 months ago by SwordFaith
0