volcengine/veScale

A PyTorch Native LLM Training Framework

Python · Apache-2.0

Issues

[ENHANCEMENT] Extending to DeepSpeed
#57 opened a month ago by moghadas76
0
[QUESTION] How is vescale zero2 implemented?
#54 opened 2 months ago by starstream
5
[QUESTION] How to Use ndtimeline in a Multi-Machine Multi-GPU Environment
#55 opened 2 months ago by zmtttt
4
[QUESTION] questions about Collective Communication Group Initialization Optimization in the paper
#40 opened 2 months ago by siddharthaOnRoad
2
[QUESTION] How to use MQhandler for multi machines?
#56 opened 2 months ago by zmtttt
2
> Using ndtimeline-tool to Monitor Megatron-GPT I want to use the ndtimeline-tool to monitor the computation and communication of each rank in Megatron-GPT. I have two concerns:
#53 opened 2 months ago by zmtttt
3
[QUESTION] Using ndtimeline-tool to Monitor Megatron-GPT
#51 opened 2 months ago by zmtttt
1
[RFC] Single-Device-Abstract DDP
#52 opened 3 months ago by lllukehuang
1
The times for forward-compute and backward-compute captured by the ndtimeline-tool are inaccurate
#47 opened 3 months ago by zmtttt
10
[QUESTION] implementation of `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`
#46 opened 3 months ago by nooblyh
5
[QUESTION] How flexible is the veScale checkpoint
#43 opened 4 months ago by Dream-Seeker123
3
[QUESTION] how and where to use multi-node trace profiler in paper of megascale
#37 opened 4 months ago by oliverYoung2001
3
[RFC] veScale: High-Level API for nD Parallel Training
#39 opened 6 months ago by leonardo0lyj
0
[QUESTION] `vescale.dtensor` vs "PyTorch DTensor"
#28 opened 8 months ago by GHGmc2
3
[QUESTION] Save checkpoint
#26 opened 8 months ago by Ryanuppp
1
[QUESTION] will the patch code merged into upstream?
#17 opened 8 months ago by ultranity
2
Code Example & Docs
#14 opened 8 months ago by ultranity
2