Issues
[QUESTION] How can I load a checkpoint trained by Megatron-LM 0.5 into Megatron-LM 0.7 to resume pretraining?
#1333 opened by IgorZan - 0
[BUG] MoE load balancing loss is accumulated twice when using activation checkpointing
#1330 opened by thuwzt - 0
[BUG] Megatron-LM with torch.compile: "The provided qkv memory layout is not supported!"
#1329 opened by qingshanxwx - 0
[QUESTION] Activation recomputation instead consumes more memory during the backward pass (OOM)
#1300 opened by KookHoiKim - 0
[QUESTION] About using StreamingLLM
#1326 opened by zhangyilalala - 1
[QUESTION] Scaling MFU calculation
#1276 opened by ltm920716 - 2
[BUG] Using different distributed strategies of Megatron-LM to train the llama3.1-8B model results in inconsistent training loss
#1324 opened by cailun01 - 2
[BUG] validate_yaml() isn't in sync with the argument checks
#1297 opened by pierric - 1
[QUESTION] Why is the initialization of the router and experts different in the MoE part?
#1302 opened by mxymxy77 - 0
[QUESTION] What could cause the following when executing the provided command? args.exit_on_missing_checkpoint is: True >> '--exit-on-missing-checkpoint' set ... exiting. <<
#1317 opened by Alinanini - 1
[QUESTION] How to convert torch_dist format checkpoint to torch format?
#1291 opened by zhangyilalala - 0
[BUG] When using LLaVA with freeze-LM, training on text-only samples raises an error.
#1314 opened by liveseongho - 0
[QUESTION] Gradient Propagation in backward pass
#1312 opened by arul-lm - 0
[QUESTION] UnboundLocalError: local variable 'output_tensor' referenced before assignment
#1311 opened by zmtttt - 0
[ENHANCEMENT] When load_ckpt is called and the loaded iteration count equals args.train_iters, train_step is skipped entirely, and save_checkpoint may then encounter an error.
#1310 opened by bphwk - 0
[QUESTION]
#1308 opened by eliird - 0
[BUG] Problem splitting transformer layers when they cannot be evenly divided across pipeline-parallel stages.
#1304 opened by Baibaifan - 1
[QUESTION] How are Transformer layers split when the pipeline is uneven?
#1303 opened by renyinCheng001 - 6
[BUG] The 0.9.0 release raises a param_gather_handle error with 3D parallelism
#1292 opened by SeunghyunSEO - 0
[BUG] The cached_loss_mask is not consistent
#1298 opened by XLzed - 0
[BUG] Segmentation fault: address not mapped to object at address (nil) when using the recompute granularity option
#1299 opened by KookHoiKim - 0
[BUG] LLaVA may fail with EPP=0 and PP>1
#1293 opened by lostkevin - 0
[QUESTION] DeepSeek-V2 compatibility?
#1295 opened by wavy-jung - 12
[BUG] training crash when set --tp-comm-overlap
#1274 opened by ltm920716 - 1
[BUG] Encountering NaN gradients when using CUDA Graph
#1279 opened by DXZDXZ - 3
[QUESTION] NVIDIA Megatron Core 0.9.0 does not have shared_experts.py
#1257 opened by clarence-lee-sheng - 0
[QUESTION] SGD support in distrib_optimizer.py
#1287 opened by zstreeter - 1
[BUG] Megatron-LM doesn't support transformer-engine 1.13
#1280 opened by klhhhhh - 0
[QUESTION] The optimizer state already contains 32-bit model parameters. Why do we need to store a separate copy of the model parameters in the checkpoint?
#1283 opened by leondada - 0
Where can I download the tokenizer for the model mcore-llava-mistral-7b-instruct-clip336-pretraining?
#1281 opened by herolxl - 0
[QUESTION] Are there any restrictions on using allgather with moe_expert_capacity_factor?
#1277 opened by Louis-J - 0
[BUG] TP-comm-overlap bug when replacing `TELayerNormColumnParallelLinear` with `TEColumnParallelLinear`.
#1275 opened by wplf - 0
[BUG] Flash attention cannot be applied by passing the --use-flash-attn flag when the --use-mcore-models flag is also passed
#1259 opened by efsotr - 0
[QUESTION] How to Visualize Computational Graph
#1272 opened by zixianwang2022 - 1
[BUG] Problem building the multimodal Dockerfile
#1267 opened by FortuneBush - 2
[ENHANCEMENT] Enabling LR scaling for a specific layer (e.g. down-projection...) during pretraining
#1263 opened by dhia680 - 0
[BUG] MoE pre-training does not scale beyond DP dim>8
#1258 opened by hwang595 - 0
[QUESTION] Transformer Engine is extremely frustrating to use.
#1239 opened by ZihaoZheng98