Issues
[QUESTION] How can I load a checkpoint trained by Megatron-LM 0.5 into Megatron-LM 0.7 to resume pretraining?
#1333 opened by IgorZan - 0
[BUG] MoE load balancing loss is accumulated twice when using activation checkpointing
#1330 opened by thuwzt - 0
[BUG] Megatron-LM with torch.compile: "The provided qkv memory layout is not supported!"
#1329 opened by qingshanxwx - 0
[QUESTION] Activation recomputation instead consumes more memory during the backward pass (OOM)
#1300 opened by KookHoiKim - 0
[QUESTION] About using StreamingLLM
#1326 opened by zhangyilalala - 1
[QUESTION] Scaling MFU calculation
#1276 opened by ltm920716 - 2
[BUG] Using different distributed strategies of Megatron-LM to train the llama3.1-8B model results in inconsistent training loss
#1324 opened by cailun01 - 2
[BUG] validate_yaml() isn't in sync with the argument checks
#1297 opened by pierric - 1
[QUESTION] Why is the initialization of the router and experts different in the MoE part?
#1302 opened by mxymxy77 - 0
[QUESTION] What could cause the following when executing the provided command? args.exit_on_missing_checkpoint is: True >> '--exit-on-missing-checkpoint' set ... exiting. <<
#1317 opened by Alinanini - 1
[QUESTION] How to convert torch_dist format checkpoint to torch format?
#1291 opened by zhangyilalala - 0
[BUG] When using LLaVA with freeze-LM, training on text-only samples raises an error.
#1314 opened by liveseongho - 0
[QUESTION] Gradient Propagation in backward pass
#1312 opened by arul-lm - 0
[QUESTION] UnboundLocalError: local variable 'output_tensor' referenced before assignment
#1311 opened by zmtttt - 0
[ENHANCEMENT] When load_ckpt is called and the loaded iteration count equals args.train_iters, train_step is skipped entirely, and save_checkpoint may then encounter an error.
#1310 opened by bphwk - 0
[QUESTION]
#1308 opened by eliird - 0
[BUG] Problem splitting transformer layers when they cannot be evenly divided across pipeline-parallel stages.
#1304 opened by Baibaifan - 1
[QUESTION] How are Transformer layers split when the pipeline is uneven?
#1303 opened by renyinCheng001 - 6
[BUG] The 0.9.0 release raises a param_gather_handle error with 3D parallelism
#1292 opened by SeunghyunSEO - 0
[BUG] The cached_loss_mask is not consistent
#1298 opened by XLzed - 0
[BUG] Segmentation fault: address not mapped to object at address (nil) when using the recompute granularity option
#1299 opened by KookHoiKim - 0
[BUG] LLaVA may fail with EPP=0 and PP>1
#1293 opened by lostkevin - 0
[QUESTION] DeepSeek-V2 compatibility?
#1295 opened by wavy-jung - 12
[BUG] training crash when set --tp-comm-overlap
#1274 opened by ltm920716 - 1
[BUG] Encountering NaN gradients when using CUDA Graph
#1279 opened by DXZDXZ - 3
[QUESTION] NVIDIA Megatron Core 0.9.0 does not have shared_experts.py
#1257 opened by clarence-lee-sheng - 0
[QUESTION] SGD support in distrib_optimizer.py
#1287 opened by zstreeter - 1
[BUG] Megatron-LM doesn't support transformer-engine 1.13
#1280 opened by klhhhhh - 0
[QUESTION] The optimizer state already contains 32-bit model parameters. Why do we need to store a separate copy of the model parameters in the checkpoint?
#1283 opened by leondada - 0
Where can I download the tokenizer for the model mcore-llava-mistral-7b-instruct-clip336-pretraining?
#1281 opened by herolxl - 0
[QUESTION] Are there any restrictions on using allgather with moe_expert_capacity_factor?
#1277 opened by Louis-J - 0
[BUG] TP-comm-overlap bug when replacing `TELayerNormColumnParallelLinear` with `TEColumnParallelLinear`.
#1275 opened by wplf - 0
[BUG] Flash attention cannot be applied by passing the --use-flash-attn flag when the --use-mcore-models flag is also passed
#1259 opened by efsotr - 0
[QUESTION] How to Visualize Computational Graph
#1272 opened by zixianwang2022 - 1
[BUG] Problem building the multimodal Dockerfile
#1267 opened by FortuneBush - 2
[ENHANCEMENT] Enabling LR scaling for a specific layer (e.g. down-projection...) during pretraining
#1263 opened by dhia680 - 0
[BUG] MoE pre-training does not scale beyond DP dim>8
#1258 opened by hwang595 - 0
[QUESTION] Transformer Engine is extremely frustrating to use.
#1239 opened by ZihaoZheng98