alibaba/Pai-Megatron-Patch
The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Python · Apache-2.0
Issues
- 2
Are there plans to support qwen2-vl?
#339 opened by divisionblur - 5
Suggestion: run SFT tests on deepseek-v2-coder-lite
#342 opened by bao-xiaoyi - 0
AssertionError: Rank 11: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3
#366 opened by lanfengmo - 0
Possible bug in Mistral MCore <->HF Model conversions because of _extra_state layers
#363 opened by abgoswam - 3
llava run error
#330 opened by yangzhipeng1108 - 1
Question about adapting the LLAMA 3.1 model
#361 opened by echo-valor - 0
Loss spikes after expanding the vocabulary of qwen-2.5
#360 opened by QianguoS - 1
cannot import name 'TEDotProductAttentionMLA' when running `examples/deepseek_v2/run_mcore_deepseek.sh`
#359 opened by dreasysnail - 0
No module named 'megatron'
#357 opened by yuanzhiyong1999 - 4
Sorry to bother you: an issue about multi-node training
#307 opened by CallmeZhangChenchen - 2
- 1
DeepSeek Vocab-size Mismatch
#338 opened by Jiayi-Pan - 7
Failed to join the group chat; the second group can no longer be joined by scanning the QR code either
#351 opened by GeorgeSen - 1
qwen2.5 conversion script raises an error during conversion
#354 opened by enze5088 - 3
- 1
[[: not found Zarr-based strategies will not be registered because of missing packages Traceback (most recent call last)
#346 opened by aJupyter - 1
optimizer offload
#352 opened by leo-ztjht - 0
Several errors reported while converting the model
#350 opened by Yanhong-Li - 0
Training llama3.1 8b with a 32k context: long training time and loss on the high side
#348 opened by ARQlalala - 1
llama3.1: support for mixed pretraining on multiple datasets
#347 opened by Bob199511 - 0
Are there plans to support minicpm?
#345 opened by adol001 - 2
llama7b OOM issue
#343 opened by mxjmtxrm - 4
qwen2-sft hangs at the very start of training
#325 opened by baisechundu - 2
Question about llava adaptation
#333 opened by divisionblur - 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size
#332 opened by pizts - 7
deepseek model conversion issue
#327 opened by bao-xiaoyi - 2
TypeError: get_cpu_offload_context() missing 1 required positional argument: 'weight_offloading'
#324 opened by ben-8878 - 2
About fine-tuning qwen2 using the idxmap format
#319 opened by Gloid59 - 1
Should this parameter be removed for the Qwen2 0.5B and 1.5B models?
#296 opened by MrWaterZhou - 2
OSError: [Errno 28] No space left on device (request for help)
#302 opened by shyzzz521 - 3
Does Mcore not support pp?
#312 opened by divisionblur - 3
Which version of megatron-lm does starcoder depend on?
#314 opened by bao-xiaoyi - 1
Channel Loss support
#316 opened by echo-valor - 1
Issue with resuming training from a checkpoint
#318 opened by divisionblur - 1
mmap data format issue
#320 opened by bao-xiaoyi - 1
Failed to install pyarrow
#321 opened by xiaoquanWu - 2
mcore weight conversion does not support pp>1
#322 opened by xs1997zju - 1
Training Qwen1.5 1.8B with flash-attn shows no noticeable speedup
#323 opened by coder-wangzhen - 1
- 3
QwenTokenizer and Qwen2Tokenizer
#295 opened by sexan - 0
distrib_optim.pt is missing from the saved checkpoints
#315 opened by shizikachen - 5
The DingTalk group is full
#304 opened by divisionblur - 3
Initial loss is abnormal when seq len is increased
#300 opened by Jayce1kk - 1
Is sharegpt-format data supported? Or multi-turn dialogue data with a "history" field?
#306 opened by jiejie1993 - 1
Flash-Attn 3 support
#308 opened by echo-valor - 1
optimizer offloading is really powerful
#311 opened by 154912369 - 5
Missing key(s) in state_dict: llama3 weights do not match after mcore conversion
#303 opened by wuduher - 2
The bigcode-evaluation-harness repository appears to be gone
#301 opened by CallmeZhangChenchen - 0
[rank31]: OSError: error stat()ing file (dataset map issue)
#305 opened by shyzzz521 - 1
Package conflicts in the nvcr.io/nvidia/pytorch:23.12-py3 image
#294 opened by wuduher