SmartMoE performance issue

Question

Opened this issue a year ago · 0 comments

Hello,

In Megatron-DeepSpeed, I subclassed Megatron's SwitchMLP and SmartMoE's MegatronMLP so that I could compare the two.

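The model uses a GPT architecture with 1.3B parameters. Each implementation was run with 2 experts and without expert parallelism. Judging from the model structure, nothing looks wrong.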

smart-moe: MegatronMLP

megatron-lm: SwitchMLP

With the same dataset and batch size, SwitchMLP achieves higher throughput than MegatronMLP: 10.x versus 8.x TFLOPS, respectively.
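For context on what these numbers measure, below is a minimal sketch of how a per-GPU TFLOPS figure is commonly estimated from iteration time. It assumes the dense GPT FLOP estimate from the Megatron-LM paper, which ignores gating and dispatch overhead in MoE layers. With top-1 routing, each token still passes through a single expert MLP, so the dense count is only a rough proxy. The function names and the configuration values in the usage note are hypothetical and are not taken from my actual runs.

```python
def gpt_flops_per_iteration(batch_size, seq_len, num_layers, hidden_size,
                            vocab_size, recompute_activations=False):
    """Approximate FLOPs for one training iteration of a dense GPT model.

    Forward + backward costs 3x the forward pass; with full activation
    recomputation an extra forward pass is added, giving 4x.
    """
    factor = 4 if recompute_activations else 3
    return (
        24 * factor * batch_size * seq_len * num_layers * hidden_size ** 2
        * (1
           + seq_len / (6 * hidden_size)                    # attention score term
           + vocab_size / (16 * num_layers * hidden_size))  # output logit term
    )


def tflops_per_gpu(flops_per_iteration, iteration_seconds, num_gpus):
    """Convert a FLOP count and a measured iteration time into TFLOPS per GPU."""
    return flops_per_iteration / (iteration_seconds * num_gpus * 1e12)
```

A call might look like `tflops_per_gpu(gpt_flops_per_iteration(8, 2048, 24, 2048, 50257), 1.5, 1)`, where every argument is a placeholder. Because both runs use the same model size and batch size, the ratio of the two TFLOPS figures is simply the inverse ratio of their iteration times.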

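In the paper, you compared against DeepSpeed-MoE and the previous version of FastMoE. Have you also done a performance comparison against the MoE implementation in Megatron-LM?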