[QUESTION] Is there any restriction on using allgather with moe_expert_capacity_factor?
Louis-J commented
Your question
There is a check in megatron/core/transformer/transformer_config.py (around line 401):

```python
if self.moe_expert_capacity_factor is not None:
    if self.moe_token_dispatcher_type not in ["alltoall", "alltoall_seq"]:
        raise ValueError(
            'moe_expert_capacity_factor only works with alltoall token dispatcher'
        )
```
The code that applies capacity_factor and padding in router.py does not seem to change the output tensor's shape, and I don't see any capacity_factor-specific handling in token_dispatcher.py. So why is moe_expert_capacity_factor restricted to the 'alltoall' and 'alltoall_seq' token dispatchers?
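To make the question concrete, here is a minimal sketch of how I understand capacity-factor-based dropping and padding to work in general. This is my own illustration, not Megatron-LM's actual implementation: the function name apply_capacity, the capacity formula, and the use of torch here are assumptions on my part.

```python
# Minimal sketch of capacity-based token dropping/padding (illustrative only,
# not Megatron-LM's actual code).
import math
import torch

def apply_capacity(tokens: torch.Tensor, expert_ids: torch.Tensor,
                   num_experts: int, capacity_factor: float) -> torch.Tensor:
    """Keep at most `capacity` tokens per expert and pad each expert's slice
    with zeros up to `capacity`, so every expert sees a fixed-size input."""
    num_tokens, hidden = tokens.shape
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)

    # Fixed-size buffer: [num_experts, capacity, hidden]
    expert_inputs = tokens.new_zeros(num_experts, capacity, hidden)
    for e in range(num_experts):
        selected = tokens[expert_ids == e][:capacity]    # drop overflow tokens
        expert_inputs[e, :selected.shape[0]] = selected  # pad the remainder
    return expert_inputs  # shape is independent of the actual routing result

# Example: 8 tokens, 2 experts, capacity_factor=1.0 -> capacity = 4 per expert
tokens = torch.randn(8, 16)
expert_ids = torch.randint(0, 2, (8,))
print(apply_capacity(tokens, expert_ids, num_experts=2, capacity_factor=1.0).shape)
```

If this is roughly what the router does, the per-expert output size is fixed by the capacity, which is why I don't see why the allgather dispatcher couldn't also consume it.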
Thanks in advance for any reply.