[QUESTION] Is there any restriction on using allgather with moe_expert_capacity_factor?
Louis-J commented
Your question
There is a check in megatron/core/transformer/transformer_config.py (around line 401):

```python
if self.moe_expert_capacity_factor is not None:
    if self.moe_token_dispatcher_type not in ["alltoall", "alltoall_seq"]:
        raise ValueError(
            'moe_expert_capacity_factor only works with alltoall token dispatcher'
        )
```
The code that applies capacity_factor and padding in router.py does not seem to change the output tensor's shape, and I don't see any capacity_factor-specific handling in token_dispatcher.py. So why is moe_expert_capacity_factor restricted to the 'alltoall' and 'alltoall_seq' token dispatchers?
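To make the question concrete, here is a minimal sketch of how I understand capacity-factor-based dropping and padding to work in general. This is my own illustration, not Megatron-LM's actual implementation: the function name apply_capacity, the capacity formula, and the use of torch here are assumptions on my part.

```python
# Minimal sketch of capacity-based token dropping/padding (illustrative only,
# not Megatron-LM's actual code).
import math
import torch

def apply_capacity(tokens: torch.Tensor, expert_ids: torch.Tensor,
                   num_experts: int, capacity_factor: float) -> torch.Tensor:
    """Keep at most `capacity` tokens per expert and pad each expert's slice
    with zeros up to `capacity`, so every expert sees a fixed-size input."""
    num_tokens, hidden = tokens.shape
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)

    # Fixed-size buffer: [num_experts, capacity, hidden]
    expert_inputs = tokens.new_zeros(num_experts, capacity, hidden)
    for e in range(num_experts):
        selected = tokens[expert_ids == e][:capacity]    # drop overflow tokens
        expert_inputs[e, :selected.shape[0]] = selected  # pad the remainder
    return expert_inputs  # shape is independent of the actual routing result

# Example: 8 tokens, 2 experts, capacity_factor=1.0 -> capacity = 4 per expert
tokens = torch.randn(8, 16)
expert_ids = torch.randint(0, 2, (8,))
print(apply_capacity(tokens, expert_ids, num_experts=2, capacity_factor=1.0).shape)
```

If this is roughly what the router does, the per-expert output size is fixed by the capacity, which is why I don't see why the allgather dispatcher couldn't also consume it.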
Thanks in advance for any reply.