databricks/megablocks

Why not support tensor model parallelism?


After looking at the code, neither moe nor dmoe supports tensor model parallelism.
@tgale96

Does args.moe_weight_parallelism correspond to tensor model parallelism?

Hi! The weight parallelism argument turns on sharded data parallelism. If you set the expert parallelism arguments such that there is <1 expert per device, we'll use tensor parallelism on top of expert parallelism. I hope this helps! Let me know if there are other features you're looking for!
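For concreteness, here is a minimal sketch of the sharding arithmetic described above. The function and argument names (`sharding_degrees`, `moe_num_experts`, `expert_parallel_world_size`) are illustrative assumptions and are not taken verbatim from the MegaBlocks source:

```python
# Illustrative sketch only; names below are assumptions, not the exact
# MegaBlocks implementation.

def sharding_degrees(moe_num_experts: int, expert_parallel_world_size: int):
    # With at least as many experts as expert-parallel ranks, experts are
    # simply distributed across ranks (pure expert parallelism).
    expert_sharding_degree = min(expert_parallel_world_size, moe_num_experts)
    assert moe_num_experts % expert_sharding_degree == 0

    # With fewer experts than ranks (<1 expert per device), the leftover
    # ranks split each expert's weights, i.e. tensor parallelism layered
    # on top of expert parallelism.
    assert expert_parallel_world_size % expert_sharding_degree == 0
    hidden_sharding_degree = expert_parallel_world_size // expert_sharding_degree
    return expert_sharding_degree, hidden_sharding_degree


# Example: 4 experts on 8 expert-parallel ranks -> each expert's weights are
# split across 2 ranks: (expert_sharding_degree=4, hidden_sharding_degree=2).
print(sharding_degrees(moe_num_experts=4, expert_parallel_world_size=8))
```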

@tgale96
Thanks for your reply; I understand.
Does the current code already support using tensor parallelism when there is <1 expert per device?

The expert_sharding_degree logic shards the expert weights when the number of experts is less than expert_parallel_world_size, so the current code seems to support tensor parallelism.
Is my understanding correct?
@tgale96

If I don't want to use expert parallelism and instead want to use tensor parallelism directly, that should be theoretically possible even though the current code doesn't support it, right?

The sharding that happens when the expert sharding degree is less than the expert parallel world size is expert model parallel sharding. If you don't want to use expert model parallelism and want tensor parallelism instead, that would be a feature we'd have to implement separately. There is no theoretical limitation to having something like Megatron-style tensor model parallelism.
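For reference, a minimal sketch of what Megatron-style tensor model parallelism looks like for a single FFN block. This is not MegaBlocks code; the `TensorParallelMLP` class and the column-parallel/row-parallel split below are only illustrative, assuming a `torch.distributed` process group holds the tensor-parallel ranks:

```python
# Sketch of Megatron-style tensor model parallelism for one FFN.
# Not part of MegaBlocks; shown only to illustrate the feature discussed above.
import torch
import torch.nn as nn
import torch.distributed as dist


class TensorParallelMLP(nn.Module):
    def __init__(self, hidden_size: int, ffn_hidden_size: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        tp_world_size = dist.get_world_size(tp_group) if dist.is_initialized() else 1
        assert ffn_hidden_size % tp_world_size == 0
        local_ffn = ffn_hidden_size // tp_world_size

        # Column-parallel: each rank holds a slice of the first projection's
        # output features, so no communication is needed before the activation.
        self.w1 = nn.Linear(hidden_size, local_ffn, bias=False)
        # Row-parallel: each rank holds a slice of the second projection's
        # input features; partial outputs are summed with an all-reduce.
        self.w2 = nn.Linear(local_ffn, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.w2(torch.nn.functional.gelu(self.w1(x)))
        if dist.is_initialized():
            dist.all_reduce(y, group=self.tp_group)
        return y
```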

Thanks for your reply