databricks/megablocks

Why not support tensor model parallelism?


After looking at the code, neither moe nor dmoe supports tensor model parallelism.
@tgale96

Does args.moe_weight_parallelism correspond to tensor model parallelism?

Hi! The weight parallelism argument turns on sharded data parallelism. If you set the expert parallelism arguments such that there is <1 expert per device, we'll use tensor parallelism on top of expert parallelism. I hope this helps! Let me know if there are other features you're looking for!
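For concreteness, here is a minimal sketch of the sharding arithmetic described above. The function and argument names (`sharding_degrees`, `moe_num_experts`, `expert_parallel_world_size`) are illustrative assumptions and are not taken verbatim from the MegaBlocks source:

```python
# Illustrative sketch only; names below are assumptions, not the exact
# MegaBlocks implementation.

def sharding_degrees(moe_num_experts: int, expert_parallel_world_size: int):
    # With at least as many experts as expert-parallel ranks, experts are
    # simply distributed across ranks (pure expert parallelism).
    expert_sharding_degree = min(expert_parallel_world_size, moe_num_experts)
    assert moe_num_experts % expert_sharding_degree == 0

    # With fewer experts than ranks (<1 expert per device), the leftover
    # ranks split each expert's weights, i.e. tensor parallelism layered
    # on top of expert parallelism.
    assert expert_parallel_world_size % expert_sharding_degree == 0
    hidden_sharding_degree = expert_parallel_world_size // expert_sharding_degree
    return expert_sharding_degree, hidden_sharding_degree


# Example: 4 experts on 8 expert-parallel ranks -> each expert's weights are
# split across 2 ranks: (expert_sharding_degree=4, hidden_sharding_degree=2).
print(sharding_degrees(moe_num_experts=4, expert_parallel_world_size=8))
```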

@tgale96
Thanks for your reply; I understand.
Does the current code already support using tensor parallelism when there is <1 expert per device?

The expert_sharding_degree logic shards the expert weights when the number of experts is less than expert_parallel_world_size, so the current code seems to support tensor parallelism.
Is my understanding correct?
@tgale96

If I don't want to use expert parallelism and instead want to use tensor parallelism directly, that should be theoretically possible even though the current code doesn't support it, right?

The sharding that happens when the expert sharding degree is less than the expert parallel world size is expert model parallel sharding. If you don't want to use expert model parallelism and want tensor parallelism instead, that would be a feature we'd have to implement separately. There is no theoretical limitation to having something like Megatron-style tensor model parallelism.
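For reference, a minimal sketch of what Megatron-style tensor model parallelism looks like for a single FFN block. This is not MegaBlocks code; the `TensorParallelMLP` class and the column-parallel/row-parallel split below are only illustrative, assuming a `torch.distributed` process group holds the tensor-parallel ranks:

```python
# Sketch of Megatron-style tensor model parallelism for one FFN.
# Not part of MegaBlocks; shown only to illustrate the feature discussed above.
import torch
import torch.nn as nn
import torch.distributed as dist


class TensorParallelMLP(nn.Module):
    def __init__(self, hidden_size: int, ffn_hidden_size: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        tp_world_size = dist.get_world_size(tp_group) if dist.is_initialized() else 1
        assert ffn_hidden_size % tp_world_size == 0
        local_ffn = ffn_hidden_size // tp_world_size

        # Column-parallel: each rank holds a slice of the first projection's
        # output features, so no communication is needed before the activation.
        self.w1 = nn.Linear(hidden_size, local_ffn, bias=False)
        # Row-parallel: each rank holds a slice of the second projection's
        # input features; partial outputs are summed with an all-reduce.
        self.w2 = nn.Linear(local_ffn, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.w2(torch.nn.functional.gelu(self.w1(x)))
        if dist.is_initialized():
            dist.all_reduce(y, group=self.tp_group)
        return y
```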

Thanks for your reply