openxla/xla

[Feature Request] Add more comm op support in gpu_hlo_cost_analysis

Such as allgather/reducescatter/alltoall, which are commonly used in FSDP/MoE models.
My suggestions:

  1. Decouple output_bytes_accessed and bytes_accessed: bytes_accessed should be only the operand size in bytes, instead of lumping operand and output bytes together.
  2. Add allgather/reducescatter/alltoall support.
  3. Support the ops above in gpu_collective_performance_model.cc.

We can make a table showing how many bytes are sent inter-node and intra-node.
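For illustration, here is a minimal sketch of how per-rank bytes on the wire could be estimated for these collectives, assuming ring-based algorithms; the names (`CollectiveKind`, `BytesSentPerRank`, `size_bytes`, `num_ranks`) are hypothetical and not part of the existing cost analysis:

```cpp
#include <cstdint>

enum class CollectiveKind { kAllGather, kReduceScatter, kAllToAll };

// Hypothetical helper: bytes each rank sends for a collective, assuming
// ring-based algorithms. For kAllGather, `size_bytes` is the per-rank
// input shard; for kReduceScatter/kAllToAll it is the full per-rank
// input buffer.
int64_t BytesSentPerRank(CollectiveKind kind, int64_t size_bytes,
                         int64_t num_ranks) {
  switch (kind) {
    case CollectiveKind::kAllGather:
      // Each rank forwards (num_ranks - 1) shards of shard size.
      return (num_ranks - 1) * size_bytes;
    case CollectiveKind::kReduceScatter:
    case CollectiveKind::kAllToAll:
      // Each peer receives one shard of the full input buffer.
      return (num_ranks - 1) * (size_bytes / num_ranks);
  }
  return 0;
}
```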

I have prepared a PR for these changes. Would this feature request be welcome?

An additional question:
How can I find out which GPUs are intra-node (on the same node)? Via the NVML API?
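For example, a minimal sketch of the NVML route, assuming the NVML headers and the -lnvidia-ml library are available. NVML only enumerates GPUs on the local machine, so every device it returns is intra-node by definition; ranks could exchange the local UUIDs to tell intra-node peers from inter-node ones:

```cpp
#include <cstdio>
#include <nvml.h>

int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) return 1;

  // NVML sees only local GPUs, so this lists the intra-node devices.
  unsigned int count = 0;
  nvmlDeviceGetCount_v2(&count);
  for (unsigned int i = 0; i < count; ++i) {
    nvmlDevice_t device;
    if (nvmlDeviceGetHandleByIndex_v2(i, &device) != NVML_SUCCESS) continue;
    char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
    if (nvmlDeviceGetUUID(device, uuid, sizeof(uuid)) == NVML_SUCCESS) {
      std::printf("local GPU %u: %s\n", i, uuid);
    }
  }
  nvmlShutdown();
  return 0;
}
```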

Could you please add more details on why the current model doesn't work for comm ops?

> Decouple output_bytes_accessed and bytes_accessed: bytes_accessed should be only the operand size in bytes, instead of lumping operand and output bytes together.

There is also operand_bytes_accessed. And bytes_accessed is just sum(operand_bytes_accessed) + output_bytes_accessed.
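As a concrete example (hypothetical numbers): for an all-gather over 8 ranks with a 4 MiB operand per rank, operand_bytes_accessed would be 4 MiB, output_bytes_accessed 32 MiB, and bytes_accessed their sum, 36 MiB.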

We don't use bytes_accessed in gpu_performance_model.cc, but I see uses in other parts of the codebase, and I can't predict all the implications if we change the semantics of the field.

@olegshyshkov OK, I understand~
So for the second feature, do we need more op support, like allgather/reducescatter/alltoall cost analysis? I have finished a PR to do this.

Yes, having more support for those ops would be good if you can add it.