[Feature Request] Add more comm op support in gpu_hlo_cost_analysis
Opened this issue · 4 comments
For example allgather/reducescatter/alltoall, which are commonly used in FSDP/MoE models.
My suggestions:
- Decouple `output_bytes_accessed` and `bytes_accessed`; `bytes_accessed` should be the operand size in bytes, not the two combined.
- Add allgather/reducescatter/alltoall support.
- Support the above ops in `gpu_collective_performance_model.cc`; we can make a table showing how many bytes are sent inter-node and intra-node.
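Such a table could be derived from per-rank byte counts. A minimal sketch, assuming the standard ring-algorithm cost model (the function name and its parameters are illustrative, not part of the XLA API):

```python
def collective_bytes_per_rank(op: str, num_ranks: int, shard_bytes: int) -> int:
    """Estimate bytes each rank sends for common collectives.

    Assumes ring-style algorithms in which every rank exchanges
    shard-sized messages with each of its (num_ranks - 1) peers.
    """
    if op in ("allgather", "reducescatter", "alltoall"):
        # all-gather: each rank sends its shard to every peer;
        # reduce-scatter is symmetric to all-gather;
        # all-to-all: each rank sends a distinct shard to every peer.
        return (num_ranks - 1) * shard_bytes
    raise ValueError(f"unsupported collective: {op}")

# Example: 8 ranks, 4 MiB shard per rank
print(collective_bytes_per_rank("allgather", 8, 4 * 1024 * 1024))  # 29360128
```

Splitting `num_ranks` into intra-node and inter-node peer counts would then give the two columns of the proposed table.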
I have prepared a PR for these changes. Is this feature request wanted?
Additional question: can I find out which GPUs are on the same node (intra-node)? Should I use the NVML API?
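One note on the intra-node question: NVML only enumerates GPUs on the local host, so node membership of remote peers has to come from the runtime's rank-to-node mapping. A minimal sketch of classifying peers given such a mapping (the function and its block-assignment assumption are hypothetical, not taken from XLA):

```python
def peers_by_locality(my_rank: int, world_size: int, gpus_per_node: int):
    """Split peer ranks into intra-node and inter-node lists,
    assuming ranks are assigned to nodes in contiguous blocks."""
    my_node = my_rank // gpus_per_node
    intra, inter = [], []
    for peer in range(world_size):
        if peer == my_rank:
            continue
        (intra if peer // gpus_per_node == my_node else inter).append(peer)
    return intra, inter

# Example: rank 1 of 8 ranks, 4 GPUs per node
print(peers_by_locality(1, 8, 4))  # ([0, 2, 3], [4, 5, 6, 7])
```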
Could you please add more details on why the current model doesn't work for comm ops?
> Decouple `output_bytes_accessed` and `bytes_accessed`; `bytes_accessed` should be the operand size in bytes, not the two combined.
There is also `operand_bytes_accessed`. And `bytes_accessed` is just `sum(operand_bytes_accessed) + output_bytes_accessed`.

We don't use `bytes_accessed` in `gpu_performance_model.cc`, but I see uses in other parts of the codebase and I can't predict all the implications if we change the semantics of the field.
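The field relationship can be written out directly. This is a sketch mirroring the semantics as stated in this thread, not the actual `HloCostAnalysis` code:

```python
def bytes_accessed(operand_bytes_accessed, output_bytes_accessed):
    """Per this thread, bytes_accessed is simply the sum of all
    operand bytes plus the output bytes."""
    return sum(operand_bytes_accessed) + output_bytes_accessed

# Example: two 1 KiB operands and one 2 KiB output
print(bytes_accessed([1024, 1024], 2048))  # 4096
```

This makes the compatibility concern concrete: redefining `bytes_accessed` as operand bytes only would silently change the value for every existing caller that relies on the sum.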
@olegshyshkov OK, I understand~

So for the second feature: do we need cost analysis support for more ops like allgather/reducescatter/alltoall? I have finished a PR that does this.
Yes, having more support for those ops would be good if you can add it.