[Feature] Enhance Python benchmark
Binyang2014 opened this issue · 0 comments
Binyang2014 commented
- For allreduce4, make the pipeline factor an input parameter
- Provide a proxy version of localReduceScatter and localAllGather
- Provide a tuner in Python that goes through the predefined configurations to find the best #blocks/#threads for each CUDA kernel
- Put the Python benchmark into the CI/CD pipeline
- Move the other communication primitives into the Python benchmark and retire the C++ version
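
To make the tuner item more concrete, here is a minimal sketch of what such a search could look like. Everything in it is hypothetical and not taken from the existing benchmark: the names `tune`, `make_timer`, and `PREDEFINED_CONFIGS`, the candidate values, and the assumption that a kernel launch can be wrapped in a callable taking `(nblocks, nthreads)` and blocking until the kernel has finished.

```python
import itertools
import time
from typing import Callable, Iterable, Tuple

Config = Tuple[int, int]  # (nblocks, nthreads)

# Hypothetical candidate grid; real values would come from the benchmark.
PREDEFINED_CONFIGS = list(itertools.product([8, 16, 24, 32], [256, 512, 1024]))


def make_timer(launch: Callable[[int, int], None],
               warmup: int = 2, iters: int = 10) -> Callable[[int, int], float]:
    """Wrap a launch callable into a function returning mean seconds per call.

    Assumes `launch` blocks until the kernel is done (i.e. it synchronizes
    the device itself); otherwise the host timer would only measure the
    launch overhead.
    """
    def measure(nblocks: int, nthreads: int) -> float:
        for _ in range(warmup):
            launch(nblocks, nthreads)
        start = time.perf_counter()
        for _ in range(iters):
            launch(nblocks, nthreads)
        return (time.perf_counter() - start) / iters
    return measure


def tune(measure: Callable[[int, int], float],
         configs: Iterable[Config]) -> Tuple[Config, float]:
    """Try every configuration and return the fastest one with its time."""
    best_config = None
    best_time = float("inf")
    for nblocks, nthreads in configs:
        elapsed = measure(nblocks, nthreads)
        if elapsed < best_time:  # strict '<' keeps the first config on ties
            best_config, best_time = (nblocks, nthreads), elapsed
    if best_config is None:
        raise ValueError("no configurations to tune over")
    return best_config, best_time
```

With a real kernel, usage might look like `tune(make_timer(launch_allreduce), PREDEFINED_CONFIGS)`, where `launch_allreduce` is a placeholder for whatever wrapper the benchmark ends up exposing. Keeping the measurement function separate from the search means the tuner can be tested in CI without a GPU.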