microsoft/mscclpp

[Feature] Enhance Python benchmark

Binyang2014 opened this issue · 0 comments

  • For allreduce4, make the pipeline factor an input parameter
  • Provide a proxy version of localReduceScatter and localAllGather
  • Provide a tuner in Python that goes through the predefined configurations to find the best #blocks/#threads for each CUDA kernel
  • Add the Python benchmark to the CI/CD pipeline
  • Move the other communication primitives into the Python benchmark and retire the C++ version
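
To make the tuner item more concrete, here is a minimal sketch of what it could look like. It is only an illustration, not the existing benchmark code: `tune`, `measure`, and the candidate lists are hypothetical names, and `measure` is assumed to launch the kernel with the given #blocks/#threads and return the average elapsed time (on a real GPU that would need proper synchronization, e.g. CUDA events, which is left out here).

```python
import itertools
from typing import Callable, Iterable, Tuple

Config = Tuple[int, int]  # (nblocks, nthreads)


def tune(
    measure: Callable[[int, int], float],
    block_counts: Iterable[int],
    thread_counts: Iterable[int],
) -> Tuple[Config, float]:
    """Try every predefined (nblocks, nthreads) pair and keep the fastest.

    `measure` is a hypothetical callback: it is assumed to run the kernel
    with the given launch configuration and return the elapsed time.
    """
    best_cfg = None
    best_time = float("inf")
    for nblocks, nthreads in itertools.product(block_counts, thread_counts):
        elapsed = measure(nblocks, nthreads)
        if elapsed < best_time:
            best_cfg, best_time = (nblocks, nthreads), elapsed
    if best_cfg is None:
        raise ValueError("no configurations to try")
    return best_cfg, best_time


if __name__ == "__main__":
    # Stand-in cost model so the sketch runs without a GPU; a real
    # `measure` would launch and time the CUDA kernel instead.
    def fake_measure(nblocks: int, nthreads: int) -> float:
        return abs(nblocks - 24) + abs(nthreads - 512) / 256

    cfg, t = tune(fake_measure, [8, 16, 24, 32], [256, 512, 1024])
    print(f"best config: nblocks={cfg[0]}, nthreads={cfg[1]}, time={t}")
```

An exhaustive sweep seems reasonable here because the set of predefined configurations is small; the best configuration could then be cached per message size so the benchmark does not have to re-tune on every run.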