kaushikcfd/feinsum

Add parallel tuning capability


Questions that need to be answered:

  1. What to use for the distributed executor?
    • My vote is for mpi4py, which has proven to be quite stable over the years.
  2. Should we make the distributed package a hard dep?
    • My vote is "No"
  3. How could the current implementation in feinsum be changed?
    • Option 1. We could rewrite feinsum.tuning.OpentunerTuner to behave like a server-client model where only one rank writes to the database and generates inputs for the search exploration. I am slightly worried about the scalability, but since in the common case evaluating a point in the search space costs us ~10 seconds, maybe this is not a big problem.
    • Option 2. Each rank runs its own search-space exploration and, after each run, broadcasts the timing result for the other ranks to add as extra seed configurations. I'm not sure whether opentuner allows us to seed configurations during a run, or what the database update costs are when multiple processes perform the updates. (A rough sketch of the MPI side of this idea follows below the list.)
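
As a rough, non-authoritative sketch of the MPI side of Option 2 (whether the gathered configurations can actually be seeded into a running opentuner search is the open question), each rank could exchange its latest (configuration, timing) pair with all other ranks via an allgather. Here my_config and my_time are hypothetical placeholders for whatever the local search produces:

from mpi4py import MPI

comm = MPI.COMM_WORLD

# After each local search step: share this rank's latest (config, time) pair
# so every other rank can (if opentuner permits it) inject the foreign configs
# as extra seed configurations.
local_result = (my_config, my_time)         # hypothetical: produced by the local search
all_results = comm.allgather(local_result)  # every rank now sees every rank's result
foreign_seeds = [cfg for i, (cfg, _) in enumerate(all_results)
                 if i != comm.Get_rank()]
# Feeding `foreign_seeds` into the running opentuner search is the part whose
# feasibility is unclear.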

I have looked at the mpi4py task pool, mpipool, schwimmbad, and the charm4py task pools. The mpi4py task pool makes the middle two fairly redundant IMO. Between charm4py and mpi4py, mpi4py is easier to build and install, but certain pool executors break or hang on Spectrum MPI. The charm4py pools seemed more stable, but charm4py takes more effort to build. Most of the basic task execution functionality is similar between the two, so it wouldn't be too hard to add both.
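
If a task pool did turn out to be the right fit, a minimal sketch using mpi4py's built-in pool (mpi4py.futures.MPIPoolExecutor) might look like the following; generate_candidates, measured_runtime, and record_result are hypothetical placeholders for feinsum's search and database logic:

from mpi4py.futures import MPIPoolExecutor

def time_candidate(config):
    # hypothetical: build the transformed einsum for `config` and benchmark it
    return config, measured_runtime(config)

if __name__ == "__main__":
    # run as: mpiexec -n <N> python -m mpi4py.futures this_script.py
    candidates = generate_candidates()          # hypothetical: points to evaluate
    with MPIPoolExecutor() as executor:
        for config, runtime in executor.map(time_candidate, candidates):
            record_result(config, runtime)      # hypothetical: write to the tuning DB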

A third option might be to divide up the search space in some way and then have each rank search within its subspace. This could introduce load balancing problems, however.
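
A minimal sketch of that partitioning, assuming the search space can be enumerated (or indexed) up front and using a simple round-robin split by rank; enumerate_search_space, evaluate, and record_result are hypothetical placeholders:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

all_points = enumerate_search_space()      # hypothetical: the full search space
my_points = all_points[rank::size]         # round-robin slice for this rank
for point in my_points:
    # ranks whose slices finish early will sit idle: the load-balancing concern
    record_result(point, evaluate(point))  # hypothetical helpers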

Keeping it as a soft dependency seems fine for now.

Most of the basic task execution functionality is similar between the two so it wouldn't be too hard to add both.

I don't think we need a task pool for any of the options here. Option 1 can be done with simple MPI communication primitives: all the client ranks know that rank 0 is the server, and the server rank receives from MPI_ANY_SOURCE. The server rank would sit in a loop like:

while True:
    Blocking RECV from any source.
    Record the result in the database and send the next iteration point to the rank that just completed.
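
A minimal mpi4py sketch of that server-client loop, assuming rank 0 is the server; next_point, record_result, and evaluate are hypothetical placeholders for the search strategy, the database write, and the benchmarking step:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Server: hand out points, collect timings, record them in the database.
    active_clients = comm.Get_size() - 1
    while active_clients:
        status = MPI.Status()
        # Blocking receive from whichever client reports next (MPI_ANY_SOURCE).
        result = comm.recv(source=MPI.ANY_SOURCE, status=status)
        if result is not None:
            record_result(result)            # hypothetical: write timing to the DB
        point = next_point()                 # hypothetical: ask the search strategy
        comm.send(point, dest=status.Get_source())
        if point is None:                    # search exhausted; that client exits
            active_clients -= 1
else:
    # Client: announce readiness (no result yet), then evaluate until told to stop.
    result = None
    while True:
        comm.send(result, dest=0)
        point = comm.recv(source=0)
        if point is None:
            break
        result = evaluate(point)             # hypothetical: benchmark this point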

Let me know if I'm missing something.

Option (3) is also interesting; a mild concern there is that it might force us into a sub-optimal search-exploration strategy.

Ray may (or may not) be useful here https://docs.ray.io/en/latest/index.html

Thanks,
I looked at the example in https://docs.ray.io/en/latest/tune/index.html and that should blend well with feinsum's description of parameter spaces. Some downsides I see are:

  1. It seems like a heavyweight library that would pull in lots of dependencies.
  2. It seems like the distributed tuning depends on Kubernetes infrastructure, which again is a bit heavyweight.
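
For reference, one way the parameter-space description could map onto Ray Tune, roughly following the quickstart-style API at the linked page (the config keys and the runtime measurement below are hypothetical stand-ins for feinsum's transform parameters, and the exact API is version-dependent):

from ray import tune

def objective(config):
    # hypothetical: build and benchmark the transformed einsum for this config
    runtime = measure_runtime(config)       # hypothetical
    return {"runtime": runtime}

search_space = {
    "i_tile": tune.choice([4, 8, 16, 32]),  # hypothetical transform parameters
    "j_tile": tune.choice([4, 8, 16, 32]),
}

tuner = tune.Tuner(objective, param_space=search_space)
results = tuner.fit()
print(results.get_best_result(metric="runtime", mode="min").config)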