marsupialtail/quokka

reduce initialization overhead - meta thread

marsupialtail opened this issue · 5 comments

Quokka currently has very high initialization overhead for launching input reader actors.

This should go away once we migrate to an architecture with only one actor per machine, but until then it would be good to have some way of reducing this initialization overhead.

This is similar to an old Spark optimization that moved input partitioning off the coordinator and parallelized it across the workers.

The problem is mainly twofold:

  • Most of the initialization is done in Python and serially, and might involve network calls, e.g. instantiating a pyarrow Parquet S3 dataset and dividing up the work (see the sketch after this list).
  • Launching actors in Ray is expensive.
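
As a rough illustration of the first point, the serial planning path looks something like the following. This is a minimal sketch, not Quokka's actual code; the bucket path and the division-of-work step are assumptions:

```python
import pyarrow.dataset as ds

# All of this runs serially on the coordinator today. Opening an S3-backed
# dataset and listing its fragments each involve network round trips.
dataset = ds.dataset("s3://my-bucket/lineitem/", format="parquet")  # network call
fragments = list(dataset.get_fragments())                           # serial listing
# ...then the coordinator divides the fragments among input reader actors.
```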

Our architecture change should take care of the second problem but won't solve the first. We should work on parallelizing this process: the worker nodes, which should already be alive while all this is happening, can do most of this work and communicate the results back to the master through the Ray object store.
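
A minimal sketch of what that could look like with plain Ray tasks. The `plan_splits` helper and the paths are hypothetical; the point is just that the network-bound planning runs on the workers in parallel and the results come back via the object store:

```python
import ray
import pyarrow.dataset as ds

ray.init(ignore_reinit_error=True)

@ray.remote
def plan_splits(path: str):
    # Runs on a worker: the network calls to open the dataset and enumerate
    # its fragments happen here, in parallel, instead of on the coordinator.
    dataset = ds.dataset(path, format="parquet")
    return [frag.path for frag in dataset.get_fragments()]

# The coordinator fans out the planning work and collects the results
# through the Ray object store.
refs = [plan_splits.remote(p) for p in ["s3://my-bucket/part-0/", "s3://my-bucket/part-1/"]]
splits = ray.get(refs)
```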

That said, Quokka is still useful for very long jobs, where this init overhead (< 1 min) becomes insignificant.

E.g. this is executing locally on a ~500MB Parquet dataset:

```
Parquet dataset at /home/ziheng/tpc-h/lineitem.parquet has total 6001215 rows
actor spin up took 4.991477727890015
init time 7.50970721244812
run time 1.1990654468536377
```

Initialization time has drastically improved. However, we still need to make sure we don't relaunch TaskManagers every time a new TaskGraph is instantiated -- TaskManagers should be tied to a QuokkaContext, not a TaskGraph.
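
A minimal sketch of that ownership change. The `QuokkaContext` / `TaskManager` / `TaskGraph` names come from the comment above; the launch helper, cluster object, and attributes are hypothetical:

```python
def launch_task_manager(node):
    # Placeholder for the expensive Ray actor launch described in this issue.
    ...

class QuokkaContext:
    def __init__(self, cluster):
        self.cluster = cluster
        self._task_managers = None  # launched once, reused across graphs

    def task_managers(self):
        # Lazily launch the TaskManagers the first time any graph needs them.
        if self._task_managers is None:
            self._task_managers = [launch_task_manager(n) for n in self.cluster.nodes]
        return self._task_managers

class TaskGraph:
    def __init__(self, context: "QuokkaContext"):
        # Reuse the context's TaskManagers instead of relaunching actors
        # for every new graph.
        self.task_managers = context.task_managers()
```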

This just got much more important due to the way Quokka executes multi-stage queries....