insitro/redun

Scheduler becomes a bottleneck when tasks return large Python objects


Hey there Redun team!

This is a scaling issue we've run into on my team. When running a pipeline in which multiple tasks return large Python objects, the redun scheduler becomes very slow, since it unpickles, re-pickles, and stores each large object in the db/S3 one at a time.
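For concreteness, here's a minimal sketch of the pipeline shape that triggers this (task names, shapes, and shard counts are made up for illustration):

```python
import numpy as np
from redun import task

redun_namespace = "example"

@task()
def make_embeddings(shard: int) -> np.ndarray:
    # Returns a large in-memory object. The scheduler unpickles the
    # result, re-pickles it, and uploads it to the backend, one result
    # at a time.
    return np.random.rand(1_000_000, 128)

@task()
def main(n_shards: int = 50) -> list:
    # Every one of these large results flows through the scheduler.
    return [make_embeddings(i) for i in range(n_shards)]
```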

Maybe this part of the telemetry-collection process could be moved to the execution node, running after the task completes but before the node terminates? The scheduler would then pass around references instead of handling the large objects itself.

The current workaround we're using is to save these large objects to Files in S3 and return the File objects, roughly as sketched below. (While we're at it, we pick serialization formats better suited to each object than pickle.) Another possible enhancement to redun would be to support this pattern more directly, so users don't have to write the serialize / pick-an-S3-key / upload / download / deserialize boilerplate every time.
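A rough illustration of the workaround (the bucket path is hypothetical, and this assumes `File.open` supports binary modes):

```python
import numpy as np
from redun import task, File

redun_namespace = "example"

@task()
def make_embeddings_file(shard: int) -> File:
    # Write the large array to S3 ourselves, using a format better
    # suited to arrays than pickle, and return only the File handle.
    out = File(f"s3://my-bucket/embeddings/shard_{shard}.npy")
    arr = np.random.rand(1_000_000, 128).astype(np.float32)
    with out.open("wb") as f:
        np.save(f, arr)
    return out  # small object: the scheduler only sees a path + hash

@task()
def consume(embeddings: File) -> int:
    # Downstream tasks open the File wherever they execute.
    with embeddings.open("rb") as f:
        arr = np.load(f)
    return arr.shape[0]
```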

Happy to provide more details if there is interest.

Cheers!