spark-root/laurelin

Make partitions according to a specified number of events.

lgray opened this issue · 2 comments

lgray commented

As discussed in the Friday Coffea meeting on the 23rd:

  • A Spark partition per basket is probably not the best way forward.
  • Similarly, a Spark partition per file probably introduces too much skew.

It would be reasonable to implement a partitioning scheme that tries to fill each partition with a set number of events (e.g. 200k), to avoid skew later on and to start the processing task with a reasonable degree of parallelism.
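A minimal sketch of what such event-count-based partition planning could look like, assuming we already know the number of entries in each input file (the function name `plan_partitions`, the `target` parameter, and the `(filename, start, stop)` range tuples are all illustrative, not part of laurelin's actual API):

```python
def plan_partitions(file_entries, target=200_000):
    """Greedily pack (filename, num_entries) pairs into partitions holding
    roughly `target` events each, splitting a file across partitions when
    it is larger than the remaining space.

    Returns a list of partitions; each partition is a list of
    (filename, start, stop) half-open entry ranges.
    """
    partitions, current, filled = [], [], 0
    for name, n_entries in file_entries:
        start = 0
        while start < n_entries:
            # Take as many events as fit into the current partition.
            take = min(n_entries - start, target - filled)
            current.append((name, start, start + take))
            filled += take
            start += take
            if filled == target:
                partitions.append(current)
                current, filled = [], 0
    if current:  # flush the final, possibly under-filled partition
        partitions.append(current)
    return partitions

# Example: a 300k-event file and a 150k-event file with a 200k target
# yield three partitions, the last one under-filled.
plan = plan_partitions([("a.root", 300_000), ("b.root", 150_000)])
```

In practice the entry counts would come from the ROOT file metadata, and the ranges would then be mapped onto basket boundaries; this sketch only shows the greedy packing step.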

Thanks for writing this down so it's on "the list" -- I'd like #47 implemented first before any more config options get added. I don't want a situation where the only way to find out about the options and what they do is by reading the source.

lgray commented

Here are the code changes that implement splitting by event count:
lgray@989e826

I'll keep it up to date with master.