Make partitions according to a specified number of events.
lgray opened this issue · 2 comments
lgray commented
As discussed in 23 Friday Coffea meeting:
- A spark partition per basket is probably not the best way forward
- Similarly a spark partition per file probably has too much skew.
It would be reasonable to implement a partitioning that tries to fill a set number of events (like 200k) per partition, to avoid later skew and start off with a reasonable parallelization of the processing task.
PerilousApricot commented
ty for writing this down so it's on "the list" -- I'll want #47 implemented first before any more config options get added. I don't want the situation where the only way to find out about options and what they do is by reading the source
lgray commented
Here's the code changes the implement splitting by event:
lgray@989e826
I'll keep it up to date with master.