DistributeFilesDataset: _distribute_evenly_by_size suboptimal for multi-gpu sharding
michelwi opened this issue · 8 comments
I observed suboptimal behavior of `_distribute_evenly_by_size` in `DistributeFilesDataset` when using the `distrib_shard_files` option.
Let the number of equally sized HDF files be 4x7+1 = 29 and the number of GPUs be 8. Then the `avg_size_per_sub_epoch` is 29/8 = 3.625.
Now, every time, 3 files are added to a bin (because 3 is smaller than the average), and then a 4th file is added because 4 is closer to 3.625 than 3 is.
This now leaves me with 7 GPUs that each have 4 HDF files and one GPU with just 1 file.
And once the last GPU finishes its single HDF file, all other GPUs stop processing the remaining files, and out of the 29 files only 8 have been processed.
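For illustration, here is a minimal, self-contained sketch of the greedy fill as I understand it from the behavior above. The function name and the exact decision rule are my own reconstruction, not the actual RETURNN code: a file goes into the current bin as long as that brings the bin size closer to a fixed target of total size / number of bins.

```python
from typing import List


def greedy_fill_fixed_avg(file_sizes: List[int], num_bins: int) -> List[List[int]]:
    """Greedy fill with a fixed target bin size (reconstruction of the described behavior)."""
    avg = sum(file_sizes) / num_bins  # fixed target: 29 / 8 = 3.625 in the example
    bins: List[List[int]] = [[]]
    cur_size = 0
    for size in file_sizes:
        # Start a new bin if adding this file would move the current bin further from the target.
        if len(bins) < num_bins and cur_size > 0 and abs(cur_size + size - avg) >= abs(cur_size - avg):
            bins.append([])
            cur_size = 0
        bins[-1].append(size)
        cur_size += size
    return bins


print([len(b) for b in greedy_fill_fixed_avg([1] * 29, num_bins=8)])
# -> [4, 4, 4, 4, 4, 4, 4, 1]: seven shards with 4 files each, one shard with a single file
```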
I think you have thought about this much longer than I have. But if I were to propose some solutions:
- use a more complicated (than linear) algorithm to fill the bins
- update the `avg_size_per_sub_epoch` after each bin is filled with the average over the remaining files and bins
- extend the implementation to be sharding-aware and make sure the size variations between GPU shards are as small as possible, while allowing for larger differences between different sub-epochs
What is your partition epoch? Let's assume it is 1 here.
You are right, the current algorithm is really not optimal for this case. I think it should be changed. It's a bit unfortunate that we need to change it now, as there are already a couple of existing setups out there, and this now would change their behavior. Or we make it an option, but then we need to maintain two separate implementations, and it would also be ugly. Maybe it's still ok to do such a change now, as there are maybe not so many existing setups yet.
Btw, check `test_DistributeFilesDataset_distribute_evenly_by_size`, where I test a couple of more extreme cases. We should add the case you described there. But also, when changing the algo, we should make sure that the other examples there don't end up with very bad distributions.
> use a more complicated (than linear) algorithm to fill the bins
I'm not sure this is really necessary.
But also, I wanted to avoid changing the order of the files again here, because that would add some bias to certain file orderings, and then the file order shuffling would not really be random anymore.
And also, I thought it would be fine anyway even if it's not totally optimal. The file caching is maybe not optimal, and we might see some data more often than other data (although this is totally random).
In any case, do you have a good suggestion? Note that our number of files is usually much larger (a couple of thousands), and this dataset is intended for the really large-scale case, i.e. even millions of files should not be a problem. I was anyway thinking that maybe we should make our files smaller, such that the distribution becomes simpler. And I think this problem belongs to the class of bin packing problems, which is usually NP-complete, i.e. an optimal solution is not feasible. I guess there must be some existing solutions to exactly this problem, as other people have likely run into the same problem. E.g. this question, or this.
> update the `avg_size_per_sub_epoch` after each bin is filled with the average over the remaining files and bins
I prefer this variant. I think I also thought at some point about this. Not sure why I did not do this. This should be quite simple and solve the problem, right?
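As a rough sketch of this variant (again hypothetical code, not the actual `_distribute_evenly_by_size`): whenever a bin is closed, the target size is recomputed as the remaining total size divided by the number of remaining bins.

```python
from typing import List


def greedy_fill_running_avg(file_sizes: List[int], num_bins: int) -> List[List[int]]:
    """Greedy fill where the target size is recomputed after each bin (sketch of the proposed fix)."""
    bins: List[List[int]] = [[]]
    remaining = sum(file_sizes)
    avg = remaining / num_bins
    cur_size = 0
    for size in file_sizes:
        if len(bins) < num_bins and cur_size > 0 and abs(cur_size + size - avg) >= abs(cur_size - avg):
            # Close the current bin and recompute the target over the remaining files and bins.
            remaining -= cur_size
            avg = remaining / (num_bins - len(bins))
            bins.append([])
            cur_size = 0
        bins[-1].append(size)
        cur_size += size
    return bins


print([len(b) for b in greedy_fill_running_avg([1] * 29, num_bins=8)])
# -> [4, 4, 3, 4, 3, 4, 3, 4]: bin sizes differ by at most one file
```

On the 29-files / 8-GPUs example above, this yields shards of 3 or 4 files instead of 7x4 + 1x1, while still keeping the file order and a single linear pass.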
> extend the implementation to be sharding-aware and make sure the size variations between GPU shards are as small as possible, while allowing for larger differences between different sub-epochs
I'm not sure that there is really a point in doing this. In your example with partition epoch 1, the only distribution is anyway over the shards (GPUs), not over the sub-epochs. So we get back to the problem that we need an algorithm which evenly distributes the files over a number of bins (here: shards / GPUs). That's exactly what `_distribute_evenly_by_size` is trying to do (without reordering the files).
> What is your partition epoch? Let's assume it is 1 here.
Yes, I just wanted to make the example easier. In the real case, I had selected a probably too large value for the partition epoch while I was tuning.
> And I think this problem belongs to the class of bin packing problems, which is usually NP-complete
> [...]
> I prefer [update the `avg_size_per_sub_epoch` after each bin is filled with the average over the remaining files and bins]. I think I also thought at some point about this. Not sure why I did not do this. This should be quite simple and solve the problem, right?
Yes, agreed to everything.
@NeoLegends your thoughts? Will you work on this or should I?
I don't have much to add right now, except that I'll try to take inspiration from other heuristic bin-packing libraries to see whether our behavior in extreme cases can be improved. I also think that having to keep the files in order is quite limiting for us (especially if we want to stay linear), but I agree it's important for randomization.
I'll work on it!
> I don't have much to add right now, except that I'll try to take inspiration from other heuristic bin-packing libraries to see whether our behavior in extreme cases can be improved.
Why? What about the simple approach suggested here? It's very simple to do. Doesn't this work already? If it works, I think this is already fine. Check whether `test_DistributeFilesDataset_distribute_evenly_by_size` is still reasonable, and maybe add some more cases there, including the example by @michelwi here.
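As an illustration of what such a test case could look like, written against the hypothetical `greedy_fill_running_avg` sketch above (the test name is illustrative, not the actual RETURNN test helper):

```python
def test_distribute_michelwi_case():
    """29 equally sized files over 8 shards: no shard should be starved."""
    bins = greedy_fill_running_avg([1] * 29, num_bins=8)
    sizes = [len(b) for b in bins]
    assert len(bins) == 8
    assert sum(sizes) == 29  # all files are assigned
    assert max(sizes) - min(sizes) <= 1  # shard sizes differ by at most one file
```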
The algorithm should not become more complicated. Also not more costly. Also reorderings must not be allowed.
The simple approach is the first thing I'll implement.
I think this can now be closed.