How many partitions should I choose?
pjgao opened this issue · 3 comments
I have 47 GB of RAM and 24 CPUs. How many partitions should I choose for this dataset?
The number of partitions has to be determined by experiment: you'll have to try different values to see which works best. That said, some general principles are:
- The number of partitions should be at least as many as the number of workers.
- Each partition must be small enough to fit in memory on a single worker.
- More partitions means less variability in the time it takes to compute each one.
- The number of partitions should be a multiple of the number of workers.
We found that 104 worked well for our hardware, but there is no guarantee that it's optimal.
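As a starting point, here is a minimal sketch (not from this thread) of one way to pick a partition count that follows the guidelines above: a small multiple of the number of workers, with each partition small enough to fit in worker memory. The dataframe contents and the multiplier of 4 are illustrative assumptions, not recommendations from the maintainers.

```python
# Sketch: choose a partition count as a multiple of the number of workers.
import multiprocessing

import dask.dataframe as dd
import pandas as pd

n_workers = multiprocessing.cpu_count()  # e.g. 24 on the hardware described above
n_partitions = n_workers * 4             # a multiple of the worker count (assumed multiplier)

# Stand-in data; replace with your own dataset.
df = pd.DataFrame({"id": range(1_000_000), "value": 1.0})

ddf = dd.from_pandas(df, npartitions=n_partitions)
print(ddf.npartitions, "partitions of roughly", len(df) // n_partitions, "rows each")
```

You would then check that each partition comfortably fits in the memory available to a single worker before scaling up.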
As you say, I could try 24, 48, 72, 96, 120, 144, ... partitions, but searching for the best value would take a long time.
So, can this still save much time compared to not using Dask?
There's not much to be gained from trying all those different values. Once you have the features from one run, you don't need to repeat the run with a different number of partitions. The only benefit would be finding the optimal number of partitions for applying this solution to additional datasets; however, the optimal number will change for each dataset.
My advice would be to do the calculation once and not worry too much about whether you have used the "best" number of partitions unless you are curious (we'd like to hear what works best if you try this).
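If you are curious, a hypothetical way to compare a few partition counts is a simple timing loop like the one below. The aggregation here is only a stand-in for the real feature calculation, and the specific multiples tested are assumptions.

```python
# Sketch: time the same computation under a few partition counts
# that are multiples of the worker count.
import multiprocessing
import time

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"key": range(1_000_000), "value": 1.0})
n_workers = multiprocessing.cpu_count()

for multiple in (1, 2, 4):
    n_partitions = n_workers * multiple
    ddf = dd.from_pandas(df, npartitions=n_partitions)

    start = time.perf_counter()
    ddf.groupby(ddf.key % 100)["value"].sum().compute()  # placeholder workload
    elapsed = time.perf_counter() - start

    print(f"{n_partitions} partitions: {elapsed:.2f} s")
```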
In our tests (using an earlier version of Featuretools), the Dask approach shortened the run time from 25 hours to less than 3 hours on a personal laptop. Your results may vary, and Featuretools has gotten much faster since that benchmark was run.