Memory management
First of all, thanks! I find this project fascinating. My question/issue is about how you handle memory across multiple processes. By default, Python creates a copy of the data per process, which is prohibitive for large datasets.
How did you manage this problem?
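For concreteness, here is a minimal sketch of what I mean (illustrative only, not anything from xcessiv; `fit_one` and the array sizes are made up): each task receives its own pickled copy of `X`, so peak memory grows with the number of concurrent workers.

```python
import numpy as np
from multiprocessing import Pool

def fit_one(args):
    # X arrives here as a pickled copy in every worker invocation,
    # so peak memory grows with the number of concurrent workers.
    X, alpha = args
    return float((X * alpha).mean())  # stand-in for an actual model fit

if __name__ == "__main__":
    X = np.random.rand(50_000, 200)  # ~80 MB of float64 data
    with Pool(processes=4) as pool:
        results = pool.map(fit_one, [(X, a) for a in (0.1, 0.5, 1.0, 2.0)])
    print(results)
```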
Hi, this problem isn't managed at all. It is your responsibility to run only as many workers as will fit in your memory. For big datasets, this is usually 1.
I don't see this as a problem: if you were doing hyperparameter optimization on a big dataset in Jupyter, you wouldn't run multiple sklearn processes either.
Thanks for your quick answer. I have to note that a tool like this should make efficient use of resources, given the computational expense of hyper-parameter search. So the ideal scenario would be for the data to be shared by all processes, unless the communication cost between workers and a central node is too high, which is not the case for a single-node configuration.
Some model implementations in sklearn do not exploit parallelism (I have numpy with OpenBLAS, but they still use only one core; please let me know if I am wrong). This is why I find myself building pipelines in parallel, and sometimes I can't afford more than one copy of the data in memory.
I will put an example up as soon as I get the time.
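In the meantime, a rough sketch of the kind of sharing I have in mind, using joblib's automated memory-mapping of large numpy arrays (illustrative only; `fit_one`, the array sizes, and the parameters are assumptions, and none of this is tied to xcessiv's internals):

```python
import numpy as np
from joblib import Parallel, delayed

def fit_one(X, alpha):
    # With joblib's automated memmapping, X arrives here as a read-only
    # np.memmap backed by a single on-disk copy, not a per-worker pickle.
    return float((X * alpha).mean())  # stand-in for an actual model fit

if __name__ == "__main__":
    X = np.random.rand(50_000, 200)  # ~80 MB of float64 data
    # Arrays larger than max_nbytes are dumped once and memory-mapped
    # into every worker process instead of being copied.
    results = Parallel(n_jobs=4, max_nbytes="1M")(
        delayed(fit_one)(X, a) for a in (0.1, 0.5, 1.0, 2.0)
    )
    print(results)
```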
You are correct, but this is not really a tool to replace sklearn. If sklearn's implementation does not exploit parallelism, running it in xcessiv will not either (unless you write your own algorithms, which xcessiv allows).
Any special memory management would have to play safe with the arbitrary code that users run for loading data and fitting estimators, and I'm not sure how feasible that is. There may be cases where an algorithm must modify the data in place, and that would not play well if other processes were using the same data.
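For illustration, a minimal sketch of that hazard in plain numpy (nothing xcessiv-specific; `careless_fit` is a made-up stand-in for any routine that mutates its input):

```python
import numpy as np

# Illustrative only: why sharing one buffer across workers is unsafe
# when an algorithm mutates its input in place.
X = np.random.rand(1_000, 10)

def careless_fit(X):
    X -= X.mean(axis=0)   # in-place centring mutates the caller's buffer
    return X.std(axis=0)

careless_fit(X)           # every later reader of X now sees centred data

# A read-only view (similar in spirit to joblib's memory-mapped arrays)
# turns the silent corruption into an immediate error instead:
X_ro = X.view()
X_ro.flags.writeable = False
try:
    careless_fit(X_ro)
except ValueError as exc:
    print(exc)            # numpy refuses to write to the read-only buffer
```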