Parasol is a lightweight distributed computation framework designed for many machine learning problems: SVD, MF (BFGS, SGD, ALS, CG), LDA, Lasso, and others.
First, parasol splits both the massive dataset and the massive parameter space. Unlike MapReduce-like systems, parasol offers a simple communication model that lets you work with a global, distributed key-value store called the parameter server.
When using parasol, you build algorithms by following one rule: 'pull parameters before learning | push parameter updates after learning'. This is a much simpler model than MPI and makes the move from serial to parallel code almost painless.
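The pull/push rule can be illustrated with a toy, single-process sketch. This is not parasol's API: `ParameterStore`, `pull`, `push` and `worker_step` are hypothetical names, and a plain dict stands in for the distributed parameter server.

```python
# Toy stand-in for a parameter server: a dict living in one process.
# All names here are hypothetical and only illustrate the pull/push rule.

class ParameterStore:
    def __init__(self):
        self._table = {}

    def pull(self, key, default=0.0):
        """Read the current value of a parameter."""
        return self._table.get(key, default)

    def push(self, key, delta):
        """Apply an additive update to a parameter."""
        self._table[key] = self._table.get(key, 0.0) + delta


def worker_step(store, samples, lr=0.1):
    """One gradient step fitting y = w * x on a worker's local data shard."""
    w = store.pull("w")                          # pull parameters before learning
    grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
    store.push("w", -lr * grad)                  # push the update after learning


store = ParameterStore()
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]   # data split across two "workers"
for _ in range(200):
    for shard in shards:
        worker_step(store, shard)

w_final = store.pull("w")   # approaches 2.0, the slope that generated the data
```

Each worker only ever touches its own shard and the shared store, which is why the serial version of the loop carries over to the parallel setting with little change.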
Second, parasol tries to solve 'the last-reducer problem' of iterative tasks, where every iteration waits for the slowest worker. We use bounded staleness and look for a sweet spot between the 'improve-iter' curve (improvement per iteration) and the 'iter-sec' curve (iterations per second). A global scheduler takes charge of the asynchronous work. This method has already been shown by CMU to be a generalization of BSP/Pregel.
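A minimal sketch of the bounded-staleness idea follows. It is a single-process simulation, not parasol's scheduler: `StalenessScheduler`, `can_advance`, `finish_iteration` and `simulate` are hypothetical names. A worker may start its next iteration only if it is at most `staleness` iterations ahead of the slowest worker, so `staleness=0` reduces to lockstep (BSP-style) execution.

```python
# Hypothetical bounded-staleness scheduler, simulated in one process.

class StalenessScheduler:
    def __init__(self, nworker, staleness):
        self.clocks = [0] * nworker      # completed iterations per worker
        self.staleness = staleness

    def can_advance(self, worker):
        """True if `worker` is at most `staleness` iterations ahead of the slowest."""
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def finish_iteration(self, worker):
        self.clocks[worker] += 1


def simulate(staleness, ticks=30):
    """Worker 0 can finish an iteration every tick; worker 1 only every third tick."""
    sched = StalenessScheduler(nworker=2, staleness=staleness)
    for tick in range(ticks):
        if sched.can_advance(0):
            sched.finish_iteration(0)
        if tick % 3 == 2 and sched.can_advance(1):
            sched.finish_iteration(1)
    return sched.clocks


bsp_clocks = simulate(staleness=0)   # fast worker waits for the slow one every iteration
ssp_clocks = simulate(staleness=2)   # fast worker may run a bounded distance ahead
```

With a larger `staleness` the fast worker idles less but reads older parameters, which is the trade-off between the two curves mentioned above.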
Parasol is implemented in Python and was originally motivated by Jeff Dean's talk at Stanford in 2013. You can find more details in his paper: "Large Scale Distributed Deep Networks".
Since 'more data is always helpful', parasol lets you handle more data and get better performance.
Have Fun!
You must install ZeroMQ and mpi4py in advance.
ZeroMQ is a high-performance asynchronous messaging library aimed at use in scalable distributed or concurrent applications.
mpi4py is a Python package (available on PyPI) that provides bindings for the Message Passing Interface (MPI) standard.
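One quick way to check both prerequisites before installing is sketched below. It assumes that ZeroMQ is used through the pyzmq binding, whose import name is `zmq`; `missing_dependencies` is a hypothetical helper, not part of parasol.

```python
import importlib.util


def missing_dependencies(modules=("zmq", "mpi4py")):
    """Return the names of top-level modules that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_dependencies()
if missing:
    print("Missing dependencies: " + ", ".join(missing))
else:
    print("ZeroMQ (pyzmq) and mpi4py are both importable.")
```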
$ python setup.py install --prefix=xxx
Parasol contains only a limited set of algorithms so far (Logistic Regression, Matrix Factorization, Word Count).
To write your own algorithm in parasol, you must:
I. Write a subclass that inherits from the 'paralg' class.
II. Write an entry script for your algorithm.
III. Write a JSON config file, which must contain "nworker" (the number of workers doing the computation) and "nserver" (the number of servers providing the parameter service).
IV. Run your algorithm with 'run_parasol.py':
$ ./run_parasol.py --config xxx/alg_cfg.json python xxx/entry.py
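For step III, a config file with the two required keys could be generated as sketched below. The values 4 and 2 are arbitrary examples, `check_config` is a hypothetical helper, and the assumption that both values are positive integers is mine; the file is written to a temporary directory rather than a real project path.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("nworker", "nserver")


def check_config(cfg):
    """Raise ValueError unless every required key holds a positive integer."""
    for key in REQUIRED_KEYS:
        value = cfg.get(key)
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError("config needs a positive integer for %r" % key)
    return cfg


# Example values only: 4 workers for computation, 2 parameter servers.
config = check_config({"nworker": 4, "nserver": 2})

cfg_path = os.path.join(tempfile.mkdtemp(), "alg_cfg.json")
with open(cfg_path, "w") as f:
    json.dump(config, f, indent=2)
```

The resulting file is what would be passed to the --config option in step IV.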
The logo for parasol is really cool; you can draw it in a single stroke:
(0.5,1) -> (0, 0.5) -> (1,0.5) -> (0.5, 1) -> (0.5, 0.25) -> (0.25, 0.25)
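The stroke can be rendered as a small SVG, as in the sketch below. It assumes the coordinates use a y axis pointing up (so the canopy is on top); `logo_svg` is a hypothetical helper, and the stroke width and margins are arbitrary choices.

```python
# The six points of the one-stroke logo, in drawing order.
STROKE = [(0.5, 1), (0, 0.5), (1, 0.5), (0.5, 1), (0.5, 0.25), (0.25, 0.25)]


def logo_svg(points=STROKE, size=100):
    """Return an SVG document drawing the points as a single polyline."""
    # SVG's y axis points down, so y is flipped to keep the canopy on top.
    coords = " ".join(
        "{:g},{:g}".format(x * size, (1 - y) * size) for x, y in points
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'viewBox="-5 -5 {0} {0}">'.format(size + 10)
        + '<polyline points="{}" fill="none" stroke="black" '
          'stroke-width="3"/></svg>'.format(coords)
    )


svg = logo_svg()
```

Writing `svg` to a file with an .svg extension gives an image that can be opened in a browser.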
Since Python is slow, I am now writing a C++ version called Paracel.
If you are using parasol/paracel, let me know.
For any bugs or related problems, just ping me: wuhong@douban.com.