pyParade
pyParade is a lightweight multiprocessing framework for Python that makes it easy to process large-scale datasets efficiently in parallel. To install pyParade, simply run:
pip install pyparade
Automatic parallelization
Similar to Spark, you define at a high level how your data should be processed, and pyParade does the parallelization automatically for you. With pyParade, you don't need to explicitly delegate the work to multiple threads or processes, but you still benefit from the full CPU power available on your machine.
Create your processing chain
You model how the result is calculated by applying an operation to a dataset. The result is a new dataset that you can apply further operations to:
dataset = pyparade.Dataset(range(0, 1000))
result = dataset.map(f).group_by_key().map(g).collect()
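Here is a runnable sketch of the chain above. The functions f and g are hypothetical placeholders, and it is assumed that group_by_key hands each (key, values) group to the following map step as a single argument:

import pyparade

def f(x):
    # hypothetical mapper: key each number by its remainder mod 10
    return (x % 10, x)

def g(group):
    # hypothetical per-group step: sum all values that share a key
    key, values = group
    return (key, sum(values))

dataset = pyparade.Dataset(range(0, 1000))
result = dataset.map(f).group_by_key().map(g).collect()
print(result)  # ten (key, total) pairs, one per remainder class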
Watch the progress
By default, pyParade displays detailed status information about the running process on the console. It tells you how much data has already been processed and calculates estimated completion times.
Parallel Process
================================================================================
Numbers (buffer: 725420)                                                 1000000
  calculate  20.129000%  ETA 2018-03-24 17:55                     201290/1000000
Key/Value pairs (buffer: 4846)
  take value                                                        71790/137591
Values (buffer: 0)
  sum                                                                  748/30831
Sum (result)                                                                   0
Low memory use
Instead of first loading the complete input data into memory and then processing it, pyParade only loads small portions of the data into memory at a time and waits until they are processed before loading more. This is implemented by maintaining a buffer between every processing step, similar to a production chain in a factory.
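The underlying idea can be illustrated in plain Python with a bounded queue serving as the buffer between two processing steps. This is a conceptual sketch of the buffering principle, not pyParade's actual implementation:

from multiprocessing import Process, Queue

BUFFER_SIZE = 100  # at most 100 items sit in memory between the two steps

def produce(buffer):
    # Step 1: load data in small portions; put() blocks while the buffer is full
    for item in range(1_000_000):
        buffer.put(item)
    buffer.put(None)  # sentinel: signals that no more data is coming

def consume(buffer):
    # Step 2: process items as they arrive, keeping memory use constant
    total = 0
    while (item := buffer.get()) is not None:
        total += item
    print(total)

if __name__ == "__main__":
    buffer = Queue(maxsize=BUFFER_SIZE)
    producer = Process(target=produce, args=(buffer,))
    producer.start()
    consume(buffer)
    producer.join()

Because put() blocks once the buffer is full, the producing step can never run far ahead of the consuming step, which is what keeps memory use low.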
Access databases using context
In pyParade, worker processes that execute an operation can have a context. When running a map operation, for example, you can provide a function that returns a contextmanager (see the Python docs), which is entered once for each worker process that is spawned. You can use this, for example, to open a database connection when a worker spawns and close it when the worker is stopped, so that each worker process maintains exactly one permanent database connection.
For example
result = dataset.map(f, context = get_db_connection)
will call the map function f with the context provided by get_db_connection() as an additional argument, which can be used inside the function to access the database.
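A complete sketch of such a context function, using SQLite as a stand-in database. Here get_db_connection, the database file, and the query are illustrative placeholders, and it is assumed that the object yielded by the context manager (the connection) is what f receives as its additional argument:

import sqlite3
from contextlib import contextmanager

import pyparade

@contextmanager
def get_db_connection():
    # Entered once per worker process: open exactly one connection ...
    conn = sqlite3.connect("example.db")  # hypothetical database file
    try:
        yield conn
    finally:
        # ... and close it again when the worker is stopped
        conn.close()

def f(value, conn):
    # The yielded connection arrives as the additional argument
    row = conn.execute("SELECT ? * 2", (value,)).fetchone()
    return row[0]

dataset = pyparade.Dataset(range(0, 1000))
result = dataset.map(f, context=get_db_connection).collect()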