
This project is a work in progress. Feel free to contribute.

The goal

Efficient parallelization of pipelined computations on multiple cores with a simple interface.

The interface should look something like:

let data = vec![0, 1, 2];
let mut sum = 0;
pipeline!(4, data
          => (|x| (x * 2, x) => |(x2, x)| x2 - x,
              |x| -x)
          => |x| sum += x);

In this example, the tasks to be done are:

  • Iterate over the elements in data
  • Clone each element and pass it to both the |x| (x * 2, x) and |x| -x closures.
  • Apply the output of |x| (x * 2, x) to |(x2, x)| x2 - x
  • Sum all outputs of |(x2, x)| x2 - x and |x| -x into the sum variable.

This constructs a graph in which each node is a closure. Data flows between the closures and gets processed. Except for the first and the last nodes in this example (the iteration and the sum nodes), all nodes are completely parallelizable. The pipeline! macro will instantiate a scheduler which will use 4 threads to run all nodes and maximize computation throughput.
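To make the data flow concrete, here is a sequential sketch of what the pipeline above computes (the real pipeline! macro would run these stages concurrently on 4 worker threads; this just traces the values):

```rust
fn main() {
    let data = vec![0, 1, 2];
    let mut sum = 0;
    for x in data {
        // Fanout: each element is cloned into both branches.
        let (x2, x1) = (x * 2, x); // |x| (x * 2, x)
        sum += x2 - x1;            // |(x2, x)| x2 - x
        sum += -x;                 // |x| -x
    }
    // For this input the two branches cancel: (2x - x) + (-x) == 0.
    assert_eq!(sum, 0);
    println!("{}", sum);
}
```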

Current state

See the integration tests

TODO

Potential bugs

  • Make sure usage of atomics is correct

Interface

  • Support fanout in macro
  • Support multiple producers in macro
  • Support thread num in macro
  • Make it easier to pass producers to schedule() (why not consume? maybe if we want to stop before the end?)
  • Support adding priority for nodes
  • Support marking nodes as running on external HW (e.g. GPU)
  • Support stateless producers? Rayon-style splittable iterators?
  • Implement Fanout for any iterable of consumers

Optimization

  • When a local thread's queue overflows, dump half of its tasks to the global queue
  • Make the circus size a power of two, so instead of using modulus we can use & with a mask
  • Use a custom allocator for the tasks
  • Inline some funcs or enable automatic inlining
  • A LOT of time is spent on creating tasks and moving them around. If we could make them not dyn (enum?) we could save both the dynamic dispatch and allocation.
  • Circus reads and writes take time. We could probably optimize that.
  • Designate cores for threads
  • When running short tasks, threads spend significant time synchronizing pushing and popping from the task queue. The usual solution for that would be to implement work stealing.
  • Stateful consumers are held inside a mutex, so threads might spend time blocking. A possible improvement is to modify the consume interface to return a boolean and use try_lock when executing functions behind a mutex. This might be less of a problem once we implement work stealing. Also, we might want to use a RefCell instead of a mutex.
  • Tasks contain a pointer to a node, which is wasteful: the pointer is 64 bits while we usually need only a few bits to identify a node. We could use an id instead of a reference, or collect tasks per node. Collecting tasks per node might allow us to further optimize the running of stateful tasks.
  • We want all cores to always work. There could be a situation in which there are many tasks, but all are for a specific Stateful node, so they can't be parallelized. For that reason, stateful nodes should always be prioritized.
  • Smartly manage the balance between having many ready tasks and memory usage, by knowing which nodes produce more work and which nodes consume more work, and prioritizing them according to current workload.
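The power-of-two idea above rests on a standard identity: when the capacity is a power of two, `index % cap` equals `index & (cap - 1)`, which replaces a division with a cheap AND. A minimal sketch (capacity and indices here are illustrative):

```rust
fn main() {
    let cap: usize = 8; // must be a power of two
    assert!(cap.is_power_of_two());
    let mask = cap - 1;
    for index in [0usize, 5, 8, 13, 1_000_003] {
        // The mask keeps exactly the low bits, which is the
        // remainder modulo a power of two.
        assert_eq!(index % cap, index & mask);
    }
    println!("ok");
}
```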
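The try_lock idea for stateful consumers could look roughly like this: instead of blocking on the node's mutex, a worker attempts the lock and reports whether it consumed the item, so the scheduler can requeue the task and do other work. `consume_or_requeue` is a hypothetical helper, not part of the crate:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Try to consume `item` into the stateful node's state without
// blocking. Returns true if consumed, false if the caller should
// requeue the task and move on.
fn consume_or_requeue(state: &Mutex<i64>, item: i64) -> bool {
    match state.try_lock() {
        Ok(mut sum) => {
            *sum += item;
            true
        }
        Err(_) => false, // mutex is busy; don't block
    }
}

fn main() {
    let state = Arc::new(Mutex::new(0));
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                // A real scheduler would requeue and pick up other
                // tasks; this sketch just retries.
                while !consume_or_requeue(&state, i) {}
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*state.lock().unwrap(), 0 + 1 + 2 + 3);
}
```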

Testing

  • Performance regression tests
  • CI/CD
  • Add benchmark that can be optimized using work-stealing
  • Benchmark two producers