Lace is a C framework for fine-grained fork-join parallelism intended for scientific computations on multi-core computers.
```c
#include <stdio.h>

#include <lace.h>

// A Lace task computing the n-th Fibonacci number with fork-join parallelism.
TASK_1(int, fibonacci, int, n)
{
    if (n < 2) return n;
    SPAWN(fibonacci, n - 1);          // offer fibonacci(n-1) to other workers
    int a = CALL(fibonacci, n - 2);   // execute fibonacci(n-2) directly
    int b = SYNC(fibonacci);          // get the result of the spawned task (or run it now)
    return a + b;
}

int main(int argc, char **argv)
{
    int n_workers = 4;
    lace_start(n_workers, 0);         // 4 workers, default task queue size
    int result = RUN(fibonacci, 42);  // offer the root task to the framework
    printf("fibonacci(42) = %d\n", result);
    lace_stop();
    return 0;
}
```
Feature | Description |
---|---|
Low overhead | Lace uses a scalable double-ended queue for its implementation of work-stealing, which is wait-free for the thread spawning tasks and lock-free for the threads stealing tasks. The design of the data structure minimizes interaction between CPUs. |
Suspending | Lace threads can be manually suspended when the framework is not in use, to reduce CPU usage. This is useful when part of a computation is not parallelized, since Lace workers busy-wait for work. |
Interrupting | Lace threads can be (cooperatively) interrupted to execute another task first. This is, for example, used by Sylvan to perform garbage collection. |
Lace is licensed under the Apache 2.0 license.
The main repository of Lace is https://github.com/trolando/lace. The main author of Lace is Tom van Dijk who can be reached via tom@tvandijk.nl. Please let us know if you use Lace in your projects and if you need features that are currently not implemented in Lace.
Lace requires a modern compiler supporting C11.
Optionally, Lace can use hwloc (`libhwloc-dev`) to pin workers and allocate memory on the correct CPUs/memory domains on NUMA systems.
It is recommended to build Lace in a separate build directory:
```
mkdir build
cd build
cmake ..
make
```
Lace is typically used as a subproject, for example using the FetchContent or ExternalProject feature of CMake.
Lace can be configured with the following CMake settings:
Setting | Description |
---|---|
`LACE_BUILD_TESTS` | Build the testing programs (not when built as a subproject) |
`LACE_BUILD_BENCHMARKS` | Build the included set of benchmark programs (not when built as a subproject) |
`LACE_USE_MMAP` | Use `mmap` to allocate memory instead of `posix_memalign` |
`LACE_USE_HWLOC` | Use the hwloc library to pin threads to CPUs |
`LACE_COUNT_TASKS` | Let Lace record the number of executed tasks |
`LACE_COUNT_STEALS` | Let Lace count how often tasks were stolen |
`LACE_COUNT_SPLITS` | Let Lace count how often the queue split point was moved |
`LACE_PIE_TIMES` | Let Lace record precise overhead times |
Ideally, `LACE_USE_MMAP` is set to let Lace allocate a large amount of virtual memory for the task queues instead of real memory. Real memory is only allocated by the OS when required, thus in most use cases this minimizes the memory overhead of Lace. If `LACE_USE_MMAP` is not set, then real memory is allocated using `posix_memalign`, and a more conservative queue size should be chosen when invoking `lace_start`.
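For illustration, a minimal sketch of starting the framework with a deliberately small queue when `LACE_USE_MMAP` is not set; the worker count and queue size are only illustrative:

```c
#include <lace.h>

int main(void)
{
    // Without LACE_USE_MMAP, the queues are backed by real memory immediately:
    // 8 workers * 10000 tasks * 64 bytes per task is roughly 5 MB for all queues.
    lace_start(8, 10000);
    // ... offer work with RUN(...) ...
    lace_stop();
    return 0;
}
```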
There are two versions of Lace:
- The standard version `lace`, consisting of `lace.h` and `lace.c`, uses 64 bytes per task and supports at most 6 parameters per task.
- The extended version `lace14`, consisting of `lace14.h` and `lace14.c`, uses 128 bytes per task and supports at most 14 parameters per task.
Start the Lace framework using the `lace_start(unsigned int n_workers, size_t dqsize)` function.
This creates `n_workers` new threads that immediately start busy-waiting for work. Each thread allocates its own task queue for `dqsize` tasks. The entire queue is preallocated, requiring 64 bytes per task (or 128 bytes for `lace14`).

- When `n_workers` is set to 0, Lace automatically detects the maximum number of workers for the system using `lace_get_pu_count()`.
- When `dqsize` is set to 0, the default is used, which is currently 100000 tasks.
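As a hedged sketch, starting the framework with automatic worker detection and the default queue size might look as follows; it assumes `lace_get_pu_count()` returns the detected number of processing units:

```c
#include <stdio.h>

#include <lace.h>

int main(void)
{
    // n_workers = 0: auto-detect using lace_get_pu_count(); dqsize = 0: default (100000 tasks).
    lace_start(0, 0);
    printf("detected %u processing units\n", (unsigned)lace_get_pu_count());
    // ... offer work with RUN(...) ...
    lace_stop();
    return 0;
}
```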
Use `lace_stop()` to stop the framework, terminating all workers.

Lace workers busy-wait for tasks to steal, increasing the CPU load to 100%.
Use `lace_suspend` and `lace_resume` from non-Lace threads to temporarily stop the work-stealing framework.
Calls to `lace_start`, `lace_suspend`, and `lace_resume` do not incur much overhead; suspending and resuming typically require at most 1-2 ms.
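As an illustration, a hedged sketch of wrapping a purely sequential phase with suspend and resume from the main (non-Lace) thread; it assumes `lace_suspend()` and `lace_resume()` take no arguments, and the sequential work is a placeholder:

```c
#include <lace.h>

// Placeholder for a part of the computation that is not parallelized.
static void sequential_phase(void)
{
    // ... sequential work ...
}

int main(void)
{
    lace_start(4, 0);

    // ... parallel work via RUN(...) ...

    lace_suspend();        // workers stop busy-waiting, CPU load drops
    sequential_phase();    // sequential part of the computation
    lace_resume();         // workers resume busy-waiting for work

    // ... more parallel work via RUN(...) ...

    lace_stop();
    return 0;
}
```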
Lace tasks are defined using the `TASK_n` macro, where `n` is the number of parameters.
For example, `TASK_1(int, fib, int, n) { ... }` defines a Lace task with an `int` return value and one parameter of type `int` with variable name `n`.
Declaration and implementation can be separated using the `TASK_DECL_n` and `TASK_IMPL_n` macros.
To declare tasks with no return value, use the `VOID_TASK_n` macros, for example, `VOID_TASK_1(do_something, int, n)`.
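For instance, a minimal sketch of separating declaration and implementation; the task names are illustrative, and the `VOID_TASK_DECL_1`/`VOID_TASK_IMPL_1` macros are assumed to mirror their `TASK_DECL_1`/`TASK_IMPL_1` counterparts:

```c
/* In a header: declare the tasks so other compilation units can CALL or SPAWN them. */
TASK_DECL_1(int, fib, int)
VOID_TASK_DECL_1(report, int)

/* In a source file: provide the implementations. */
TASK_IMPL_1(int, fib, int, n)
{
    if (n < 2) return n;
    SPAWN(fib, n - 1);         // fork
    int a = CALL(fib, n - 2);  // execute directly
    return a + SYNC(fib);      // join
}

VOID_TASK_IMPL_1(report, int, n)
{
    printf("fib(%d) = %d\n", n, CALL(fib, n));   // requires <stdio.h>
}
```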
From Lace tasks (running in a Lace thread):

- Use `SPAWN` to create a task and `SYNC` to obtain the result (if stolen) or execute the task (if not stolen).
- Use `CALL` to directly execute a task without putting it in the queue.
- Use `DROP` instead of `SYNC` to not execute a task (unless it was already stolen).
From external functions (not running in a Lace thread):

- Use `RUN` to offer the task to the Lace framework. This call blocks until the task is fully executed (see the sketch below).
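To illustrate these calls together, a minimal sketch of a divide-and-conquer sum over an array; the task name and splitting threshold are only illustrative:

```c
#include <stddef.h>

#include <lace.h>

// Sums array[0..n) by spawning the left half and directly calling the right half.
TASK_2(long, sum_array, long*, array, size_t, n)
{
    if (n < 1024) {                                           // small range: sum sequentially
        long total = 0;
        for (size_t i = 0; i < n; i++) total += array[i];
        return total;
    }
    SPAWN(sum_array, array, n / 2);                           // left half, may be stolen
    long right = CALL(sum_array, array + n / 2, n - n / 2);   // right half, executed directly
    long left = SYNC(sum_array);                              // join with the spawned half
    return left + right;
}
```

From a non-Lace thread, `RUN(sum_array, array, n)` offers the root task to the framework and blocks until the sum is computed.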
See the `benchmarks` directory for examples.
Lace offers two mechanisms to interrupt currently running tasks and run something else:

- The `NEWFRAME` macro, e.g. `NEWFRAME(fib, 40)`, halts the current tasks and offers the `fib` task to the framework.
- The `TOGETHER` macro halts the current tasks and lets all Lace workers execute a copy of the given task.
The `TOGETHER` macro is useful to initialize thread-local variables on each worker.

Interrupting is cooperative. Lace checks for interrupting tasks when stealing work, i.e., during `SYNC` or when idle.
Large tasks can use the `YIELD_NEWFRAME()` macro to manually check for interrupting tasks.
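For long-running tasks, a hedged sketch of cooperatively checking for interrupting tasks; the loop body and check interval are placeholders:

```c
// A long-running task that periodically allows NEWFRAME/TOGETHER tasks to run.
VOID_TASK_1(long_computation, size_t, iterations)
{
    for (size_t i = 0; i < iterations; i++) {
        // ... one chunk of work ...
        if ((i & 0xFFF) == 0) YIELD_NEWFRAME();   // occasionally check for interrupting tasks
    }
}
```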
Lace offers the `lace_barrier` function to let all Lace workers synchronize.
It is typically used in Lace tasks created using the `TOGETHER` macro.
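As a hedged sketch, assuming `TOGETHER` passes its arguments to the given task and `lace_barrier()` takes no arguments, initializing a worker-local buffer on every worker might look as follows; the variable and task names are illustrative:

```c
#include <stdlib.h>

#include <lace.h>

// Worker-local scratch buffer, initialized once on every worker.
static _Thread_local long *worker_buffer = NULL;

VOID_TASK_1(init_worker_buffer, size_t, size)
{
    worker_buffer = (long*)calloc(size, sizeof(long));
    lace_barrier();   // wait until every worker has finished its initialization
}

// Called from within a running Lace task: interrupt current work and run the
// initialization task once on each worker.
VOID_TASK_1(setup_all_workers, size_t, size)
{
    TOGETHER(init_worker_buffer, size);
}
```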
There is currently no direct support for C++ classes, but class methods can be parallelized via C helper functions.
Lace comes with a number of example programs, which can be used to test the performance of Lace.
Many of these benchmark programs have been obtained from benchmark collections of other frameworks such as Cilk, Wool, and Nowa.
After building Lace with `LACE_BUILD_BENCHMARKS` set to `ON`, the `build/benchmarks` directory contains the benchmark programs, as well as the `bench.py` Python script that runs the benchmarks.
Workloads such as `matmul` and `queens` are easy to load balance.
The `fib` workload has a very high number of nearly empty tasks and is therefore a stress test on the overhead of the framework, but it is not very representative of real-world workloads.
The `uts t3l` workload is more challenging, as it involves an unpredictable tree search.
For further details, see the academic publications on Lace mentioned below.
The following two academic publications are directly related to Lace.
T. van Dijk (2016) Sylvan: Multi-core Decision Diagrams. PhD Thesis.
T. van Dijk and J.C. van de Pol (2014) Lace: Non-blocking Split Deque for Work-Stealing. In: Euro-Par 2014: Parallel Processing Workshops. LNCS 8806, Springer.