Experimental ranges library for CUDA. Heavily inspired by the C++ ranges.
Ranges GPU is a header-only library implementing ranges for CUDA. Like the C++ ranges it allows composing lazy views into pipelines that can an appropriate time be materialised. The construction of pipelines happens on the CPU, while the actual computation needed to materialise the result are done on the GPU. The goals are to minimise the number of host-device data transfers and kernel launches and, hopefully, end up with code that can be optimised by the compiler to something not much worse than hand-written solutions (that's definitely not the case at the moment though, see TODO below).
Ranges GPU is a header-only library. The default configuration will build unit tests and microbenchmarks.
The build requirements are as follows:
- CMake 3.10
- GCC 7
- CUDA 10
- Google Test (optional)
- Google Benchmark (optional)
Older versions may work but are not tested.
nvidia/cuda
docker container may be a good place to start.
The build instructions are quite usual for a CMake-based project:
cd <build-directory>
cmake -DCMAKE_BUILD_TYPE=<Debug|Release> -G Ninja <source-directory>
ninja
ninja test
The main two concepts in the Ranges GPU library are actions and views. The former are eager, take a GPU range or a view as an input and produce either a GPU or CPU array or a value. The latter are lazy and may either represent a completely new range or in some transformation of the input range or ranges.
Views can be composed together using operator|
to create more complex pipelines.
Only actions cause data transfers and kernel launches. Views are just lightweight description of what needs to be done when they are materialised by an action.
all_of
– the equivalent ofstd::all_of
. Built on top ofaction::reduce
returns true if all elements in the input range satisfy the given predicate.any_of
– the equivalent ofstd::any_of
. Built on top ofaction::reduce
returns true if any element of the input range satisfies the given predicate.none_of
– the equivalent ofstd::none_of
. Built on top ofaction::reduce
returns true if none of the elements in the input range satisfies the given predicate.reduce
– folds the range with the given associative and commutative binary function.to_cpu
– materialises the input range and transforms it to the host memory.to_gpu
– materialises the input range in the GPU memory.
filter
– a view of the input range without the elements that do not satisfy the given predicate.iota
– a view of a sequence of elements.take
– a view containing only N first elements of the input range.to_gpu
– a view of a range in the host memory.transform
– a view of the input range with the given unary function applied to all its elements.zip
– a view of pairs, with the Nth pair containing the Nth elements of the first and the second input range.
auto result = ranges_gpu::view::iota(1, 5)
| ranges_gpu::view::transform([] __host__ __device__(int x) {
return float(x) / 10;
})
| ranges_gpu::action::to_cpu()
;
// result contains: {0.1f, 0.2f, 0.3f, 0.4f}
This example uses view::iota
to create (on the GPU) a range of 4 integers {1, 2, 3, 4}
, transforms it by dividing each of them by 10
and transfers the result to the host memory.
auto vector = std::vector<int>{1, 2, 3, 4, 5, 6, 7, 8, 9};
auto result = vector
| ranges_gpu::view::to_gpu()
| ranges_gpu::view::filter([] __device__(int x) {
return x % 3 == 0;
})
| ranges_gpu::view::transform([] __device__(int x) {
return x / 3;
})
| ranges_gpu::action::to_cpu()
;
// result contains: {1, 2, 3}
The code above transfers a vector to the GPU memory, filters it by removing elements non divisible by 3
, divides the remaining ones by 3
and copies the result back to the CPU.
auto gpu_array = ranges_gpu::view::iota(1, 10)
| ranges_gpu::action::to_gpu()
;
auto result = gpu_array
| ranges_gpu::action::reduce(0, [] __device__(int a, int b) {
return a + b;
})
;
// result is equal 45
In this example, an GPU array of 9 integers ([1, 9]
) is created first. Then, all of them are added to each other using action::reduce
and the result is returned on the host.
auto on_cpu = std::vector<int>{10, 20, 30, 40, 50, 60, 70, 80, 90};
auto result = ranges_gpu::view::zip(
ranges_gpu::view::iota(1, 10),
on_cpu | ranges_gpu::view::to_gpu()
)
| ranges_gpu::view::transform([] __device__(ranges_gpu::pair<int, int> v) {
return v.first + v.second * 2;
})
| ranges_gpu::view::filter([] __device__(int x) {
return x % 2 && x / 10 <= 8;
})
| ranges_gpu::action::to_cpu()
;
// result contains: {21, 63}
The first step here is to create two ranges on the GPU:
- a range of 9 integers
[1, 9]
usingview::iota
- a view of a CPU range
on_cpu
Those two ranges are then zipped together using view::zip
. The result is a range of pairs. Those pairs are then transformed and filtered by view::transform
and view::filter
, respectively. Finally, the resulting range is copied back to the host memory.
auto input = std::vector<int>{37, 44, 68, 74, 92, 4, 34, 97, 14, 27};
auto on_gpu = input
| ranges_gpu::view::to_gpu()
| ranges_gpu::action::to_gpu()
;
float average = on_gpu
| ranges_gpu::action::reduce(0, [] __device__(int a, int b) {
return a + b;
})
;
average /= input.size();
// average is 49.1f
auto result = on_gpu
| ranges_gpu::view::filter([average] __device__(float v) {
return v > average;
})
| ranges_gpu::action::to_cpu()
;
// result contains: {68, 74, 92, 97}
This example consists of three steps:
- An input CPU range is copied to the GPU. This is done separately and not as a part of any other pipeline because we are going to need it more than once.
- The elements of the GPU range are added together in order to get the average value.
- The GPU range is filtered by removing elements that are not greater than the average. The result is copied back to the CPU.
- Performance improvements
- Most algorithms use very naive implementations. There is a lot to improve.
- Consider how allowing the user to do more tuning user will affect the design.
- There may be opportunities for more efficient template specialisations.
- Process data in batches – at the moment all elements of a range are processed at once. Doing so in sufficiently large batches would allow ranges that do not fit completely in the GPU memory (including infinite ranges).
- Add more actions and views:
concat
sort
unique
- support for ranges of ranges would enable
split
andjoin
- More formal design and documentation. In particular, clarify which classes and their members are public interface and which are implementation details.
- Not all views can be composed with views of unknown size (those that use
view::filter
) – fix that. - Ensure good integration with the standard C++ ranges.
- Add support for CUDA streams.
- Add more items to the TODO.