tcbrindle/flux

Using Flux to adapt a circular buffer: bug

pfeatherstone opened this issue · 7 comments

I am trying to use Flux to simplify dot products with circular buffers.
I have an implementation that manually uses the STL on two sub-ranges. This is really fast, but it requires some manual book-keeping, and if you care about more than one STL algorithm (inner_product in this case) you have to reimplement everything yourself on the two sub-ranges.

I want to use Flux to define a sequence adaptor, and since a flux sequence defines iterators, I can in theory use that with std::inner_product to compute my dot products.

However, I'm having trouble adapting the container.
Here is a short link https://godbolt.org/z/3PKeT5Wee that better explains my intention.
Thank you

Hi @pfeatherstone, thanks for your interest in Flux! The good news is that there isn't actually a Flux bug here :)

As a C++98 algorithm, std::inner_product expects a (begin, end) iterator pair as its first two arguments. However, as mentioned in the documentation, for a non-bounded_sequence Flux instead returns a C++20 sentinel from its end() function. (If you think about it, this makes sense: if Flux doesn't know where the end of the sequence is, it can't produce an iterator to that position in constant time.)

There are several ways we can fix this:

  • The C++20 versions of the classic STL algorithms, in namespace std::ranges, have been updated to accept iterator/sentinel pairs rather than classic iterator/iterator pairs. Unfortunately, there is no std::ranges::inner_product so that's not an option in this case (although I do have another small library that offers it)

  • We could use the flux::cache_last adaptor on the p::circular_buffer to turn it into a bounded_sequence. However, that would be doing more work than we really need to in this case.

  • We could use the C++20 std::views::common range adaptor to transform the range into a std::ranges::common_range, that is, one where begin() and end() return the same type, so that we can call the legacy algorithm. This is exactly what views::common is for, but it adds a small amount of overhead which, as it happens, we can avoid in this case.

  • Since the circular_buffer class has the ability to produce the end position in constant time, the best approach is to turn it into a bounded_sequence by adding a one-line implementation of last() to its flux_sequence_traits. If we do that, then everything compiles correctly: https://godbolt.org/z/aa7194ej7

Unfortunately, as you can see, the runtime of the Flux version is not as good as the custom circular_buffer::dot_product() implementation. From what I can tell, this is because of the way the custom version uses its special knowledge of the circular_buffer internals. It makes two calls to std::inner_product, each of which uses a simple linear data access pattern. Because of this simple access pattern, the compiler is able to auto-vectorise both calls to inner_product, resulting in excellent run time.

For the Flux version, there is only a single call to std::inner_product over the whole sequence, but the access pattern (using v.buf[(cur + v.offset) % v.size()]) seems to be too complicated for the auto-vectoriser to see through, so we don't get any SIMD and things run slower.

This is just a consequence of the custom circular_buffer::dot_product() having extra knowledge about the implementation, and the compiler being able to optimise using that knowledge. If you were to add a custom circular_buffer::iterator implementation and make a single call to std::inner_product with it, you'd very likely get the same runtime as the version using Flux.

I hope this helps! I'm going to close this as it's not a Flux bug, but feel free to comment further if you have any questions, or file other issues if you come across any other problems :)

Thank you for the great response! Everything you wrote makes sense and I didn't know about the extra customisation point in the flux traits.

Out of interest, do you intend on adding numeric algorithms to flux like reductions, such as inner_product?

Also, is there a way to adapt a container as the concatenation of two contiguous containers?

And if so, could flux have overloads of numeric algorithms for concatenated contiguous sequences such that they could make use of optimisations similar to what I did in dot_product?

Also, I have to say, flux is SO MUCH NICER than C++ iterators, STL algorithms and ranges! Adapting containers in flux is approximately 42000 times easier than writing custom iterators or range adaptors.

> Thank you for the great response! Everything you wrote makes sense and I didn't know about the extra customisation point in the flux traits.

> Out of interest, do you intend on adding numeric algorithms to flux like reductions, such as inner_product?

We already have quite a few such algorithms, although we don't have an inner_product with the same default operations as the std version. However, you can use flux::zip_fold() with a lambda:

std::vector<int> vec1 = get_vector1();
std::vector<int> vec2 = get_vector2();

auto fn = [](int sum, int x, int y) { return sum + (x * y); };

auto result = flux::zip_fold(fn, 0, vec1, vec2);

> Also, is there a way to adapt a container as the concatenation of two contiguous containers?

Yes. The flux::chain adaptor takes an arbitrary number of sequences and "glues" them together as one logical sequence. For example:

std::vector vec1{4, 5, 6};
std::vector vec2{1, 2, 3};

auto chained = flux::chain(std::move(vec2), std::move(vec1));

chained.write_to(std::cout); // prints [1, 2, 3, 4, 5, 6]

In general, the chain adaptor uses the highest common sequence category of all of its parts (except that it can never be contiguous, for obvious reasons). In this case the chained sequence is random-access.

> And if so, could flux have overloads of numeric algorithms for concatenated contiguous sequences such that they could make use of optimisations similar to what I did in dot_product?

For the (majority of) algorithms which take a single sequence and iterate over it in order from start to finish, the chain adaptor already does this. For example, something like

// using vec1 and vec2 from above
auto total = flux::chain(flux::ref(vec1), flux::ref(vec2)).sum();

will effectively call the sum() algorithm on vec1 followed by vec2 and then add the results together. This leads to excellent code generation in a lot of cases. (The key to this is specialising the for_each_while customisation point to allow internal iteration.)

For algorithms like dot_product that operate on multiple sequences, however, things are much trickier to implement generically, as we don't know which is the "special" sequence that we want to dispatch to -- and what happens if both sequences are "special"?

I haven't been able to come up with a good way of doing this for multiple sequences, though it would certainly be nice if it turns out to be possible. In the mean time, for something like dot_product I think your approach of using a custom member function is probably going to offer the best performance.

> Also, I have to say, flux is SO MUCH NICER than C++ iterators, STL algorithms and ranges! Adapting containers in flux is approximately 42000 times easier than writing custom iterators or range adaptors.

Thank you very much, this was a major goal when I was designing the library so I'm delighted to hear it was successful!

Thank you again. I added your zip_fold version and a version using range-v3, and compared them with clang. Yeah, I think a custom algorithm is best. Sad. However, I think flux is faster than ranges.
https://godbolt.org/z/MKx69n7br

Edit:
Running this on my machine I get:

STL      dot 838300, computed in 17.433369 ms
STL     rdot 671650, computed in 17.683371 ms
Flux     dot 838300, computed in 47.490080 ms
Flux    rdot 671650, computed in 47.387517 ms
Flux     dot 838300, computed in 50.761227 ms
RangeV3  dot 838300, computed in 48.475297 ms
RangeV3 rdot 671650, computed in 93.008346 ms

So maybe flux and ranges have similar performance.

I have good news!

I had a look at your link and saw that you changed the implementation of read_at() for circular_buffer to try to simplify the access pattern. This gave me an idea: by removing the conditional branch and using a multiplication instead, it looks like the compiler is able to recognise what we want to do and vectorise the Flux version too, giving a run-time on par with the custom dot_product.

EDIT: Sadly not, I made a mistake and it doesn't actually work :(

@tcbrindle Awesome! Though it looks like you need gcc 12 onward. gcc 11 didn't seem to optimize as well. Compilers are magic