Using Flux to adapt a circular buffer : bug
pfeatherstone opened this issue · 7 comments
I am trying to use Flux to simplify dot products with circular buffers.
I have an implementation that manually uses the STL on two sub-ranges. This is really fast, but you have to do some manual book-keeping, and if you care about more than one STL algorithm (inner_product in this case), then you have to reimplement everything yourself on the two sub-ranges.
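The split-into-two-sub-ranges approach described above can be sketched as follows (a minimal illustration with hypothetical names, not the poster's actual class):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical minimal circular buffer -- the names here are illustrative
// and not the poster's actual class.
struct circular_buffer {
    std::vector<double> buf;  // fixed-size storage
    std::size_t offset = 0;   // index of the logically-first element

    // Dot product against a plain vector of the same length, computed as
    // two std::inner_product calls over the two contiguous sub-ranges
    // [offset, size) and [0, offset).
    double dot_product(const std::vector<double>& w) const {
        const std::size_t tail = buf.size() - offset;  // elements before the wrap
        double sum = std::inner_product(buf.begin() + offset, buf.end(),
                                        w.begin(), 0.0);
        return std::inner_product(buf.begin(), buf.begin() + offset,
                                  w.begin() + tail, sum);
    }
};
```

Each of the two inner_product calls walks memory linearly, which is what makes this version fast, but any further algorithm has to repeat the same two-sub-range book-keeping.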
I want to use Flux to define a sequence adaptor, and since a flux sequence defines iterators, I can in theory use that with std::inner_product
to compute my dot products.
However, I'm having trouble adapting the container.
Here is a short link https://godbolt.org/z/3PKeT5Wee that better explains my intention.
Thank you
Hi @pfeatherstone, thanks for your interest in Flux! The good news is that there isn't actually a Flux bug here :)
As a C++98 algorithm, std::inner_product expects a (begin, end) iterator pair as its first two arguments. However, as mentioned in the documentation, for a non-bounded_sequence Flux instead returns a C++20 sentinel from its end() function. (If you think about it, this makes sense: if Flux doesn't know where the end of the sequence is, it can't produce an iterator to that position in constant time.)
There are several ways we can fix this:
- The C++20 versions of the classic STL algorithms, in namespace std::ranges, have been updated to accept iterator/sentinel pairs rather than classic iterator/iterator pairs. Unfortunately, there is no std::ranges::inner_product, so that's not an option in this case (although I do have another small library that offers it).
- We could use the flux::cache_last adaptor on the p::circular_buffer to turn it into a bounded_sequence. However, that would be doing more work than we really need to in this case.
- We could use the C++20 std::views::common range adaptor to transform the range into a std::ranges::common_range, that is, one where begin() and end() return the same type, so we can call the legacy algorithm. This is exactly what views::common is for, but it has a small amount of overhead that, as it happens, we can avoid in this case.
- Since the circular_buffer class has the ability to produce the end position in constant time, the best approach is to turn it into a bounded_sequence by adding a one-line implementation of last() to its flux_sequence_traits. If we do that, then everything compiles correctly: https://godbolt.org/z/aa7194ej7
Unfortunately, as you can see, the runtime of the Flux version is not as good as the custom circular_buffer::dot_product() implementation. From what I can tell, this is because of the way the custom version uses its special knowledge of the circular_buffer internals. It makes two calls to std::inner_product, each of which uses a simple linear data access pattern. Because of this simple access pattern, the compiler is able to auto-vectorise both calls to inner_product, resulting in excellent run time.

For the Flux version, there is only a single call to std::inner_product over the whole sequence, but the access pattern (using v.buf[(cur + v.offset) % v.size()]) seems to be too complicated for the auto-vectoriser to see through, so we don't get any SIMD and things run slower.
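The difference between the two access patterns can be reproduced in isolation (an illustrative sketch, not the code from the godbolt links): both functions below compute the same dot product, but only the second gives each loop a linear access pattern that compilers typically vectorise:

```cpp
#include <cstddef>
#include <vector>

// Single-loop version over the logical sequence: the modulo in the index
// calculation tends to defeat the auto-vectoriser.
inline double dot_modulo(const std::vector<double>& buf, std::size_t offset,
                         const std::vector<double>& w) {
    double sum = 0.0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        sum += buf[(i + offset) % buf.size()] * w[i];
    }
    return sum;
}

// Two-loop version: each loop walks memory linearly, which compilers can
// typically turn into SIMD.
inline double dot_split(const std::vector<double>& buf, std::size_t offset,
                        const std::vector<double>& w) {
    const std::size_t n = buf.size();
    double sum = 0.0;
    for (std::size_t i = offset; i < n; ++i) sum += buf[i] * w[i - offset];
    for (std::size_t i = 0; i < offset; ++i) sum += buf[i] * w[n - offset + i];
    return sum;
}
```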
This is just a consequence of the custom circular_buffer::dot_product() having extra knowledge about the implementation and the compiler being able to optimise using that knowledge. If you were to add a custom circular_buffer::iterator implementation and made a single call to std::inner_product with it, you'd very likely get the same runtime as the version using Flux.
I hope this helps! I'm going to close this as it's not a Flux bug, but feel free to comment further if you have any questions, or file other issues if you come across any other problems :)
Thank you for the great response! Everything you wrote makes sense and I didn't know about the extra customisation point in the flux traits.
Out of interest, do you intend on adding numeric algorithms to flux like reductions, such as inner_product?
Also, is there a way to adapt a container as the concatenation of two contiguous containers?
And if so, could flux have overloads of numeric algorithms for concatenated contiguous sequences such that they could make use of optimisations similar to what I did in dot_product?
Also, I have to say, flux is SO MUCH NICER than C++ iterators, STL algorithms and ranges! Adapting containers in flux is approximately 42000 times easier than writing custom iterators or range adaptors.
> Thank you for the great response! Everything you wrote makes sense and I didn't know about the extra customisation point in the flux traits.
>
> Out of interest, do you intend on adding numeric algorithms to flux like reductions, such as inner_product?
We already have quite a few such algorithms, although we don't have an inner_product with the same default operations as the std version. However, you can use flux::zip_fold() with a lambda:
```cpp
std::vector<int> vec1 = get_vector1();
std::vector<int> vec2 = get_vector2();

auto fn = [](int sum, int x, int y) { return sum + (x * y); };
auto result = flux::zip_fold(fn, 0, vec1, vec2);
```
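For comparison, the C++17 standard library can express the same default inner product via std::transform_reduce, which (unlike std::inner_product) is permitted to reorder the reduction and therefore to vectorise; a minimal sketch:

```cpp
#include <numeric>
#include <vector>

// Same default operations as std::inner_product (plus and multiplies),
// but std::transform_reduce may reorder the reduction, which permits
// vectorisation.
inline int dot(const std::vector<int>& a, const std::vector<int>& b) {
    return std::transform_reduce(a.begin(), a.end(), b.begin(), 0);
}
```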
> Also, is there a way to adapt a container as the concatenation of two contiguous containers?
Yes. The flux::chain adaptor takes an arbitrary number of sequences and "glues" them together as one logical sequence. For example:
```cpp
std::vector vec1{4, 5, 6};
std::vector vec2{1, 2, 3};

auto chained = flux::chain(std::move(vec2), std::move(vec1));
chained.write_to(std::cout); // prints [1, 2, 3, 4, 5, 6]
```
In general, the chain adaptor uses the highest common sequence category of all of its parts (except that it can never be contiguous, for obvious reasons). In this case the chained sequence is random-access.
> And if so, could flux have overloads of numeric algorithms for concatenated contiguous sequences such that they could make use of optimisations similar to what I did in dot_product?
For the (majority of) algorithms which take a single sequence and iterate over it in order from start to finish, the chain adaptor already does this. For example, something like

```cpp
// using vec1 and vec2 from above
auto total = flux::chain(flux::ref(vec1), flux::ref(vec2)).sum();
```

will effectively call the sum() algorithm on vec1 followed by vec2 and then add the results together. This leads to excellent code generation in a lot of cases. (The key to this is specialising the for_each_while customisation point to allow internal iteration.)
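The internal-iteration idea can be sketched in plain C++ (illustrative only, not Flux's actual implementation): a "chain" of two containers runs the loop body over each part in turn, so each inner loop is a simple linear scan:

```cpp
#include <vector>

// Illustrative only -- not Flux's implementation. Internal iteration for a
// chain of two containers: run the predicate over each part in turn. Each
// inner loop walks one container linearly, and iteration stops early as
// soon as the predicate returns false.
template <typename Pred>
bool chain_for_each_while(const std::vector<int>& a,
                          const std::vector<int>& b, Pred pred) {
    for (int x : a) {
        if (!pred(x)) return false;
    }
    for (int x : b) {
        if (!pred(x)) return false;
    }
    return true;
}

// A sum() built on top of the internal-iteration primitive: it effectively
// sums each container in sequence, just as described above.
inline int chain_sum(const std::vector<int>& a, const std::vector<int>& b) {
    int total = 0;
    chain_for_each_while(a, b, [&](int x) { total += x; return true; });
    return total;
}
```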
For algorithms like dot_product that operate on multiple sequences, however, things are much trickier to implement generically, as we don't know which is the "special" sequence that we want to dispatch to -- and what happens if both sequences are "special"? I haven't been able to come up with a good way of doing this for multiple sequences, though it would certainly be nice if it turns out to be possible. In the meantime, for something like dot_product I think your approach of using a custom member function is probably going to offer the best performance.
> Also, I have to say, flux is SO MUCH NICER than C++ iterators, STL algorithms and ranges! Adapting containers in flux is approximately 42000 times easier than writing custom iterators or range adaptors.
Thank you very much, this was a major goal when I was designing the library so I'm delighted to hear it was successful!
Thank you again. I added your zip_fold version and a version using range-v3, and compared them with Clang. Yeah, I think a custom algorithm is best. Sad. However, I think flux is faster than ranges.
https://godbolt.org/z/MKx69n7br
Edit:
Running this on my machine I get:

```
STL dot 838300, computed in 17.433369 ms
STL rdot 671650, computed in 17.683371 ms
Flux dot 838300, computed in 47.490080 ms
Flux rdot 671650, computed in 47.387517 ms
Flux dot 838300, computed in 50.761227 ms
RangeV3 dot 838300, computed in 48.475297 ms
RangeV3 rdot 671650, computed in 93.008346 ms
```
So maybe flux and ranges have similar performance.
I have good news!
I had a look at your link and saw that you changed the implementation of read_at() for circular_buffer to try to simplify the access pattern. This gave me an idea: by removing the conditional branch and using a multiplication instead, it looks like the compiler is able to recognise what we want to do and vectorise the Flux version too, giving a run time which is on par with the custom dot_product.
EDIT: Sadly not, I made a mistake and it doesn't actually work :(
@tcbrindle Awesome! Though it looks like you need gcc 12 onward. gcc 11 didn't seem to optimize as well. Compilers are magic