Cherry-pick fixes from honeycombio fork?
Closed this issue · 14 comments
Seems like there are some optimizations in https://github.com/honeycombio/go-tdigest that might be nice. I'm not sure if @ianwilkes was planning on upstreaming them.
In a vacuum I don't plan on it. There's some work to do given that we've switched to uint64 counts and are way, way behind head (mainly because the fenwick tree change makes things slower for us), and my changes are only of interest to people handling large numbers of digests. I'm not sure if more of those people exist.
But if there's a desire, we can do it. The new serialization methods are huge improvements; the `Merge()` optimization is less significant and frankly dangerous.
Thanks for bringing this up!
I was surprised to hear the fenwick tree slowed things down, but I'm guessing it's because merging and deserialization are frequent in your case? We should be able to do something about it, or at least offer an option to use a summary implementation without the cache.
I think it would be nice to have it all merged together as the project would definitely benefit from the experience of large scale users, but maybe this can wait a bit.
From this I got that there's a need to:
- Add benchmarks for the (de)serialization and merge paths (a rough sketch closes this comment)
- Optimize them: for deserialization, rebuilding the fenwick tree is likely unnecessary for every add
- Maybe provide an option to not use the fenwick tree at all, if we can't squeeze enough out of nr. 2 above
I won't be working on those in the very near future, but I'm happy to assist if someone picks it all up before I do.
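For the benchmark bullet, here's a minimal sketch of what those benchmarks could look like. It assumes a v2-style caio/go-tdigest API (`New(tdigest.Compression(...))`, `Add`, `AsBytes`, `FromBytes`, `Merge`); the exact signatures have shifted between versions, so treat the calls as assumptions to adjust rather than a drop-in file:

```go
package tdigest_test

import (
	"bytes"
	"math/rand"
	"testing"

	"github.com/caio/go-tdigest"
)

// buildDigest fills a digest with n random samples. Constructor and Add
// signatures are assumed from the v2-style API; adjust for your version.
func buildDigest(b *testing.B, n int) *tdigest.TDigest {
	b.Helper()
	t, err := tdigest.New(tdigest.Compression(100))
	if err != nil {
		b.Fatal(err)
	}
	for i := 0; i < n; i++ {
		_ = t.Add(rand.Float64())
	}
	return t
}

// BenchmarkSerde round-trips a digest through AsBytes/FromBytes.
func BenchmarkSerde(b *testing.B) {
	t := buildDigest(b, 10000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		buf, err := t.AsBytes()
		if err != nil {
			b.Fatal(err)
		}
		if _, err := tdigest.FromBytes(bytes.NewReader(buf)); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkMergeManySmall models the "lots of small digests" workload.
func BenchmarkMergeManySmall(b *testing.B) {
	small := make([]*tdigest.TDigest, 100)
	for i := range small {
		small[i] = buildDigest(b, 50)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		dst, err := tdigest.New(tdigest.Compression(100))
		if err != nil {
			b.Fatal(err)
		}
		for _, s := range small {
			_ = dst.Merge(s)
		}
	}
}
```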
We do deserialize and merge a great deal; we also have use cases with very large numbers of fairly small digests, where `summary.Add` ends up being called much more frequently than in the benchmark scenario. I would definitely need an option to disable fenwick in these cases, especially since the improved summing method I just added is about 2x as fast as the old one fenwick replaced.
The biggest sticking point I can see right now is that we've switched entirely to uint64 counts, making most of our code incompatible and adding a memory cost for people who don't need them. What do you think the right solution is here? Switching to 64-bit internally is straightforward, but supporting both adds a fair bit of complexity. I can imagine two flavors of summary, and potentially even upgrading from 32 to 64 bits on an as-needed basis, but it seems potentially brittle.
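One way the "upgrade from 32 to 64 bits on an as-needed basis" idea could look. This is purely an illustrative sketch, not code from either fork, and the `counts` type and its fields are hypothetical names:

```go
package summary

import "math"

// counts is a hypothetical count store that keeps uint32 counts until an
// increment would overflow, then promotes the whole slice to uint64 once.
type counts struct {
	c32 []uint32
	c64 []uint64 // nil until promotion happens
}

func (c *counts) incr(i int, by uint64) {
	if c.c64 != nil {
		c.c64[i] += by
		return
	}
	if uint64(c.c32[i])+by > math.MaxUint32 {
		// Promote to 64 bits exactly once, paying the memory cost only
		// for digests that actually need it.
		c.c64 = make([]uint64, len(c.c32))
		for j, v := range c.c32 {
			c.c64[j] = uint64(v)
		}
		c.c32 = nil
		c.c64[i] += by
		return
	}
	c.c32[i] += uint32(by)
}
```

The brittleness mentioned above shows up exactly here: every read path now has to branch on which representation is live.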
From an algorithm perspective, I'm curious whether there's a more efficient way to `Compress` and `Merge`. These are both big problems for us, again due to handling lots of low-count digests.
On moving counts to uint64: I don't think we should fret much about changing it. While it's a relatively big increase, a tdigest is tiny memory-wise, so holding a few thousand 64-bit integers instead of 32-bit ones shouldn't be a problem - I think we can move forward without much worry and figure out what to do if it becomes an issue for somebody else. (Besides, the API now has mixed uint32 and uint64, and that bothers me more than it probably should.)
As for improving `Add`: I suspect allocations may be a major culprit:
- Allocations may be happening unexpectedly often: for `t1.Merge(t2)`, if t2 is significantly larger than t1, the `append` call may need to allocate and copy to a new array several times
- The fenwick tree gets rebuilt for every single `Add`, which means an extra allocation and copy per call (plus the internal `make` for building the tree). It could be forked to hold and reuse its own buffer instead; a constructor that accepts a sorted `[]uint64` and a way to reset the tree array would be very efficient, since appending to this kind of tree is cheap
Handling those cases better should give a nice performance boost in many scenarios (see the sketch right below for the buffer-reuse idea). We could also go "complex" and disable the cache only during Merge/Compress runs (and your faster prefix-sum code would make it even better!).
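For the second bullet, a rough sketch of what a reuse-friendly fenwick tree could look like. The type and method names are illustrative, not taken from either repository:

```go
package fenwick

// Tree is a Fenwick (binary indexed) tree over counts that reuses its
// backing buffer across rebuilds instead of allocating on every Add.
type Tree struct {
	tree []uint64
}

// Reset rebuilds the tree from the given counts in a single O(n) pass,
// reusing the existing buffer when it is large enough.
func (t *Tree) Reset(counts []uint64) {
	n := len(counts) + 1
	if cap(t.tree) < n {
		t.tree = make([]uint64, n)
	} else {
		t.tree = t.tree[:n]
		for i := range t.tree {
			t.tree[i] = 0
		}
	}
	for i, c := range counts {
		t.tree[i+1] += c
		if parent := i + 1 + ((i + 1) & -(i + 1)); parent < n {
			t.tree[parent] += t.tree[i+1]
		}
	}
}

// PrefixSum returns counts[0] + ... + counts[i-1].
func (t *Tree) PrefixSum(i int) uint64 {
	var sum uint64
	for ; i > 0; i -= i & -i {
		sum += t.tree[i]
	}
	return sum
}

// Add increments counts[i] by delta without a full rebuild.
func (t *Tree) Add(i int, delta uint64) {
	for i++; i < len(t.tree); i += i & -i {
		t.tree[i] += delta
	}
}
```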
Both Merge and Compress are designed to minimize pathological input scenarios where every new data point gets added as a new centroid instead of being merged with an existing one (`chooseMergeCandidate`), so that's why the shuffling happens. Deep within the issues of the original implementation it was shown that shuffling improved accuracy (sorry, I couldn't find it in a couple of search queries), but maybe doing something different - say, inserting the big numbers first - is already good enough. Or you could replace your Merge calls with a `ForEachCentroid` dance (sketched below) and compare results to see whether you care about these at all. I'd be surprised to see `Compress` happening frequently for real-world data - if it's just because of the recommendation to compress before serializing, that advice can easily be ignored, especially since you're likely merging on the other end of the pipe.
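The "ForEachCentroid dance" could look roughly like this. The iterator and `AddWeighted` signatures are assumptions (counts have been uint32 in some versions and uint64 in others), so adjust to whatever your version exposes:

```go
package main

import "github.com/caio/go-tdigest"

// mergeViaIteration merges src into dst by walking src's centroids instead
// of calling Merge. Both the iterator signature and AddWeighted are
// assumed here and have changed across versions.
func mergeViaIteration(dst, src *tdigest.TDigest) error {
	var err error
	src.ForEachCentroid(func(mean float64, count uint32) bool {
		err = dst.AddWeighted(mean, count)
		return err == nil // stop at the first failure
	})
	return err
}
```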
And now some handwavy, unsolicited advice from my side - please ignore it if I'm shooting too far off the mark. From the "many small digests" remark I suspect that you're somehow map-reducing digests for queries on arbitrary customer data: you map a sequence of numbers to a `TDigest` and the reducer suffers from having to merge all those. If so, what likely works best is having a threshold where you decide whether to map to a serialized digest or to just some thousands of raw numbers instead (you'd have to define the threshold empirically, but I'm guessing 10k numbers is a good starting point). That way there's no scenario where you are merging small digests: the reducer is either doing a straight `Add` or merging with a somewhat big digest (roughly like the sketch below).
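A sketch of that threshold idea, with hypothetical names and indicative tdigest calls only (the constructor signature is assumed):

```go
package main

import "github.com/caio/go-tdigest"

// rawThreshold is the point where raw samples get converted into a digest;
// 10k is just the empirical starting point suggested above.
const rawThreshold = 10000

// sketch is a hypothetical per-key accumulator: raw float64s below the
// threshold, a TDigest above it.
type sketch struct {
	raw    []float64
	digest *tdigest.TDigest
}

func (s *sketch) add(v float64) error {
	if s.digest != nil {
		return s.digest.Add(v)
	}
	s.raw = append(s.raw, v)
	if len(s.raw) >= rawThreshold {
		return s.promote()
	}
	return nil
}

// promote converts the raw buffer into a digest exactly once, so the
// reducer only ever adds plain numbers or merges reasonably big digests.
func (s *sketch) promote() error {
	d, err := tdigest.New(tdigest.Compression(100)) // signature assumed
	if err != nil {
		return err
	}
	for _, v := range s.raw {
		if err := d.Add(v); err != nil {
			return err
		}
	}
	s.raw = nil
	s.digest = d
	return nil
}
```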
While writing this answer I've started to suspect the fenwick tree is complete overkill, since all we need is a prefix sum from the first index up to another (parameterized) one, so a trivial cache would work even better. Perhaps I got too excited upon learning about it and slapped it onto the first project I found a use for. I'll verify this when I can, but maybe bullet nr. 2 above is even simpler than it sounds (see the cache sketch at the end of this comment).
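And the "trivial cache" could be as small as this - again just an illustrative sketch with made-up names:

```go
package summary

// prefixCache is the "trivial cache" alternative to a fenwick tree:
// cumulative counts rebuilt lazily after any mutation, giving O(1)
// lookups and a single O(n) pass per rebuild.
type prefixCache struct {
	sums  []uint64
	dirty bool
}

// invalidate marks the cache stale; call it after any count mutation.
func (p *prefixCache) invalidate() { p.dirty = true }

// sumUntil returns counts[0] + ... + counts[i-1], rebuilding the
// cumulative sums from counts only when something changed.
func (p *prefixCache) sumUntil(counts []uint64, i int) uint64 {
	if p.dirty || len(p.sums) != len(counts)+1 {
		if cap(p.sums) < len(counts)+1 {
			p.sums = make([]uint64, len(counts)+1)
		}
		p.sums = p.sums[:len(counts)+1]
		p.sums[0] = 0
		for j, c := range counts {
			p.sums[j+1] = p.sums[j] + c
		}
		p.dirty = false
	}
	return p.sums[i]
}
```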
Alright, if uint64 will work for you, I will - as time allows - move toward getting our fork into a condition where it could be merged.
The implementation you describe where we just keep a raw list of numbers up to a threshold is exactly what we have already. But, memory cost is a big concern as well; under some conditions, we're handling several million digests concurrently. So the threshold can't be too high.
I'd be interested to bake off the fenwick tree against the faster sumUntil. As written, fenwick is definitely faster for the large single-digest case expressed in `BenchmarkAdd100`, and presumably much faster for larger compression values. As an experiment, I did actually implement a sumUntil method which memoized prefix sums, but the performance wasn't any better than brute-force adding. I suspect this was due to the cost of writing all the memoized sums to memory, but didn't dig very deep.
Awesome! I'll be happy to accept a patch with the uint64 changes and the performance improvements, even if it completely drops the fenwick thing - I can work on (optionally) enabling it as a follow-up, along with some performance measurements against the new code.
Thanks for elaborating on your usage scenario - the scale you're operating at is impressive, and it's clear now why fast merges and serde are important.
@ianwilkes I am sorry to bother you, but do you have any plans to merge your work back into the main lib any time soon (I mean weeks, not months)? If not, what do you think about me using some of your ideas in my patches? Is there any etiquette I should follow? Specifically I am interested in https://github.com/honeycombio/go-tdigest/blob/master/summary.go#L151-L165 and `MergeDestructive`.
I've tried to just replace github.com/caio/go-tdigest with github.com/honeycombio/go-tdigest and that gives a 40% performance boost, which is a great result. But I also get negative quantiles, which is most likely just a mistake on my side; it is hard to reproduce, though, and it seems easier to port your patches rather than debug the code I already have...
@vmihailenco I tried replying to you on #23 but github is swallowing my comments, it seems :/
So, regarding the negative quantiles: it's because honeycomb's master branch doesn't contain the fixes for #12 - so their code is faster both because of ian's optimizations and because it holds a lot fewer centroids at once.
Have you tried the `ian.update-head` branch? It should work properly for your tests.
Oooh, there's a timezone issue either on my side or on GitHub's side, lol. My comments are appearing above yours, and now I see this: https://imgur.com/a/Xowr15c - and related, I'm in the future. I hope it's on my side (my clocks are fine, though), otherwise it looks like the start of a bad day of firefighting 🤣
@vmihailenco Let's assume I won't get to it in the next few weeks, sorry. If you wanted to do a PR for upstream with the new `sumUntil` implementation (with 32-bit ints of course), it would save me time later. Same with `MergeDestructive`, if @caio wants that in the API. Note the `ian.update-head` branch is not something I support or recommend; it was just an experiment.
I'm happy to accept patches with `MergeDestructive` and similar performance-oriented features.
With v2.2.0 (thanks all!) I believe we're now really close to being in sync. The only big thing missing on this side, I think, is the move to 64-bit counts.
I think we can call this done. I'm still thinking about a faster way to do Merge, but... I'm not thinking about it very hard.
Agreed, closing it. Thanks everyone!