ajwerner/tdigestc

possible bug in td_quantile_of

tvondra opened this issue · 18 comments

While looking at td_quantile_of I think I've found another bug. My knowledge of Go is pretty limited, so I'm not sure whether it's present in the other repository, and unlike the previous issue I don't have a reproducer demonstrating it yet.

When computing the quantile, the code does this:

     if (val == n->mean) {
          // technically this needs to find all of the nodes which contain this value and sum their weight
          double count_at_value = n->count;
          for (i += 1; i < h->merged_nodes && h->nodes[i].mean == n->mean; i++) {
               count_at_value += h->nodes[i].count;
          }
          return (k + (count_at_value/2)) / h->merged_count;
     }

which makes perfect sense - there may be multiple centroids with the same mean, and we need to account for all of them (and then take 1/2 of the count). So far so good.

But when the value is not exactly equal to a mean of a centroid, the code ignores this possibility. It simply does this:

     node_t *nr = n;
     node_t *nl = n-1;

and entirely ignores the possibility there might be multiple centroids with the same mean, both for nr and nl. So the weights used for linear approximation will be somewhat bogus, not quite representing values from all the relevant centroids.

I think the code should do the same lookup as in the val == n->mean branch for both nr and nl.
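Roughly what I have in mind, as a sketch only (the struct below just mirrors the two fields of node_t used in the snippet above, and the helper name is made up):

     /* Standalone sketch: sum the weight of every merged centroid whose mean
      * equals nodes[i].mean, scanning in both directions. The struct mirrors
      * the mean/count fields of node_t as used in td_quantile_of. */
     typedef struct {
          double mean;
          double count;
     } node_t;

     static double count_at_mean(const node_t *nodes, int n_nodes, int i)
     {
          double count = nodes[i].count;
          for (int j = i - 1; j >= 0 && nodes[j].mean == nodes[i].mean; j--)
               count += nodes[j].count;
          for (int j = i + 1; j < n_nodes && nodes[j].mean == nodes[i].mean; j++)
               count += nodes[j].count;
          return count;
     }

The same helper could then be applied to both nl and nr before computing the interpolation weights.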

I'm less convinced that this one is a bug. I think this interpolation is pretty sane. Let's work through a small example. For shorthand a centroid is {mean, count}.

Say our merged centroids are:
[{0, 1}, {1, 1}, {1, 1}, {1, 1}, {2, 1}] and our query td_quantile_of(h, 1.1).

The result of this definitely needs to be between .7 and .9.

In the current code we'll do:

k == 4 - 1/2 == 3.5
m == ( 2 - 1 ) / ( .5 + .5 ) == 1
x = (1.1 - 1) / m == .1
return (3.5 + .1) / 5 == .72

If we merged all of the central centroids as you suggest we'd get:

k == 4 - 3/2 == 2.5
m == (2 - 1) / (1.5 + .5 ) == .5
x = (1.1 - 1) / m = .2
return (2.5 + .2) / 5 == .54
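For completeness, here's a throwaway snippet that reproduces both computations; nothing library-specific, just the arithmetic above:

     #include <stdio.h>

     /* Reproduce the two interpolations for [{0,1},{1,1},{1,1},{1,1},{2,1}]
      * and the query td_quantile_of(h, 1.1). */
     int main(void)
     {
          double total = 5, val = 1.1, left_mean = 1, right_mean = 2;

          /* current code: the left centroid is just the last {1, 1} */
          double k = 4 - 1.0 / 2;
          double m = (right_mean - left_mean) / (1.0 / 2 + 1.0 / 2);
          printf("current: %.2f\n", (k + (val - left_mean) / m) / total);   /* 0.72 */

          /* merged: treat the three {1, 1} centroids as a single {1, 3} */
          k = 4 - 3.0 / 2;
          m = (right_mean - left_mean) / (3.0 / 2 + 1.0 / 2);
          printf("merged:  %.2f\n", (k + (val - left_mean) / m) / total);   /* 0.54 */
          return 0;
     }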

Sort of in the same area: I think the logic you pointed to is buggy when the query is exactly the left-most or right-most point. If there is only one centroid at that point, it might be reasonable to return 0 or 1. It all sort of depends on definitions, I suppose.

           for (i += 1; i < h->merged_nodes && h->nodes[i].mean == n->mean; i++) {
                count_at_value += h->nodes[i].count;
           }
+          // If val is exactly the left-most or right-most point and there is
+          // only one of those points then call it the 0th or 100th percentile.
+          if (i == 1) {
+               return 0;
+          } else if (i == h->merged_nodes) {
+               return 1;
+          }

Again, I do appreciate the interest. The logic to merge centroids of the same value is missing in my Go implementation and I'll add it. Thanks for bringing it up. I did go ahead and hook up the testing suite from the Go library to this one.

What seems strange to me is that this depends on the order of the centroids, and I don't think there's any guarantee they will all have the same count value. So maybe you'll find them like this:

[{0, 1}, {1, 1}, {1, 2}, {1, 3}, {2, 1}]

or maybe

[{0, 1}, {1, 3}, {1, 2}, {1, 1}, {2, 1}]

and this will affect the result of td_quantile_of, although the digest is essentially the same.

IIRC the paper does say the ordering should be deterministic even for centroids with the same mean, i.e. the compare_nodes comparator seems to be a few bricks shy. There seem to be only a few options - we may order either by count or by mean * count, ascending or descending. The question is, is one of those options the most correct and/or superior?
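To make that concrete, the kind of tie-break I mean is something like the following sketch (I'm assuming the usual qsort-style signature for compare_nodes and the same node_t fields as above, and ascending vs. descending count is exactly the open question):

     /* Sketch of a compare_nodes that orders by mean and then breaks ties by
      * count, so the merged order no longer depends on insertion order. */
     static int compare_nodes(const void *v1, const void *v2)
     {
          const node_t *n1 = v1;
          const node_t *n2 = v2;
          if (n1->mean < n2->mean)
               return -1;
          if (n1->mean > n2->mean)
               return 1;
          /* tie-break: ascending count (could just as well be descending,
           * or mean * count) */
          if (n1->count < n2->count)
               return -1;
          if (n1->count > n2->count)
               return 1;
          return 0;
     }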

That makes sense. I now see that what I'm trying to do with the summing wouldn't make sense for the left centroid in quantile_of, and the current behavior of just picking whichever one comes first doesn't work either.

Having multiple centroids of the same value is a generally meaningful occurrence in a tdigest: it implies that the data actually really does refer to the same data point. It's likely the case that we can contrive a dataset that could end up having many centroids of the same average value that didn't correspond to a frequent actual value, but I suspect that doesn't happen naturally. It would be a shame to throw that information away and treat [{1, 1}, {1, 1}, {1, 1}] the same as [{1, 3}]. I wonder if something else makes sense, like taking the minimum count or the average count of all of the centroids with the same mean. Minimum seems especially compelling to me.

I think what makes the example fail is that all the centroids have count 1, which is the one case where you know exactly where the inputs were. So in that case it does not make much sense to subtract 1/2 the count (even with the first formula), because that's based on the assumption that half of the points are on the left and half on the right. But when there's just 1, it's at the mean itself. So maybe it shouldn't be subtracted, and we should only use the other centroid for the approximation?

IMHO that's essentially what Fig. 4 on page 15 in the paper describes, except that we have multiple such centroids with the same mean, so the step will just be larger.

FWIW I do agree throwing away the extra information about centroids with count=1 would be a shame (it's likely somewhat rare, though).

That's another great point. If we know we have only one point then our interpolation function could take advantage of that. This assumption will only work if you know you're adding data with integer counts. There used to be such an optimization but it was ripped out to deal with adding non-integer count values. If you know you're dealing with just integers, something like ((count - 1) * interpolation + mean) / count could work (and then invert that for quantileOf).
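In code the idea is roughly this (names made up for illustration; it only works when counts are whole numbers of points):

     /* Sketch: one of the centroid's points is assumed to sit exactly at the
      * mean, so only the remaining count - 1 points get interpolated. */
     static double value_with_exact_point(double interpolated, double mean, double count)
     {
          return ((count - 1) * interpolated + mean) / count;
     }

     /* Inverse of the above, for quantile_of. Undefined for count == 1, where
      * the centroid is the single point itself. */
     static double interpolation_with_exact_point(double value, double mean, double count)
     {
          return (value * count - mean) / (count - 1);
     }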

I do think just taking the minimum count of the centroids with the same value on either side is going to go a long way.

Not sure - it does not seem particularly principled. For example, assume you have

[{0, 1}, {1, 10000}, {1, 10}, {2, 1}]

Using the minimum count (i.e. 10) essentially means you'll ignore the other centroid, which represents a much larger chunk of data. That seems strange.

Also, imagine you have two digests

[{0, 1}, {1, 10}, {1, 10}, {2, 1}]
[{0, 1}, {1, 20}, {2, 1}]

These essentially contain the same amount of information (at least it seems to me). Why'd you use 10 for the first one, but 20 for the second one?

So I'm not quite convinced using the minimum (or maximum) is better than just aggregating all the centroids as I proposed earlier, with the extra handling of centroids with count=1. Of course, if the count is not stored as an integer, that may not be possible :-(

These essentially contain the same amount of information (at least it seems to me). Why'd you use 10 for the first one, but 20 for the second one?

An argument I was trying to make above is that having two centroids with the same value is rare and carries information.

Try to pick some value, and then come up with a stream of data that does not contain that value (or at least does not commonly contain it) but still has multiple centroids whose mean is exactly that value. I think it's hard. Using the minimum says that we believe all the other centroids which share that mean can be thought of as lying entirely below the query point.

An argument I was trying to make above is that having two centroids with the same value is rare and carries information.

I don't think having multiple centroids with the same mean carries information, unless the count is 1.

Try to pick some value, and then come up with a stream of data that does not contain that value (or at least does not commonly contain it) but still has multiple centroids whose mean is exactly that value. I think it's hard. Using the minimum says that we believe all the other centroids which share that mean can be thought of as lying entirely below the query point.

I'm not sure it's all that rare, TBH. Say we're dealing with response times, rounded to milliseconds, and that most requests are handled within 1 second. Chances are, most requests are handled in a fairly narrow interval, say 50 - 150 ms. So the vast majority of the data will consist of those ~100 distinct durations. I think it's quite likely we'll get multiple centroids with the same mean in this case (particularly for larger digests).

Sure, this example may be a bit artificial, but it does not seem all that far fetched ...

I'm not sure it's all that rare, TBH. Say we're dealing with response times, rounded to milliseconds, and that most requests are handled within 1 second. Chances are, most requests are handled in a fairly narrow interval, say 50 - 150 ms. So the vast majority of the data will consist of those ~100 distinct durations. I think it's quite likely we'll get multiple centroids with the same mean in this case (particularly for larger digests).

Sure, this example may be a bit artificial, but it does not seem all that far fetched ...

I don't think it's a far-fetched or artificial example at all, but I do think it would be handled soundly by the behavior I'm suggesting. First of all, imagine you do this uniformly. Think about the likelihood that you'll ever merge two centroids to the exact same floating-point mean if the contained points didn't all correspond to the same input value. If they do correspond to the same exact point then taking the min is principled. I encourage you to go through actually generating data for the examples you're thinking of. It's really hard to create multiple centroids at some exact value where the constituent inputs which comprise those centroids aren't that value.

If the data in a centroid does indeed correspond to a centroid where all the input data were the same point, then minimum of the count of the centroids which share that value is a good choice.
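As a sketch (helper name made up, reusing the same two-field node_t mirror as the earlier sketch):

     /* Sketch: instead of summing the counts of same-mean neighbours, take the
      * minimum, on the theory that every centroid at this mean holds copies of
      * the same point. */
     static double min_count_at_mean(const node_t *nodes, int n_nodes, int i)
     {
          double count = nodes[i].count;
          for (int j = i - 1; j >= 0 && nodes[j].mean == nodes[i].mean; j--)
               if (nodes[j].count < count)
                    count = nodes[j].count;
          for (int j = i + 1; j < n_nodes && nodes[j].mean == nodes[i].mean; j++)
               if (nodes[j].count < count)
                    count = nodes[j].count;
          return count;
     }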

Here's a tdigest from a uniform distribution over 96e6 and 104e6 rounded to integers.

{96000000.00, 1.00}
{96000000.00, 5.00}
{96000000.00, 107.00}
{96000000.00, 2578.00}
{97759142.07, 38558.00}
{101287390.74, 49424.00}
{103535681.61, 6558.00}
{104000000.00, 2653.00}
{104000000.00, 113.00}
{104000000.00, 2.00}
{104000000.00, 1.00}

What are the quantiles of 96.01ms (and, symmetrically, 103.99)? Ideally the former will be at least .02691 = (1+5+107+2578)/100000. We know from the raw data that the true quantile of 96.01 must be at least that.

Total count = 100000. First, summing all of the centroids at 96000000 (2691 in total), as you propose:

k == 1345.5
m == ( 97759142.07 - 96000000) / ((38558.0/2) + (2691/2)) ~=  85.29
x = (10000) / m ~= 117
return (1345.5 + 109) / 100000  ~= .0145

Now imagine if we use the minimum one:

k = 2690.5
m == ( 97759142.07 - 96000000) / ((38558.0/2) + (1/2)) ~=  91.244
x = (10000) / m ~= 109
return (2690.5 + 109) / 100000 ~= .02799
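And a throwaway snippet reproducing both numbers (again nothing library-specific, just the arithmetic above):

     #include <stdio.h>

     /* Reproduce the 96.01ms computations: total weight 100000, the four
      * centroids at 96000000 have counts 1 + 5 + 107 + 2578 = 2691, and the
      * next centroid is {97759142.07, 38558}. */
     int main(void)
     {
          double total = 100000, val = 96010000, left_mean = 96000000;
          double right_mean = 97759142.07, right_count = 38558;
          double summed = 2691;

          /* summing the same-mean centroids */
          double k = summed - summed / 2;
          double m = (right_mean - left_mean) / (right_count / 2 + summed / 2);
          printf("summed: %.4f\n", (k + (val - left_mean) / m) / total);   /* ~0.0146 */

          /* using the minimum count (1) instead */
          k = summed - 1.0 / 2;
          m = (right_mean - left_mean) / (right_count / 2 + 1.0 / 2);
          printf("min:    %.4f\n", (k + (val - left_mean) / m) / total);   /* ~0.0280 */
          return 0;
     }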

Ah, I see. I agree it seems hard to get multiple centroids with the same mean, unless all values in those centroids were equal to the mean.

I'm still not convinced using the minimum count is the right solution, though. What I think should be done instead is using 0 as the count (because everything is in the centre, so there's nothing on the right). If both prev/next centroids are like this, it's essentially a line with slope 0.
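In code the adjustment I mean would look roughly like this (a sketch; the spread arguments are count/2 for an ordinary centroid and 0 for one known to sit exactly on its mean):

     /* Sketch: slope between two neighbouring centroids, where a centroid that
      * is known to sit exactly on its mean contributes no spread at all. If
      * both spreads are 0 the caller has to special-case the flat step. */
     static double interpolation_slope(double left_mean, double left_spread,
                                       double right_mean, double right_spread)
     {
          return (right_mean - left_mean) / (left_spread + right_spread);
     }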

It does not make much difference in your examples, because the minimum counts are 1.0 in both cases. It might be more significant in the middle of the digest.

I’m okay with the idea of using 0 though it introduces special cases to the code. If you know all of your counts are integer values then using 1 seems pretty reasonable.

Hi, where did this end up? This seems to be the only impl linked from Dunning that has quantile_of, which seems pretty useful....

I haven't poked at this code really since I wrote it. https://github.com/tvondra/tdigest is almost certainly better if you're looking for a C implementation. I think that this particular issue isn't a huge deal in the grand scheme of things.

I guess maybe his repo won't be easy to use in another project because of its postgres ties. Should have looked first.

PRs welcome if you want to get your hands dirty.