DataDog/sketches-go

Can one get CountBelow / CountAbove a percentile?

senser-adi opened this issue · 5 comments

I would like to use the histogram to get the number of values above/below percentile X.
This could be very helpful since I already got the data (approximated, with the required accuracy).

I couldn't find any such API or internal logic.
Is there a way to get this? If not, is this planned?

Thanks a lot for this great tool!!


Describe what happened:

Describe what you expected:

Steps to reproduce the issue:

Hi, happy to hear that you're using DDSketch. For your question, I guess there are a couple possibilities. For a quantile q (e.g., 0.5), for "the count of values below q" we could use: 1) q * GetCount() ; 2) the sum of all the counts for buckets including the bucket corresponding to the rank for q; 3) the sum of all the counts for buckets up to but not including the bucket corresponding to the rank for q.

I think option 1 is probably the best. I think the error for options 2 and 3 could be arbitrarily large depending on the data and q. E.g., for option 2, if the bucket corresponding to the rank for q holds all the data below q, and the rank happens to be close to the right edge of the bucket, option 2 will return 0. For option 3, we have a similar situation if the rank happens to be close to the left edge of the bucket and the bucket holds all the data above q.

Thanks @githomin for the quick reply..

  1. the sum of all the counts for buckets including the bucket corresponding to the rank for q;

I thought about that option, but I couldn't find the way to get the (1) list of buckets (2) count per bucket. Is there any recommended way to get this from the DDSketch object?


A follow up question - Is there a recommended way to get CountAbove/Below(value) , not a percentile?
The straightforward way would be doing a search (e.g. binary) on the GetValueAtQuantile() in a chosen resolution (e.g. 1%) and finding the closest Quantile matching this value, and then using one of the methods you described above.
Does this make sense?

thanks again!

For the follow-up question, you can use the ForEach method to get the count up to a value. For the initial question, I guess you could do a GetValueAtQuantile(q) and then use ForEach, but it would be much simpler to just do q * GetCount(). Can you elaborate on what the use case is?

Thanks again @githomin !
I agree that using q * GetCount() is the best way to know number of values above

The ForEach was exactly what i missed and needed..
Per your question about the use-case:
I'm gathering latencies from distributed entities and aggregating them into a single final sketch.
Then, I might get a "target" in [mSec] (value) as input to return how many samples were above/below (% of Total). For that i need the ForEach to count all values below this value as you explained...
Hope I got everything right!
I guess we can close the issue...

Thanks!

Makes sense to me. Happy to help!