Can one get CountBelow / CountAbove a percentile?
senser-adi opened this issue · 5 comments
I would like to use the histogram to get the number of values above/below percentile X.
This could be very helpful since I already got the data (approximated, with the required accuracy).
I couldn't find any such API or internal logic.
Is there a way to get this? If not, is this planned?
Thanks a lot for this great tool!!
Describe what happened:
Describe what you expected:
Steps to reproduce the issue:
Hi, happy to hear that you're using DDSketch. For your question, I guess there are a couple possibilities. For a quantile q (e.g., 0.5), for "the count of values below q" we could use: 1) q * GetCount() ; 2) the sum of all the counts for buckets including the bucket corresponding to the rank for q; 3) the sum of all the counts for buckets up to but not including the bucket corresponding to the rank for q.
I think option 1 is probably the best. I think the error for options 2 and 3 could be arbitrarily large depending on the data and q. E.g., for option 2, if the bucket corresponding to the rank for q holds all the data below q, and the rank happens to be close to the right edge of the bucket, option 2 will return 0. For option 3, we have a similar situation if the rank happens to be close to the left edge of the bucket and the bucket holds all the data above q.
Thanks @githomin for the quick reply..
- the sum of all the counts for buckets including the bucket corresponding to the rank for q;
I thought about that option, but I couldn't find the way to get the (1) list of buckets (2) count per bucket. Is there any recommended way to get this from the DDSketch object?
A follow up question - Is there a recommended way to get CountAbove/Below(value) , not a percentile?
The straightforward way would be doing a search (e.g. binary) on the GetValueAtQuantile() in a chosen resolution (e.g. 1%) and finding the closest Quantile matching this value, and then using one of the methods you described above.
Does this make sense?
thanks again!
For the follow-up question, you can use the ForEach
method to get the count up to a value. For the initial question, I guess you could do a GetValueAtQuantile(q)
and then use ForEach
, but it would be much simpler to just do q * GetCount()
. Can you elaborate on what the use case is?
Thanks again @githomin !
I agree that using q * GetCount() is the best way to know number of values above
The ForEach
was exactly what i missed and needed..
Per your question about the use-case:
I'm gathering latencies from distributed entities and aggregating them into a single final sketch.
Then, I might get a "target" in [mSec] (value) as input to return how many samples were above/below (% of Total). For that i need the ForEach to count all values below this value as you explained...
Hope I got everything right!
I guess we can close the issue...
Thanks!
Makes sense to me. Happy to help!