CamDavidsonPilon/tdigest

Corner case where values are identical?

microprediction opened this issue · 1 comment

I'm interested in the case where a variable takes on discrete values. I created a tdigest notebook to illustrate what might be an interesting issue.

Suppose I have sampled many rolls of a die. If I add a tiny amount of noise then tdigest works just fine as a nice representation of the data, with quite an accurate cdf and percentiles.
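A minimal sketch of the noise hack described above, using only the standard library. The helper `sample_rolls`, the noise width `eps`, and the `seed` are hypothetical names chosen for illustration; the notebook's actual noise distribution and magnitude are not stated in this issue.

```python
import random

def sample_rolls(n, hack, eps=1e-6, seed=0):
    """Simulate n rolls of a fair die; optionally add tiny uniform noise.

    `eps` is an assumed noise width, not a value taken from the notebook.
    """
    rng = random.Random(seed)
    rolls = [float(rng.randint(1, 6)) for _ in range(n)]
    if hack:
        # Jitter breaks the ties, so no two samples are likely to be identical.
        rolls = [r + rng.uniform(-eps, eps) for r in rolls]
    return rolls

plain = sample_rolls(10_000, hack=False)
jittered = sample_rolls(10_000, hack=True)

# Number of distinct values seen with and without the hack.
print(len(set(plain)), len(set(jittered)))
```

Without the hack the data contains at most six distinct values; with it, nearly every sample is distinct, while each value still rounds back to the original die face.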

However, if you run the same notebook with HACK=False, only six centroids are created, one per die face. This leads to gross inaccuracy in both the cdf and the percentiles.
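One plausible explanation, sketched below, is that interpolating between a handful of centroids smooths over the jumps of a step-function cdf. This is not the tdigest library's code: the interpolation rule (half of each centroid's weight on either side of its mean, linear in between) and the centroid weights are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical digest state: one (mean, weight) centroid per die face.
centroids = [(float(face), 1000) for face in range(1, 7)]
total = sum(w for _, w in centroids)

def exact_cdf(x):
    """True step-function cdf of the data: fraction of rolls <= x."""
    return sum(w for m, w in centroids if m <= x) / total

def interpolated_cdf(x):
    """Assumed scheme: cdf at a centroid's mean counts half its weight,
    with linear interpolation between neighbouring means."""
    means = [m for m, _ in centroids]
    mids, running = [], 0
    for _, w in centroids:
        mids.append((running + w / 2) / total)
        running += w
    if x < means[0]:
        return 0.0
    if x >= means[-1]:
        return 1.0
    i = bisect_right(means, x) - 1
    frac = (x - means[i]) / (means[i + 1] - means[i])
    return mids[i] + frac * (mids[i + 1] - mids[i])

# Compare the two curves at a few points between and on the die faces.
for x in (1.0, 3.0, 3.5, 3.9):
    print(x, exact_cdf(x), round(interpolated_cdf(x), 4))
```

Under these assumptions the two curves agree only midway between faces. At the faces themselves the interpolated value is off by half a face's probability mass, which is the kind of error that adding noise hides by spreading each face over many centroids.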

Is there a trick that would let tdigest handle cases like this without my hack?

Hi @microprediction, I'm surprised it fails so badly for discrete data. I don't know of a solution offhand; this will take some thought...