The trimmed mean estimate is bad
DataAI opened this issue · 2 comments
The implementation of the trimmed mean estimate (the `trimmed_mean` method) doesn't look right. The estimate is way off from the true value. Here is an example:
```python
import numpy as np
from tdigest import TDigest

# Create 10000 samples from a uniform distribution on [0, 100).
x = np.random.random(size=10000) * 100

# Create a T-Digest for the samples.
d = TDigest()
d.batch_update(x)

# Estimate the trimmed mean of x above the 25th percentile.
tm_estimate = d.trimmed_mean(25, 100)
print(tm_estimate)
# 75.0410094085

# Now find the real 25th percentile and compute the real trimmed mean.
x_25 = np.percentile(x, 25)
x_trimmed = x[x >= x_25]
tm_real = x_trimmed.mean()
print(tm_real)
# 62.3013933259
```
Awesome, thanks for the bug report. Any ideas about what might be causing it? I'll look into it this evening.
First, the interpolation only handles the left edge. The code inside the `if p1 < t + k_i:` clause does a linear interpolation based on the centroids adjacent to the lower boundary. But note that `t` increases after each iteration, so the condition `p1 < t + k_i` will always remain satisfied for centroids above the lower boundary. The interpolation is only needed for the boundary centroids; `nu` should be just `1` for the non-boundary centroids.
```python
for i, key in enumerate(self.C.keys()):
    c_i = self.C[key]
    k_i = c_i.count
    if p1 < t + k_i:
        if i == 0:
            delta = self.C.succ_item(key)[1].mean - c_i.mean
        elif i == len(self) - 1:
            delta = c_i.mean - self.C.prev_item(key)[1].mean
        else:
            delta = (self.C.succ_item(key)[1].mean - self.C.prev_item(key)[1].mean) / 2.
        nu = ((p1 - t) / k_i - 0.5) * delta
        s += nu * k_i * c_i.mean
        k += nu * k_i
        if p2 < t + k_i:
            return s / k
    t += k_i
```
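To illustrate the point above, here is a minimal sketch of a corrected accumulation loop, detached from the tdigest internals. It assumes a hypothetical `centroids` list of `(mean, count)` pairs sorted by mean, with `p1` and `p2` already converted from percentiles into rank (count) units, as they are by the time the quoted loop runs. Each centroid contributes the fraction of its count that falls inside `[p1, p2]`, which is exactly `1` for interior centroids; this is not the library's actual fix, just one way to realize "nu should be 1 for non-boundary centroids":

```python
def trimmed_mean_sketch(centroids, p1, p2):
    """Trimmed mean over (mean, count) centroids sorted by mean.

    p1, p2 are rank boundaries in count units. Each centroid's count is
    treated as spread uniformly over the rank interval [t, t + count).
    """
    s = k = t = 0.0
    for mean, count in centroids:
        if t + count > p1 and t < p2:
            # Overlap of this centroid's rank interval with [p1, p2].
            lo = max(p1, t)
            hi = min(p2, t + count)
            nu = (hi - lo) / count  # 1.0 for fully interior centroids
            s += nu * count * mean
            k += nu * count
        t += count
    return s / k
```

With four unit-count centroids at 1, 2, 3, 4 and `p1=1, p2=4`, the first centroid is skipped entirely and the result is the plain mean of the remaining three, as expected.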