CamDavidsonPilon/tdigest

The trimmed mean estimate is bad

DataAI opened this issue · 2 comments

The implementation of the trimmed mean estimate (the trimmed_mean method) doesn't look right: the estimate is far from the real value. Here is an example:

import numpy as np
from tdigest import TDigest

Create 10000 samples from a uniform distribution on [0, 100).

x = np.random.random(size=10000)*100

Create a T-Digest for this

d = TDigest()
d.batch_update(x)

Estimate the trimmed mean of x above the 25th percentile.

tm_estimate = d.trimmed_mean(25,100)
print(tm_estimate)
75.0410094085

Now find the real 25th percentile and compute the real trimmed mean.

x_25 = np.percentile(x,25)

x_trimmed = x[x>=x_25]
tm_real = x_trimmed.mean()
print(tm_real)
62.3013933259
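
As a sanity check, and assuming the samples really are uniform on [0, 100) as above, the trimmed mean above the 25th percentile should be close to (25 + 100) / 2 = 62.5. A minimal sketch of that check using NumPy only (the seed is arbitrary and is there just to make the run repeatable):

```python
import numpy as np

# Arbitrary seed, only so the check is repeatable.
rng = np.random.default_rng(0)
x = rng.random(10_000) * 100

# Exact trimmed mean of the sample above its 25th percentile.
x_25 = np.percentile(x, 25)
tm_real = x[x >= x_25].mean()

# For a uniform distribution on [0, 100), the values above the 25th
# percentile are uniform on [25, 100), so their mean is the midpoint.
expected = (25 + 100) / 2

print(expected, tm_real)
```

The exact value agrees with the analytical midpoint up to sampling noise, so an estimate near 75 cannot be explained by randomness in the sample.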

Awesome, thanks for the bug report. Any ideas on what might be causing it? I'll look into it a bit this evening.

First, the interpolation is done only at the left edge.

The code inside the if p1 < t + k_i: clause does a linear interpolation based on the centroids next to the lower boundary. Because t increases on every iteration, the condition p1 < t + k_i is satisfied by every centroid above the lower boundary, not only by the boundary centroid itself. Interpolation is only needed for boundary centroids; nu should simply be 1 for non-boundary centroids.

for i, key in enumerate(self.C.keys()):
    c_i = self.C[key]
    k_i = c_i.count
    if p1 < t + k_i:
        if i == 0:
            delta = self.C.succ_item(key)[1].mean - c_i.mean
        elif i == len(self) - 1:
            delta = c_i.mean - self.C.prev_item(key)[1].mean
        else:
            delta = (self.C.succ_item(key)[1].mean - self.C.prev_item(key)[1].mean) / 2.
        nu = ((p1 - t) / k_i - 0.5) * delta
        s += nu * k_i * c_i.mean
        k += nu * k_i

    if p2 < t + k_i:
        return s/k
    t += k_i
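
For comparison, here is a rough sketch of the weighting described above. It is not the library's code: it assumes the centroids are available as a list of (mean, count) pairs sorted by mean, that p1 and p2 are percentiles in [0, 100], and the name trimmed_mean_sketch is made up for illustration. Each centroid is weighted by how much of its count falls inside the [p1, p2] range, so interior centroids get full weight (nu = 1) and only the two boundary centroids are scaled down. It does not interpolate means between neighbouring centroids.

```python
def trimmed_mean_sketch(centroids, p1, p2):
    """Trimmed mean from (mean, count) pairs sorted by mean.

    Hypothetical helper for illustration; p1 and p2 are percentiles.
    """
    if not 0 <= p1 < p2 <= 100:
        raise ValueError("expected 0 <= p1 < p2 <= 100")

    total = sum(count for _, count in centroids)
    lo = p1 / 100.0 * total  # cumulative count at the lower cut
    hi = p2 / 100.0 * total  # cumulative count at the upper cut

    s = 0.0  # weighted sum of centroid means
    k = 0.0  # total weight accumulated
    t = 0.0  # cumulative count before the current centroid
    for mean, count in centroids:
        # Part of this centroid's count that lies inside [lo, hi].
        overlap = min(t + count, hi) - max(t, lo)
        if overlap > 0:
            s += overlap * mean
            k += overlap
        t += count
        if t >= hi:
            break
    return s / k


# 100 unit-weight centroids with means 0.5, 1.5, ..., 99.5.
centroids = [(i + 0.5, 1) for i in range(100)]
result = trimmed_mean_sketch(centroids, 25, 100)
print(result)
```

With evenly spread centroids like these, the result lands on the midpoint of the kept range, which is what the uniform example in this issue should produce.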