Takes a large, continuous data stream and retains only a reduced empirical CDF approximation.
Development has moved on to sample-distribution: about 10x more accurate and about as fast.
small, simple, no dependencies
• Example • Features • Limitations • Why • API • License
var HD = require('hdigest') // or: import HD from 'hdigest'

// recorder with 7 retained samples and built-in weighting
var hd0 = HD(7)
hd0.push(4)
hd0.push([5,3,6,2,7,1,8])
hd0.push(0)
console.log(hd0.min) // 0
console.log(hd0.max) // 8
console.log(hd0.quantile(0.5)) // 4
console.log(hd0.quantile([0, 0.5, 1])) // [0, 4, 8]
- very small code and memory footprint, suitable for a large number of instances
- less than 200 sloc, no dependencies, 2kb minified
- constant memory use: no compression steps, no triggered garbage collection
- significantly faster than other implementations (about 3-5x)
- tested with random floats, discrete values, sorted values, repeated values and skewed distributions
- no other utility methods (use the npm package lazy-stats for mean and variance)
- the retained sample values do not preserve the mean of the inputs
Based on the work of Dunning with significant changes to the algorithm:
- Retained samples represent the maximum value for a given rank (instead of a value-weighted centroid of fixed rank)
- No value interpolation. Values are kept as-is, but ranks are interpolated.
- Fixed length: every new value discards an old one
The above points are thought to yield the following benefits:
- The use of maxima instead of weighted centroids is closer to the underlying math (F(x) = P(X<=x))
- Use of simpler, faster fixed arrays instead of a tree structure
- No need for tree compression steps
- Better handling of sorted data, discrete data and repeated identical values
- Faster, with a smaller footprint when running hundreds of instances to measure hundreds of instruments
- No garbage collection required
- tdigest, based on the work of Dunning
- sample-distribution, 5-10x faster and more accurate (by some measure) than the above
Samples are retained based on how close their interpolated ranks fall to the target CDF probability points.
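As an illustration only (this is not the library's actual code), one way to pick which retained sample to discard when the recorder is full is to drop the interior sample whose interpolated rank deviates most from its target probability point. The helper `worstFit` below is hypothetical:

```js
// Hypothetical helper: returns the index of the interior retained sample
// whose rank/N deviates most from its target probability point.
// Endpoints (indices 0 and length-1) are kept to preserve min and max.
// Assumes at least 3 retained samples.
function worstFit (ranks, probs, N) {
  var worst = 1
  for (var i = 2; i < ranks.length - 1; i++) {
    var d = Math.abs(ranks[i] / N - probs[i])
    if (d > Math.abs(ranks[worst] / N - probs[worst])) worst = i
  }
  return worst
}
```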
The main function accepts 3 different input types:
- `{number} length`: length of the internal target CDF to be generated
- `{Array<number>} CDF`: if strictly increasing from 0 to 1, the array will be used as the target CDF
- `{Array<number>} PDF`: if not a CDF, the array will be treated as a PDF to be summed and normalized into a CDF
Note that to preserve the maxima, a PDF input will be padded with 0s at both ends if not already zero-padded. This results in a recorder length greater than that of the input PDF.
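For instance (the CDF and PDF values below are arbitrary):

```js
var HD = require('hdigest')

var byLength = HD(7)                    // 7-point internal target CDF
var byCDF = HD([0, 0.25, 0.5, 0.75, 1]) // strictly increasing 0..1: used as the target CDF
var byPDF = HD([1, 2, 1])               // summed and normalized into a CDF,
                                        // padded with 0s at both ends to preserve min and max
```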
- `.N` {number}: total samples received
- `.probs` {array}: internal sigmoid/CDF used for selecting retained samples
- `.values` {array}: selected retained sample values
- `.ranks` {array}: interpolated ranks of the retained samples
- `.min` {number}: the minimum of all samples, same as `quantile(0)`
- `.max` {number}: the maximum of all samples, same as `quantile(1)`
- `.ave` {number}: an approximation of the sample average derived from the internal CDF
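A short illustration of the properties (`values` and `ranks` depend on the selection steps, so only their shape is noted here):

```js
var HD = require('hdigest')
var hd = HD(5)
hd.push([1, 2, 3, 4, 5, 6, 7])

hd.N      // 7: total samples received
hd.min    // 1, same as hd.quantile(0)
hd.max    // 7, same as hd.quantile(1)
hd.values // the 5 retained sample values (including min and max)
hd.ranks  // interpolated ranks of those retained values
hd.ave    // approximate mean derived from the internal CDF
```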
- `.push(number | array)` {void}: adds sample value(s)
- `.quantile(number | array)` {number | array}: value(s) for the specified probability(ies)
- `.percentile(number | array)`: same as `quantile` under a different name (percentile values from the quantile function)
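For example, reusing the values from the example above (and assuming `percentile` behaves identically to `quantile`, as stated):

```js
var HD = require('hdigest')
var hd = HD(7)

hd.push(4)                     // single number
hd.push([5, 3, 6, 2, 7, 1, 8]) // array of numbers
hd.push(0)

hd.quantile(0.5)          // 4
hd.quantile([0.25, 0.75]) // array in, array out
hd.percentile(0.5)        // same result as hd.quantile(0.5)
```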