aws/random-cut-forest-by-aws

RCF 4.0

sudiptoguha opened this issue · 0 comments

This issue initiates RCF 4.0. The main two thrusts are further optimizations of memory and distributed access beyond the existing methods.

  • Streaming algorithms and data structures provide unified treatment of sequential analysis (as in Sequential Design of Experiments) and small space algorithms. While sequential analysis should be standard in time series, in some special cases, sequential analysis can be bypassed. However such an adjustment will not provide the same answers on a different input type. Such bypasses can become unhelpful in explainability or understanding the results of a sequential algorithm. This is particularly true for timed sketch data structures such as RCF, specially when shingle size is greater than 1. Thus this library aims to provide simple APIs for sequential analysis that can be applied to generic sequences. Two advantages ensue: first, if the data is already stored (in arrays, contiguous segments) then RCF can be constructed with significantly lower state (memory). Second, the sequential analysis can perform calibration in a streaming manner as well and provide an efficient explanation and error measurement of any inference. Information theoretically, a self calibration has more information (and thus less entropy) than any post hoc callibration/explanation.

  • Streaming algorithms can also unify distributed data — provided sequentiality is not required or is limited. This library aims to provide similar simple APIs that clarifies the use the RCF sketches in such a distributed setting. This is most relevant to shingle size being equal to 1 settings. Once again, if the data is already stored (as is common assumption in distributed setting) this can save memory and communication in building models for inference on distributed data. Likewise, a distributed calibration and explanation (perhaps with some more error) can be performed more efficiently, in terms of communication.

In effect the improvements would separate, clarify, and provide reference points of the usages of RCF in the two disparate settings mentioned above.