Todo list
folkvir opened this issue · 7 comments
- HyperLogLog
- Top-K or Mine Top-K: Using Bloom Filters for Mining Top-k Frequent Itemsets in Data Streams, Younghee Kim et Al.
- Min Hash Wikipedia source, need to find the paper
- Scalable Bloom Filter, as the partitioned bloom filter is fixed now the Scalable bloom filter can be easily implemented.
- XorFilter #29
Feel free to modify this list or add suggestions in comments.
I don't need it. I didn't try it. But for the sake of completeness: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/
Links to the paper and implementations (no js ones) are available there.
( Hello btw! :3 )
Some update on this topic:
- The HyperLogLog has been implemented in release
v1.1.0
. - The TopK has been implemented in release
v1.2.0
. It uses a Count Min Sketch for frequency estimation and a Minheap for implementing a sliding window of values.
Some update: The MinHash has been released in v1.3.0
.
Thanks for your jobs.
Update: XOR filters has been released in the 2.0.0
version of the npm package, following #52 merge.
Update: the Scalable Bloom Filter with optimizations and bug fixes have been released in version 3.0.0
Hi @Callidon and @folkvir - thanks for a really useful library!
I had a suggestion for an additional data structure, as well as a question/suggestion for improvement on the CountingBloomFilter.
I have a need for a Time-To-Live, or "expiring", CountingBloomFilter. That is, when entries are added to the CountingBloomFilter, they will automatically be removed after some number of milliseconds. In addition, upon creation of the CountingBloomFilter, the creator can optionally specify a "dispose" callback function that will get called when the entry expires. The idea is generally to merge the concepts of a TTL Cache (e.g. https://github.com/isaacs/ttlcache) with a CountingBloomFilter. Obviously this could just be built as a separate data structure on top of CountingBloomFilter that manages the timeouts, but I think there is a lot of utility to having it in one single data structure. I am going to implement this for a project I need, but please let me know if you would like me to contribute this back to your repository.
My other question relates to the implementation of the _filter
member of the CountingBloomFilter
class. This Array stores a tuple pair at each index that is [bit 0 or 1, count of items at this index]. But that first bit is always set to 1 if count of items is > 0, and set to 0 if count of items === 0. So I might be missing something but why not just make _filter
an array of numbers where each entry is the count of items at that index, and then the has
method would just look at whether that value is > 0? Seems like having each value in the _filter
member storing an Array object with 2 numbers is a lot of unnecessary memory.