Compact serialized form for BloomFilter
dleppik opened this issue ยท 5 comments
The serialized form of BloomFilter is often larger than the original data. I haven't tested in-memory usage, but since it uses an array of 64-bit 1's and 0's, the same is likely true. This implies that there is no advantage to a BloomFilter over a JavaScript Set when hashing short strings.
For example, an 8.2 Mb file (3.6 Mb gzip compressed) of 1 million leaked passwords serializes to 27 Mb (2.3 Mb compressed). This is with false positive tolerance of 0.1%.
I have written a fix for this which replaces the array with a BitSet backed by a Uint8Array and serializes the array with base64 encoding. This changes the 27 Mb filter to 2.3 Mb (1.3 Mb compressed).
I'm prepared to make a pull request for this fix, if you give me permission. Otherwise I can send you the changes a different way.
Hello
Indeed, that's a bit of an issue. It's great if you already found a solution. Your implementation idea seems good, so feel free to send a pull request with the final version ๐
On a side note, do you think this issue can happen with other data structures in the framework, aside from bloom filters? And if so, do you think your fix can be applied too?
Looks like I need to be a collaborator to do a pull request.
This can be applied as a drop-in replacement anywhere that you're just storing booleans. So, for example, you can't use it for a counting Bloom filter. However, the general principal applies. If your counting Bloom filter only counts up to 255, you can replace the array with a Uint8Array.
Right now you're using 64-bit Floats for everything. JavaScript implementations are smart enough to implement these with native 64-bit arrays. (I confirmed with "node --inspect" that the BitSet implementation is 64x more memory efficient.) So it makes sense to use a TypedArray where it will make a significant difference. The good news is that in many cases you can replace your allocateArray() with a constructor for the appropriate TypedArray and immediately see an improvement. Since what you really want in most cases is a 32-bit (or less) integer, you can cut your memory usage in half. Also, TypedArrays are initially all 0's, so you can eliminate that code.
I am using a Uint8Array so I don't have to deal with endianness when serializing. It's just a few extra lines of code to use a DataView, but then I'd feel the need to test serialization on multiple architectures.
Nice news ๐ You should be able to do a pull request If you first fork the repository and then request a PR from it. Otherwise, I will look into this when I have some spare time next week. I'm very interested in generalizing this idea to as many data structures as possible because the memory gain will be significant.
Except for IBLTs where Buffers are used, it might be worth to implement all structures with Uint8Arrays yes! But it must be usable in both server and browser sides, the library is used on both sides so the optimized structures must be usable in both environments. Another question, 0s and 1s where used primarily for simplicity of implementations but it is also useful for using the structure back and forth from/to another language/implementation without having to transform all the time the core structure into its simplest form. In consequence if we can safely choose which core implementation to use, then I completely agree with all future modifications. It will be a significant improvements to the memory usage ๐ฅ