WriteTo() variant that exports only the bitset
maciej opened this issue · 9 comments
Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.
I'm eager to submit a PR to change that if we could agree on the design.
Off the top of my head, we could do one of these things:
- expose BloomFilter.b: the simplest solution, but it would allow developers to shoot themselves in the foot.
- add a function that would return a copy of BloomFilter.b, e.g. *BloomFilter.BitSetCopy()?
- add a *BloomFilter.WriteBitSetTo() function?
- and my favourite: add a *BloomFilter.BitSetWriter() function that would return a
type BitSetWriter interface {
	io.WriterTo
}
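To make the last option concrete, here is a minimal sketch of what a BitSetWriter could look like. It is only an illustration: filter is a hypothetical stand-in for bloom.BloomFilter (the real type keeps its bitset unexported), bitSetBytes is a hypothetical helper, and the big-endian word layout is an assumption, not something the library prescribes.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"io"
)

// filter is a hypothetical stand-in for bloom.BloomFilter.
type filter struct {
	m, k uint
	bits []uint64
}

// bitSetWriter holds a view of the bitset words; it has no m or k header.
type bitSetWriter struct {
	words []uint64
}

// WriteTo implements io.WriterTo by emitting each word as 8 big-endian bytes.
func (w bitSetWriter) WriteTo(out io.Writer) (int64, error) {
	buf := make([]byte, 8*len(w.words))
	for i, word := range w.words {
		binary.BigEndian.PutUint64(buf[8*i:], word)
	}
	n, err := out.Write(buf)
	return int64(n), err
}

// BitSetWriter returns a writer over the filter's bitset without copying it.
func (f *filter) BitSetWriter() io.WriterTo {
	return bitSetWriter{words: f.bits}
}

// bitSetBytes is a small helper that collects the output in memory.
func bitSetBytes(f *filter) ([]byte, error) {
	var buf bytes.Buffer
	if _, err := f.BitSetWriter().WriteTo(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```

Returning an io.WriterTo keeps the internal slice private while still letting callers stream the bits into any io.Writer, which is the trade-off the proposal is after.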
Is From() not sufficient? https://godoc.org/github.com/willf/bloom#From. We have been using that to serialize the BF in protobuf messages.
@db7 as I understand the GoDoc, func From(data []uint64, k uint) *BloomFilter creates a BloomFilter from a blob of data ([]uint64 here). I'd need the opposite – a ToBitSet(*bloom.BloomFilter) []byte (or []uint64 as the return type) function.
Yes, you're right. But you can simply make a slice and pass it to the From function. A uint64 slice is initialized to 0's anyway, so it's safe to use it in the new BF. You'll just need to calculate the slice length by dividing m (number of bits) by 64.
So this...
data := make([]uint64, m / 64) // perhaps you'd like to ceil the division.
bf := bloom.From(data, k)
should be equivalent to this...
bf := bloom.New(m, k)
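The "ceil the division" remark can be sketched as follows. The helper names wordsFor and newData are made up for illustration; the point is only that integer division truncates, so an m that is not a multiple of 64 needs one extra word.

```go
package main

// wordsFor returns how many 64-bit words are needed to hold m bits,
// rounding up so that a partial trailing word is still allocated.
func wordsFor(m uint) uint {
	return (m + 63) / 64
}

// newData allocates a zeroed backing slice large enough for m bits.
func newData(m uint) []uint64 {
	return make([]uint64, wordsFor(m))
}
```

With plain m / 64, a filter of 100 bits would get a single word (64 bits); with the rounded-up version it gets two.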
@maciej From does not copy the data; it simply uses the given slice. So if you modify the BF (e.g., by adding items), the given slice is modified.
This is how we work/serialize BFs.
data := make([]uint64, m / 64)
// store reference to data slice somewhere
// whenever we need to update or check the BF, we create a BF object with the slice.
bf := bloom.From(data, k)
bf.AddString("whatever")
bf.TestString("whatever")
...
// whenever we need to serialize the BF, we use the data slice.
buf, err := serialize(data)
To make it a bit cleaner, one could create a wrapper for the BF+data slice. It won't take more space in memory since the data slice is not copied.
type SerializableBF struct {
	*bloom.BloomFilter
	data []uint64
}

// NewSerializableBF keeps a reference to the slice backing the filter.
func NewSerializableBF(m, k uint) *SerializableBF {
	data := make([]uint64, (m+63)/64) // round up to whole 64-bit words
	return &SerializableBF{bloom.From(data, k), data}
}

func (s *SerializableBF) Serialize() ([]byte, error) {
	var buf []byte
	// serialize s.data in the format you want into buf
	return buf, nil
}
I hope I am not completely missing your issue... and perhaps this is not the most elegant solution either.
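For illustration, one way to encode a []uint64 data slice into bytes and back is sketched below. The function names encodeWords and decodeWords are hypothetical, and the fixed 8-bytes-per-word big-endian layout is just one possible format, not the one the library or the commenter uses.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// encodeWords packs the bitset words into bytes, 8 big-endian bytes per word.
func encodeWords(data []uint64) []byte {
	buf := make([]byte, 8*len(data))
	for i, w := range data {
		binary.BigEndian.PutUint64(buf[8*i:], w)
	}
	return buf
}

// decodeWords reverses encodeWords and rejects payloads that are not
// a whole number of 64-bit words.
func decodeWords(buf []byte) ([]uint64, error) {
	if len(buf)%8 != 0 {
		return nil, errors.New("bitset payload length is not a multiple of 8")
	}
	data := make([]uint64, len(buf)/8)
	for i := range data {
		data[i] = binary.BigEndian.Uint64(buf[8*i:])
	}
	return data, nil
}
```

A decoded slice could then be handed back to bloom.From together with k, as described above, provided the receiver knows k out of band.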
@maciej The current format saves m and k followed by the bitset.
It is a rather thin format.
Admittedly, using 64 bits per parameter is a bit wasteful, but it does not seem like a big deal. Furthermore, if your false-positive rate and capacity are fixed, you can omit these parameters...
We are talking about 16 bytes... which we could reduce to 8 bytes easily... If that is a large fraction of your storage... I'd be curious about your use case?
Can you elaborate?
I am totally open to proposing something finer.
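The 16-byte versus 8-byte figures can be illustrated with a small sketch. header64 and header32 are hypothetical helpers, and the big-endian byte order is an assumption; this only shows the size arithmetic for the two parameters, not the library's exact wire format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"math"
)

// header64 encodes m and k as two 64-bit values: 16 bytes in total.
func header64(m, k uint64) []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint64(buf[0:], m)
	binary.BigEndian.PutUint64(buf[8:], k)
	return buf
}

// header32 encodes m and k as two 32-bit values: 8 bytes in total.
// It fails if either parameter does not fit in 32 bits.
func header32(m, k uint64) ([]byte, error) {
	if m > math.MaxUint32 || k > math.MaxUint32 {
		return nil, errors.New("m or k does not fit in 32 bits")
	}
	buf := make([]byte, 8)
	binary.BigEndian.PutUint32(buf[0:], uint32(m))
	binary.BigEndian.PutUint32(buf[4:], uint32(k))
	return buf, nil
}
```

The narrower header caps m at roughly 4 billion bits, which is the price of the saved 8 bytes.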
m := uint(b.GetM())
k := uint(b.GetK())
return bloom.FromWithM(b.GetBitSet().GetSet(), m, k)