Sampling from a continuous stream is a useful technique for efficiently inferring information about a potentially large body of data. The literature offers several sampling strategies that vary in complexity. I'd like to introduce you to a rather simple one that is easy to implement, easy to reason about, and might take you a long way before you need a more advanced solution: Bernoulli sampling.
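To make the idea concrete, here is a minimal Python sketch of plain Bernoulli sampling: every element of the stream is included in the sample independently with a fixed probability `p`. This is not the implementation from this repository and it does not cover the curtailed variant. The function name `bernoulli_sample` and its parameters are hypothetical and chosen only for illustration.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def bernoulli_sample(
    stream: Iterable[T],
    p: float,
    rng: Optional[random.Random] = None,
) -> Iterator[T]:
    """Yield each element of `stream` independently with probability `p`.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = rng or random.Random()
    for item in stream:
        # random() returns a float in [0.0, 1.0), so p = 0 keeps nothing
        # and p = 1 keeps everything.
        if rng.random() < p:
            yield item


if __name__ == "__main__":
    # Sample roughly 10% of a stream of 10,000 synthetic events.
    events = range(10_000)
    sample = list(bernoulli_sample(events, p=0.1, rng=random.Random(42)))
    print(f"kept {len(sample)} of {len(events)} events")
```

Because each element is decided on its own, the sampler needs no state beyond the random number generator and works on streams of unknown length. The trade-off is that the sample size is itself random: for `n` elements it is `n * p` only in expectation.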
You'll find an implementation of a curtailed Bernoulli sampler in this repository (cf. Curtailed). You'll also find a small showcase that demonstrates the integration of this sampling strategy with a Kafka-based stream processor that consumes domain events.
Have a look at the corresponding blog post for a detailed explanation of the implemented sampling strategy.
This work is released under the terms of the LGPL v3.