
Random sampling from stream


Data sampler

Generates a random, representative sample of a given length from a stream.

The sampling is implemented with the reservoir sampling algorithm: https://en.wikipedia.org/wiki/Reservoir_sampling
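As an illustration of the idea, here is a minimal sketch of the basic variant of reservoir sampling (often called Algorithm R). It is not taken from this project's sources: the class name ReservoirSketch and the method sample are hypothetical, and the int counter is a simplification that limits the stream to roughly 2^31 items.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not the project's actual implementation.
final class ReservoirSketch {

    /** Returns up to k items chosen uniformly at random from the stream. */
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> reservoir = new ArrayList<T>(k);
        int seen = 0; // items consumed so far (assumption: stream fits in an int counter)
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items always enter the reservoir.
                reservoir.add(item);
            } else {
                // Replace phase: the new item is kept with probability k / seen.
                int j = rng.nextInt(seen); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Prints three of the ten numbers; the choice differs between runs.
        System.out.println(sample(data.iterator(), 3, new Random()));
    }
}
```

The stream is read once and only k items are held in memory, which is why the approach suits input of unknown length such as stdin.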

Brief application description

Producing a data sample requires a random number generator (RNG). Three options are supported, selected with --generator: STANDARD, FAST, and SECURE.

Stream to be consumed:

  • stdin
  • internal emulator, which produces a given number of bytes. The emulator uses the same RNG as the sampling algorithm.
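One way such an emulator could look is an InputStream that hands out a fixed number of random bytes drawn from a supplied RNG. This is only a sketch under that assumption: the class name RandomStreamSketch is hypothetical and the project's real emulator may be structured differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

// Hypothetical sketch of an emulated input stream, not the project's actual class.
final class RandomStreamSketch extends InputStream {
    private final Random rng;   // can be the same instance the sampler uses
    private long remaining;     // bytes still to be produced

    RandomStreamSketch(Random rng, long length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must not be negative");
        }
        this.rng = rng;
        this.remaining = length;
    }

    @Override
    public int read() {
        if (remaining <= 0) {
            return -1; // end of stream after the requested number of bytes
        }
        remaining--;
        return rng.nextInt(256); // one byte as an int in [0, 255]
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new RandomStreamSketch(new Random(), 2048);
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        System.out.println(count + " bytes produced");
    }
}
```

Because the stream and the sampler draw from one RNG and are consumed in a single thread, the sketch needs no synchronization.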

The whole program is left thread-unsafe for simplicity.

An alternative was to implement it with PipedInputStream/PipedOutputStream and several threads. This approach was intentionally discarded because it would have made the whole app more complicated and error-prone.

Build

Thanks to the Gradle wrapper, the build is straightforward.

Make sure the JAVA_HOME environment variable points to JDK 1.7 or 1.8.

Run ./gradlew check to build and run the tests.

Run ./gradlew installApp to build and produce a directory with the ready-to-run application.

The target script is ./build/install/sampler/bin/sampler.

Usage

CLI help:

sampler [options...]
 --byte                                 : Treat stream as bytes (text by
                                          default) (default: false)
 --emulate                              : Set flag to use internal random
                                          stream (default: false)
 --emulate-length N                     : Bytes to generate, defaults to 2048
                                          (default: 2048)
 --generator [STANDARD | FAST | SECURE] : Random generator (defaults to classic
                                          Random) (default: FAST)
 --length N                             : Specify sample length

  Example: sampler --byte --emulate --emulate-length N --generator [STANDARD | FAST | SECURE] --length N