sampl

like "head" or "tail" but picks 10 random lines

sampl is a command-line tool for randomly picking a number of lines from
large data files. By default, it picks 10 lines, like the Unix commands
"head" and "tail".

sampl is useful for getting an idea of what's in a data file without
worrying whether the beginning of the file is representative of the rest.

sampl is fast because it doesn't scan the entire file(s) but simply picks
lines found at random positions.


Installation
============

Requires a standard installation of OCaml.

  $ make
  $ make install  # Installation directory defaults to $HOME/bin.

PREFIX and BINDIR are supported, so if you want to install sampl in
/usr/local, just do:

  $ sudo make PREFIX=/usr/local install

Uninstallation:

  $ make uninstall


Usage
=====

"sampl" is typically used as a replacement for "head" on large data files
whose first records are often not representative of typical lines.

  $ ./sampl --help
  Usage: ./sampl [OPTIONS] FILE1 [FILE2 ...]
  sampl picks 10 random lines from the input files without scanning the
  entire file.
  Options:
    -i
        Keep outputting random lines indefinitely instead of stopping
        after 10 lines. See also -n.
    -n NUMBER
        Specify sample size, i.e. the number of lines to pick.
        The default is 10 lines. See also -i.
    -s NUMBER
        Specify seed of the random number generator. By default, the
        seed is initialized randomly from system resources.
    -version
        Print sampl version and exit.
    -help  Display this list of options
    --help  Display this list of options


Algorithm for picking a random line
===================================

1. A random byte position is chosen in the input files, just as if the
   files were concatenated as one big file.
2. The position is moved forward to the nearest beginning of line.
3. The line is read and printed to stdout.

This algorithm gives more chances to lines that follow long lines.
Therefore, the sample is unbiased with respect to a given property only if
the length of a line is statistically independent of that property in the
line that follows it. Also, the same line can be picked more than once
during a single run of the program.
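
The three steps above can be illustrated with a minimal Python sketch. It is
not sampl's actual OCaml source: the names pick_line and sample are
hypothetical, it handles a single file rather than several concatenated
ones, and wrapping around to the first line when the random position falls
in the last line is an assumption, since the text above does not say how
that case is handled.

```python
import os
import random


def pick_line(path, rng):
    """Return one line (as bytes) found after a random byte position."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot sample from an empty file")
    with open(path, "rb") as f:
        # Step 1: choose a random byte position in the file.
        f.seek(rng.randrange(size))
        # Step 2: skip the rest of the current line, so that the file
        # position sits at the beginning of the next line.
        f.readline()
        # Step 3: read the line that starts here.
        line = f.readline()
        if not line:
            # Assumption: the position fell in the last line, so wrap
            # around and return the first line of the file.
            f.seek(0)
            line = f.readline()
        return line


def sample(path, n=10, seed=None):
    """Pick n lines; the same line may be returned more than once."""
    rng = random.Random(seed)
    return [pick_line(path, rng) for _ in range(n)]
```

In this sketch a line is returned whenever the random position lands
anywhere in the line before it, so its chance of being picked is
proportional to the byte length of the preceding line. The cost of one pick
does not depend on the file size, because only two lines are read.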