/sampl

Like "head" or "tail" but picks 10 random lines

Primary LanguageOCamlOtherNOASSERTION

                                    sampl
                 like "head" or "tail" but picks 10 random lines


sampl is a command-line tool for randomly picking a number of lines from
large data files. By default, it picks 10 lines like the Unix commands
"head" and "tail".

sampl is useful in order to get an idea of what's in a data file without
worrying whether the beginning of the file is representative of the rest.

sampl is fast because it doesn't scan the entire file(s) but simply picks
lines found at random positions.


Installation
============

Requires a standard installation of OCaml.

 $ make
 $ make install  # Installation directory defaults to $HOME/bin.

PREFIX and BINDIR are supported, so if you want to install sampl in /usr/local,
just do:

 $ sudo make PREFIX=/usr/local install


Uninstallation:

 $ make uninstall


Usage
=====

"sampl" is typically used as replacement for "head" on large data files
whose first records are often not representative of typical lines.


$ ./sampl --help
Usage: ./sampl [OPTIONS] FILE1 [FILE2 ...]

sampl picks 10 random lines from the input files without scanning 
the entire file.

Options:

  -i 
          Keep outputting random lines indefinitely instead of stopping
          after 10 lines. See also -n.
  -n NUMBER
          Specify sample size, i.e. the number of lines to pick.
          The default is 10 lines. See also -i.
  -s NUMBER
          Specify seed of the random number generator.
          By default, the seed is initialized randomly from system resources.
  -version 
          Print sampl version and exit.
  -help  Display this list of options
  --help  Display this list of options



Algorithm for picking a random line
===================================

1. A random byte position is chosen in the input files,
   just as if the files were concatenated as one big file.

2. The position is moved forward to the nearest beginning of line.

3. The line is read and printed to stdout.

This algorithm gives more chances to lines that follow long lines.
Therefore, lines are picked really "randomly" only if the length of a
line is statistically independent from any property of the line
that follows.

Also, the same line can be picked several times over the execution of the
program.