When you don't have enough memory to sort a large number of values an external sort can be used.
This library implements an external sort algorithm using Gabriel Gonzales'
pipes
library and Ben Gamari's
pipes-interleave
extension.
The library exports two functions, externalSortFile
and externalSortHandle
. The former is
simply a wrapper around the latter. The first argument to both functions is a configuration
value with the polymorphic type ExternalSortCfg a
. The configuration value contains:
readVal
: a function to read a value of typea
from aHandle
writeVal
: a function to write a value of typea
to aHandle
chunkSize
: the number of values that will be read into memory at any one timesorter
: a sort algorithm to use on each chunk. (Rather than choose one for you, you are free to use your own)comparer
: a function to compare two values
It is up to the user to work out how much memory chunkSize
values will use at it will depend
on the type being read/written.
The algorithm proceeds in two stages. In the first phase chunkSize
values are repeatedly read
into memory, sorted, and then written to temporary intermediate files. The number of intermediate
files should be ceiling (n div chunkSize)
where n
is the number of values in the file.
In the second phase all of the temporary intermediate files are opened for reading and a k-way
merge is performed on the values. The results are streamed to the output file/handle (depending
on which algorithm you use). It is important that your operating system allow this many open files
otherwise you will receive a openFile: resource exhausted (Too many open files)
exception.