mmmultimap

memory-mapped multimap

This implements a memory backed multimap intended for use where:

your keys are integers, or can be mapped to dense range of integers,
the memory mapped file is on fast storage, like an SSD (although this is not a requirement),
you have arbitrary values of fixed size (e.g. structs, other POD types) that can be sorted,
you don't need dynamic updates of the table,
and you are likely to run out of memory of you use a traditional map or hash table,
but you can handle approximately 1 bit per record in RAM.

These may seem to be very specific, but many problems can be mapped into a dense integer set. mmmultimap developed first as a data structure to support seqwish, which uses it to generate precise variation graphs from pairwise alignments between collections of sequences. As this multimap forms a key data processing kernel in the algorithm, it can scale to extremely large problem sizes, limited only by available disk space. Although performance is much slower than an in-memory structure, we are virtually guaranteed to be able to complete the compute.

algorithm and usage

To construct the mmmultimap:

mmmulti::map<uint64_t, uint64_t> mm("temp.dat");

We can then add keys:

mm.append(key1, value1);
mm.append(key1, value2);
mm.append(key1, value3);
mm.append(key2, value4);
mm.append(key2, value2);

This is thread-safe when using OpenMP for parallelism. Each OMP thread writes to a different file. These are combined during the indexing phase.

To query the mmmultimap, first index it, providing the maximum key to expect (remember, we're working on dense keys!):

mm.index(max_key);

If we index without providing a maximum key, like this:

mm.index();

... then we don't pad the multimap, and we can only enumerate the keys inside using e.g. for_each_pair. This can have some advantages, such as not requiring that our values be non-null (positive integers for instance, or structs with non-null entries). This allows us to simply use the sorted array functionality of the mmmultimap.

indexing algorithm

If max_key is specified, we first pads the records with one key/null record per key in the range of [0,max_key]. Then, we concatenates the thread-specific files we generated by calling append into one file. Then, we memory map this file and apply the ips4o in-place parallel super scalar samplesort to the memory mapped buffer to order the records. When padded (max_key is specified), the index is completed by building a bitvector of length max_key, marking 1 for the first instance of a key in the sorted order, and building an auxiliary data structure to support select_1(n) queries on it that allow us to find the records associated with a given key. Without this index, we can only iterate through the key/value pairs.

supported queries

It is now possible to iterate through the records in order with for_each_pair or in parallel with for_each_pair_parallel. We can look up the nth key or value with nth_key and nth_value. And we can iterate over the values or unique values of a given key with for_values_of and for_unique_values_of.

building and testing

mmmultimap is intended to be used as a library in other applications. It can be included easily as a CMake ExternalProject. A test utility is included to demonstrate usage. To build it:

cmake -H. -Bbuild && cmake --build build -- -j 4

And to run some tests:

bin/mmmultimap -t x -s 10000000 -t 4
10040123 keys
15099011 values
10688272 unique pairs
rm -f x # removes test file

development

By adding a PMHF to the frontend, it should be possible to project arbitrary key sets into the dense range required by mmmultimap with only a few bits of overhead per entry. This would also obviate the need to pad our key space.

author

Erik Garrison erik.garrison@gmail.com

license

MIT

jeizenga/mmmultimap