mmmultimap

memory-mapped multimap

This implements a memory-mapped multimap intended for use where:

  • your keys are integers, or can be mapped to a dense range of integers,
  • the memory mapped file is on fast storage, like an SSD (although this is not a requirement),
  • you have arbitrary values of fixed size (e.g. structs, other POD types) that can be sorted,
  • you don't need dynamic updates of the table,
  • and you are likely to run out of memory if you use a traditional map or hash table,
  • but you can handle approximately 1 bit per record in RAM.

These constraints may seem very specific, but many problems can be mapped into a dense integer set. mmmultimap was first developed as a data structure to support seqwish, which uses it to generate precise variation graphs from pairwise alignments between collections of sequences. Because this multimap forms a key data processing kernel in that algorithm, seqwish can scale to extremely large problem sizes, limited only by available disk space. Although performance is much slower than with an in-memory structure, we are virtually guaranteed to be able to complete the computation.

algorithm and usage

To construct the mmmultimap:

mmmulti::map<uint64_t, uint64_t> mm("temp.dat");

We can then add key/value pairs:

mm.append(key1, value1);
mm.append(key1, value2);
mm.append(key1, value3);
mm.append(key2, value4);
mm.append(key2, value2);

Appending is thread-safe when using OpenMP for parallelism: each OMP thread writes to its own file, and these files are combined during the indexing phase.
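For example, a parallel bulk load might look like the following sketch. The header path, the OpenMP pragma, and the key/value scheme here are illustrative assumptions, not part of the library's documented interface:

#include <cstdint>
#include "mmmultimap.hpp" // header name assumed; adjust to your include path

int main() {
    mmmulti::map<uint64_t, uint64_t> mm("temp.dat");
    // each OpenMP thread appends into its own backing file, so no locking is needed
#pragma omp parallel for
    for (uint64_t i = 0; i < 1000000; ++i) {
        mm.append(i % 1000,  // a dense key in [0,1000)
                  i + 1);    // a non-null (positive) value
    }
    // indexing (described below) merges the per-thread files into one
    return 0;
}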

To query the mmmultimap, first index it, providing the maximum key to expect (remember, we're working on dense keys!):

mm.index(max_key);

If we index without providing a maximum key, like this:

mm.index();

... then we don't pad the multimap, and we can only enumerate the key/value pairs it contains, using e.g. for_each_pair. This has some advantages, such as not requiring that our values be non-null (for instance, positive integers, or structs without null entries). It lets us simply use the sorted-array functionality of the mmmultimap.
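For example, after indexing we can walk the sorted pairs and print them. This is a minimal sketch; the callback signature (key and value passed to a lambda) is an assumption:

// print every key/value pair in sorted order
mm.for_each_pair([&](const uint64_t& key, const uint64_t& value) {
    std::cout << key << "\t" << value << "\n";
});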

indexing algorithm

If max_key is specified, we first pad the records with one key/null-value record per key in the range [0,max_key]. Then we concatenate the thread-specific files generated by calls to append into one file. We then memory-map this file and apply the ips4o in-place parallel super scalar samplesort to order the records in the memory-mapped buffer. When padded (max_key is specified), the index is completed by building a bitvector of length max_key, marking 1 for the first instance of each key in the sorted order, and building an auxiliary data structure that supports select_1(n) queries on it, which lets us find the records associated with a given key. Without this index, we can only iterate through the key/value pairs.
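To illustrate the idea behind the padded index, here is a toy sketch with a naive linear-scan select rather than the succinct structure the library builds. Because padding guarantees every key in [0,max_key] occurs at least once, consecutive select_1 positions bracket the run of records belonging to each key:

#include <cstdint>
#include <utility>
#include <vector>

using Record = std::pair<uint64_t, uint64_t>; // (key, value), sorted by key

// mark the first record of each key in the sorted array
std::vector<bool> mark_first_of_key(const std::vector<Record>& records) {
    std::vector<bool> bv(records.size(), false);
    for (size_t i = 0; i < records.size(); ++i) {
        if (i == 0 || records[i].first != records[i-1].first) bv[i] = true;
    }
    return bv;
}

// naive select_1: position of the n-th set bit (1-based), or bv.size() if none
uint64_t select_1(const std::vector<bool>& bv, uint64_t n) {
    uint64_t seen = 0;
    for (uint64_t i = 0; i < bv.size(); ++i) {
        if (bv[i] && ++seen == n) return i;
    }
    return bv.size();
}

// with padding, the records for key k occupy [select_1(k+1), select_1(k+2))
std::pair<uint64_t, uint64_t> range_of_key(const std::vector<bool>& first_of_key,
                                           uint64_t k) {
    return { select_1(first_of_key, k + 1), select_1(first_of_key, k + 2) };
}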

supported queries

It is now possible to iterate through the records in order with for_each_pair or in parallel with for_each_pair_parallel. We can look up the nth key or value with nth_key and nth_value. And we can iterate over the values or unique values of a given key with for_values_of and for_unique_values_of.
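For example, gathering the values stored under a single key might look like the following sketch; the exact callback signatures are assumptions:

// every value stored under key1, including duplicates
mm.for_values_of(key1, [&](const uint64_t& value) {
    std::cout << value << "\n";
});

// each distinct value stored under key1, exactly once
mm.for_unique_values_of(key1, [&](const uint64_t& value) {
    std::cout << value << "\n";
});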

building and testing

mmmultimap is intended to be used as a library in other applications. It can be included easily as a CMake ExternalProject. A test utility is included to demonstrate usage. To build it:

cmake -H. -Bbuild && cmake --build build -- -j 4

And to run some tests:

bin/mmmultimap -t x -s 10000000 -t 4
10040123 keys
15099011 values
10688272 unique pairs
rm -f x # removes test file

development

By adding a perfect minimal hash function (PMHF) to the frontend, it should be possible to project arbitrary key sets into the dense range required by mmmultimap with only a few bits of overhead per entry. This would also obviate the need to pad our key space.

author

Erik Garrison erik.garrison@gmail.com

license

MIT