This implements a memory backed multimap intended for use where:
- your keys are integers, or can be mapped to dense range of integers,
- the memory mapped file is on fast storage, like an SSD (although this is not a requirement),
- you have arbitrary values of fixed size (e.g. structs, other POD types) that can be sorted,
- you don't need dynamic updates of the table,
- and you are likely to run out of memory of you use a traditional map or hash table,
- but you can handle approximately 1 bit per record in RAM.
These may seem to be very specific, but many problems can be mapped into a dense integer set.
mmmultimap
developed first as a data structure to support seqwish, which uses it to generate precise variation graphs from pairwise alignments between collections of sequences.
As this multimap forms a key data processing kernel in the algorithm, it can scale to extremely large problem sizes, limited only by available disk space.
Although performance is much slower than an in-memory structure, we are virtually guaranteed to be able to complete the compute.
To construct the mmmultimap
:
mmmulti::map<uint64_t, uint64_t> mm("temp.dat");
We can then add keys:
mm.append(key1, value1);
mm.append(key1, value2);
mm.append(key1, value3);
mm.append(key2, value4);
mm.append(key2, value2);
This is thread-safe when using OpenMP for parallelism. Each OMP thread writes to a different file. These are combined during the indexing phase.
To query the mmmultimap
, first index it, providing the maximum key to expect (remember, we're working on dense keys!):
mm.index(max_key);
If we index without providing a maximum key, like this:
mm.index();
... then we don't pad the multimap, and we can only enumerate the keys inside using e.g. for_each_pair
.
This can have some advantages, such as not requiring that our values be non-null (positive integers for instance, or structs with non-null entries).
This allows us to simply use the sorted array functionality of the mmmultimap.
If max_key
is specified, we first pads the records with one key/null record per key in the range of [0,max_key]
.
Then, we concatenates the thread-specific files we generated by calling append
into one file.
Then, we memory map this file and apply the ips4o in-place parallel super scalar samplesort to the memory mapped buffer to order the records.
When padded (max_key
is specified), the index is completed by building a bitvector of length max_key
, marking 1 for the first instance of a key in the sorted order, and building an auxiliary data structure to support select_1(n)
queries on it that allow us to find the records associated with a given key.
Without this index, we can only iterate through the key/value pairs.
It is now possible to iterate through the records in order with for_each_pair
or in parallel with for_each_pair_parallel
.
We can look up the nth key or value with nth_key
and nth_value
.
And we can iterate over the values or unique values of a given key with for_values_of
and for_unique_values_of
.
mmmultimap
is intended to be used as a library in other applications.
It can be included easily as a CMake ExternalProject.
A test utility is included to demonstrate usage.
To build it:
cmake -H. -Bbuild && cmake --build build -- -j 4
And to run some tests:
bin/mmmultimap -t x -s 10000000 -t 4
10040123 keys
15099011 values
10688272 unique pairs
rm -f x # removes test file
By adding a PMHF to the frontend, it should be possible to project arbitrary key sets into the dense range required by mmmultimap
with only a few bits of overhead per entry.
This would also obviate the need to pad our key space.
Erik Garrison erik.garrison@gmail.com
MIT