Enumeration<K>
implements an add-only set of elements of type K where the
elements are numbered in the order in which they are added to the set.
The elements are called keys and a key's number in the order is called index.
Lookups are possible in both ways, from key to index and from
index to key.
Memory space is O(n)
.
Time complexities are:
- add a key:
O(log n)
average,O(n)
worst-case - lookup a key:
O(log n)
- lookup an index:
O(1)
The data structure is optimized primarily for memory efficiency
and secondarily for instruction efficiency (time).
Furthermore, instruction efficiency is optimized for lookup speed over addition speed.
The O(n)
worst-case for adding a key is a tradeoff accepted for a constant gain in lookup speed.
The package is published on MOPS and GitHub. Please refer to the README on GitHub where it renders properly with formulas and tables.
The API documentation can be found here on Mops.
For updates, help, questions, feedback and other requests related to this package join us on:
Canister services often track their users by principal. Per-user data is then stored in a tree or hashmap where the key is the principal and the value is the user data.
The motivation of this data structure is to enumerate the users in the order that they were registered.
This result in a permanent user number for each user.
Instead of a tree, the user data can then be stored in a linear structure such as Buffer or Vector which has O(1)
access.
To this end, the present data structure provides an "enumerated set" of keys where the key type K
is a type parameter (e.g. Principal
).
The set elements (keys) are consecutively numbered 0,1,2,.. in the order in which they are added.
Keys cannot be deleted.
The lookup from a key to its number is tree based and the time complexity is O(log n)
.
However, the advantage of taking this approach is that this one lookup in Enumeration
can be the only lookup in the entire canister that has complexity O(log n)
.
All subsequent accesses to data structures, by being based on the key's number instead of the key, can be O(1)
.
You need mops
installed. In your project directory run:
mops add enumeration
In the Motoko source file import the package as:
import Enumeration "mo:enumeration";
let e = Enumeration.Enumeration<Blob>(Blob.compare, "");
e.add("abc"); // -> 0
e.add("aaa"); // -> 1
e.add("abc"); // -> 0
e.lookup("aaa"); // -> ?1
e.get(0); // -> "abc"
e.get(1); // -> "aaa"
You should have the latest moc
installed and know the executable's path.
Then run:
git clone git@github.com:research-ag/enumeration.git
mops install
DFX_MOC_PATH=<path-to-moc> mops test
Benchmarking code can be found here: canister-profiling
We compare Enumeration<K>
against various other maps of type K -> Nat
. The functionality of these other maps is not exactly the same but sufficiently overlaps with Enumeration that a comparison is possible. It should be noted that the other maps do not have the inverse map Nat -> K
like Enumeration does. On the other hand, they offer deletion which Enumeration does not. However, the map K -> Nat
is tree-based in all cases hence we can compare insertion and lookup operations both in terms of instructions and memory used.
For the memory benchmark we insert N random entries of a given type K
into Enumeration.
We compare that against an array of length N with the same entries and take the difference.
This tells us the memory overhead that the data structure has over an array and eliminates the effect of boxing, which depend on type and value.
We finally divide by N to get the memory overhead per entry.
The results are as follows (N = 4,096), unit is bytes per entry:
btree | enumeration | rb_tree | map v7 | map v8 |
---|---|---|---|---|
20.9 | 24 | 48 | 36* | 52 |
For example, if K
is one of Nat32, Nat64, Nat
and the values are not boxed (e.g. < 2^30) then we know that in an array each such entry is responsible for 4 bytes on the heap.
Hence, by the table, each such entry in an Enumeration<K>
is responsible for 4 + 24 = 28
bytes on the heap.
As another example, we know that a 32-byte Blob
as an array entry is responsible for 32 + 12 = 44
bytes.
Hence, adding a 32-byte entry to an Enumeration<Blob>
will allocate 44 + 24 = 68
bytes on the heap.
_* Note: map v7 has shown an unexpected type dependency which we have not investigated further.
The given result is for type Blob
.
The data structure may be more efficient for type Nat32
.
For the time benchmark we use 32 byte Blob
s as keys.
We insert N random blobs into the map.
Then we look up several of the keys that are actually in the map ("hits") and take the average.
The results are as follows (N = 4,096), unit is instructions per lookup:
btree | enumeration | rb_tree | map v7 | map v8 | |
---|---|---|---|---|---|
hits | 5107 | 3241 | 2889 | 2226 | 2111 |
misses | 5107 | 2704 | 2310 | 2226 | 2111 |
Notes:
- The hashmaps (v7, v8) are faster than the data structures that do not use hashes (all others).
- Hits are more expensive in the rb-tree based data structures because the final comparison, if it is a match, has to compare the full 32 bytes.
The underlying data structure for the map Nat -> K
is an array that holds reserve space and is grown with an algorithm similar to Buffer.
The underlying data structure for the map K -> Nat
is a red-black tree.
The two data structures are combined into one for memory efficiency.
In particular, each key is stored only once, not twice.
This is particularly important if the keys are long (e.g. Principal
s).
The red-black tree is in fact a tree from Nat -> Nat
and the comparison operation first looks up the keys in the array and then compares those.
This makes lookups by key slightly slower than in the RBTree from motoko-base
(as seen in the benchmark)
but is a necessary trade-off to achieve memory efficiency.
Shrinking of the array and key deletion in the red-black tree are not implemented because Enumeration does not allow key removal.
MR Research AG, 2023
Main author: Andrii Stepanov (AStepanov25)
Contributors: Timo Hanke (timohanke), Yurii Pytomets (Pitometsu)
Apache-2.0