Vector data structure for Motoko

Overview

The Vector data structure is meant to be a replacement for Array when a growable and/or shrinkable data structure is needed. It provides random access like Array and Buffer and can grow and shrink at the end like Buffer can. Unlike Buffer, the memory overhead for allocated but no yet used space is $O(\sqrt{n})$ instead of $O(n)$.

Links

The package is published on MOPS and GitHub. Please refer to the README on GitHub where it renders properly with formulas and tables.

The API documentation can be found here on Mops.

For updates, help, questions, feedback and other requests related to this package join us on:

Characteristics

The data structure is based on the paper Resizable Arrays in Optimal Time and Space by Brodnik, Carlsson, Demaine, Munro and Sedgewick (1999) which has the following characteristics:

based on a 2-dimensional array
persistent memory overhead: $O(\sqrt{n})$
worst-case instruction overhead: $O(\sqrt{n})$
no re-allocation or copying of data blocks

The implementation is furthermore cycle optimized.

Motivation

When developing smart contract canisters, to be sure that the canister does not ever run into cycle limits, one has to reason about the code's worst-case complexity. This is even more important as publicly accessible smart contract canisters don't control their input and operate in a potentially adversarial environment. Understanding worst-case behavior is often critical.

The go-to data structure for a resizable array is Buffer from motoko-base. Buffer is a fixed-size array with some reserve capacity that is "grown" when it fills up. Growing means that the old array is copied into a newly allocated larger array and the old array becomes garbage. The growing factor is 1.5x. Therefore, Buffer has linear persistent memory overhead (1.5x) and linear worst-case behavior (copying the entire array in the growth event).

The present data structure improves both metrics from linear to $O(\sqrt{n})$ in exchange for less performant random access. Since the underlying data structure is a 2-dimensional array, put and get operation become approximately twice as expensive. However, the implementation is highly optimized so that in practice it is less than 2x in practice. Several convenience functions operate even faster for Vector than they do for Buffer. For details see the Benchmarking section below.

Interface

Vector is a static type and can therefore be declared stable. This is unlike Buffer which is a class and can not be directly declared stable.

Vector provides 40+ convenience functions that are modeled and named after the convenience functions of Buffer. This is done to make it as easy as possible to replace Buffer with Vector.

If a stable declaration is not required then the package also provides a class version of Vector. This can be used as a drop-in replacement for Buffer as it provides exactly the same interface. As with Buffer, the user can benefit from the convenient dot-notation for the class methods.

Usage

Install with mops

You need mops installed. In your project directory run:

mops add vector

In the Motoko source file import the package as one of:

import Vec "mo:vector";
import Vec "mo:vector/Class";

for the static version or the class version, respectively.

Example

import Vector "mo:vector";

let v = Vector.new<Nat>();
Vector.add(v, 1);
Vector.add(v, 2);
Vector.add(v, 3);
assert(Vector.get(v, 0) == 1);
assert(Vector.get(v, 1) == 2);
assert(Vector.get(v, 2) == 3);
Vector.size(v);

Executable version of above example

import Vector "mo:vector/Class";

let v = Vector.Vector<Nat>();
v.add(1);
v.add(2);
v.add(3);
assert(v.get(0) == 1);
assert(v.get(1) == 2);
assert(v.get(2) == 3);
v.size();

Executable version of above example

Build & test

For tests run:

git clone git@github.com:research-ag/vector.git
mops install
mops test

Benchmarks

We extensively benchmarked Vector against Buffer and the Motoko-native Array, where applicable. Each line in the follwing tables below is one benchmark and corresponds to the given function name.

The benchmarking code can be found here: canister-profiling

The benchmarks were run with dfx 0.20.1 and moc 0.11.2.

Time

This table shows the number of wasm instruction for the given function execution.

For some functions the number of instructions is expected to be independent of the size of the vector, e.g. init, get, getOpt, put, size, clear, isEmpty. However, even in those case we run the function N times and take the average because there may be marginal differences in cost based on the concrete integer value of the index being used.

For some functions the function is run only once for a vector of size N because a single call iterates through the whole vector, e.g. addMany, clone, indexOf, firstIndexWith, lastIndexOf, lastIndexWith, forAll, forSome, forNone, iterate, iterateRev, vals, valsRev, items, itemsRev, keys, iterateItems, iterateItemsRev, addFromIter, toArray, fromArray, toVarArray, fromVarArray, contains, max, min, equal, compare, toText, foldLeft, foldRight, reverse, reversed.

The functions add and removeLast have sporadic worst-case behavior when the data structure has to grow. They are therefore run N times and the result is averaged to obtain an amortized cost per call.

N = 100,000
value data type: Nat

method	vector	vector class	buffer	array
init	15	15	14	14
addMany	17	17	-	-
clone	188	188	298	-
add	336	378	552	-
get	205	247	136	72
getOpt	261	303	149	-
put	266	309	152	82
size	183	224	101	69
removeLast	315	356	395	-
indexOf	182	182	172	56
firstIndexWith	163	163	-	-
lastIndexOf	222	222	179	-
lastIndexWith	203	203	-	-
forAll	175	175	157	-
forSome	163	163	162	-
forNone	163	163	162	-
iterate	106	106	140	-
iterateRev	133	133	-	-
vals	156	156	127	20
valsRev	163	163	-	-
items	266	266	-	-
itemsRev	293	293	-	-
keys	105	105	-	-
iterateItems	142	142	-	-
iterateItemsRev	177	177	-	-
addFromIter	406	406	354	-
toArray	158	158	117	-
fromArray	163	163	190	-
toVarArray	226	226	170	114
fromVarArray	163	163	190	64
clear	139	180	329	-
contains	182	182	163	56
max	174	174	191	57
min	174	174	197	57
equal	356	356	243	133
compare	397	397	284	133
toText	455	455	398	0
foldLeft	163	163	176	69
foldRight	190	190	192	135
reverse	426	426	243	145
reversed	412	412	243	145
isEmpty	116	156	120	88

Note:

add is the function that can grow the data structure. It performs better in amortized terms than for Buffer because the growth events are cheaper. Vector only allocates new data blocks, it does not re-allocate and copy old data blocks.
get, put and getOpt are the random access functions. Vector is a 2-dimensional array where (only) the second dimension has option-values, Buffer is a 1-dimensional array with option-values and Array is a 1-dimensional array with non-option-values. Hence, the expected access time for Vector is expected to be roughly the sum of access times for Buffer and Array. This is correctly reflected in the numbers.
All functions that iterate through the data structure are optimized in a way that they don't use random access. This is the reason that they are generally only slighly (0-35%) more costly than Buffer. In some case the function can be cheaper than for Buffer (iterate, max, min).

Memory

This table shows the heap allocation (persistent and garbage) for the given function execution. The results are for a data structure of size N.

The memory size is generally shown in bytes for a single function execution.

In some cases, the value depends on the size N of the Vector, e.g. init, addMany, clone, etc.

In cases when there is an amortized cost such as add, removeLast then the function is executed N times so that one can get an idea of the average.

N = 100,000
value data type: Nat

method	vector	vector class	buffer	array
init	408688	409076	400504	400008
addMany	408640	408640	-	-
clone	425032	425420	553568	-
add	416060	416060	1659216	-
get	0	0	0	0
getOpt	0	0	0	-
put	0	0	0	0
size	0	0	0	0
removeLast	7404	7404	553112	-
indexOf	28	28	0	0
firstIndexWith	8	8	-	-
lastIndexOf	20	20	0	-
lastIndexWith	0	0	-	-
forAll	24	24	48	-
forSome	8	8	48	-
forNone	8	8	48	-
iterate	8	8	48	-
iterateRev	0	0	-	-
vals	172	172	48	0
valsRev	68	68	-	-
items	1600104	1600104	-	-
itemsRev	1600080	1600080	-	-
keys	44	44	-	-
iterateItems	8	8	-	-
iterateItemsRev	0	0	-	-
addFromIter	416060	416060	1200008	-
toArray	400180	400180	400024	-
fromArray	408716	409104	600504	-
toVarArray	400180	400180	400008	400008
fromVarArray	408716	409104	600504	400024
clear	20	20	40	-
contains	28	28	48	0
max	36	36	48	0
min	36	36	48	0
equal	344	344	0	0
compare	344	344	0	0
toText	3200164	3200164	3199992	296
foldLeft	36	36	48	0
foldRight	28	28	0	0
reverse	0	0	0	400028
reversed	416144	416532	0	400028
isEmpty	0	0	0	0

Note:

init and addMany create a data structure of size N. Here we see the persistent $\sqrt{N}$ memory overhead for Vector relative to Array.
add shows the garbage creation of Buffer due to copying of the entire data block during growth events. Vector copies only its index block which is in the order of $\sqrt{N}$.
removeLast shows the same effects as add but for shrink events.
items produces a large amount of garbage because the iterator produces tupels (unlike vals which produces single Nat values in this example). If that is a problem than the iterateItems function may provide a better alternative for the use case.

Design

The data structure is based on the paper Resizable Arrays in Optimal Time and Space by Brodnik, Carlsson, Demaine, Munro and Sedgewick (1999).

The vector elements are stored in so-called data blocks and the whole data structure consists of a sequence of data blocks of increasing size. Hence it is in fact a two-dimensional array (but not a "square" one).

The trick lies in the selection of the sizes of the data blocks. They are chosen such that the conversion of the externally used single index to the internally used index pair can be cheaply done by bit shifts.

The data block sizes can be better understood when thinking of the data blocks being arranged in "super blocks". Super blocks are merely a virtual concept and have no manifestation in the implementation. The capacity of a super block is always a $2$-power. The $i$-th super block has capacity $2^i$ and consists of $2^{\lfloor i / 2\rfloor}$ data blocks of size $2^{\lceil i / 2 \rceil}$. This is followed by the next super block of capacity $2^{i+1}$ and so on.

Hence, the sequence of data block sizes look like this:

$$1,\ \ 2,\ \ 2,2,\ \ 4,4,\ \ 4,4,4,4,\ \ 8,8,8,8,\ \ ...$$

where the additional white space indicates super block boundaries.

Implementation notes

Each data block is a mutable array of type [var ?X] where X is the element type. The data blocks themselves are stored in the mutable array called data_blocks. Hence data_blocks has type [var [var ?X]].

The present implementation differs from the article in that the data block indices are shifted by $2$ and we introduce two data blocks of size $0$ and $1$ at the beginning of the sequence. This makes the access faster because it eliminates the frequent computation of $i+2$ in the internal formulas needed for index conversion.

Besides the data_blocks array, the Vector type constains the index pair i_block, i_element which means the next position that should be written by an add operation: data_blocks[i_block][i_element]. We do not store any more information to reduce memory. But we also do not store less any information (such as only the total size in a single variable) as to not slow down access.

When growing we resize data_blocks (the outer array) so that it can store exactly one next super block. But unused data blocks in the last super block are not allocated, i.e. set to the empty array.

When shrinking we keep space in data_blocks for two additional super blocks. But unused data blocks in the last two super blocks are deallocated, i.e. set to the empty array.

Copyright

MR Research AG, 2023-2024

Authors

Andrii Stepanov with contributions from Timo Hanke, Andy Gura and react0r-com.

License

Apache-2.0