The Vector
data structure is meant to be a replacement for Array
when a growable and/or shrinkable data structure is needed.
It provides random access like Array
and Buffer
and can grow and shrink at the end like Buffer
can.
Unlike Buffer
, the memory overhead for allocated but no yet used space is
The package is published on MOPS and GitHub. Please refer to the README on GitHub where it renders properly with formulas and tables.
The API documentation can be found here on Mops.
For updates, help, questions, feedback and other requests related to this package join us on:
The data structure is based on the paper Resizable Arrays in Optimal Time and Space by Brodnik, Carlsson, Demaine, Munro and Sedgewick (1999) which has the following characteristics:
- based on a 2-dimensional array
- persistent memory overhead:
$O(\sqrt{n})$ - worst-case instruction overhead:
$O(\sqrt{n})$ - no re-allocation or copying of data blocks
The implementation is furthermore cycle optimized.
When developing smart contract canisters, to be sure that the canister does not ever run into cycle limits, one has to reason about the code's worst-case complexity. This is even more important as publicly accessible smart contract canisters don't control their input and operate in a potentially adversarial environment. Understanding worst-case behavior is often critical.
The go-to data structure for a resizable array is Buffer
from motoko-base.
Buffer
is a fixed-size array with some reserve capacity that is "grown" when it fills up.
Growing means that the old array is copied into a newly allocated larger array and the old array becomes garbage.
The growing factor is 1.5x.
Therefore, Buffer
has linear persistent memory overhead (1.5x) and
linear worst-case behavior (copying the entire array in the growth event).
The present data structure improves both metrics from linear to Vector
than they do for Buffer
.
For details see the Benchmarking section below.
Vector
is a static type and can therefore be declared stable
.
This is unlike Buffer
which is a class and can not be directly declared stable
.
Vector
provides 40+ convenience functions that are modeled and named after the convenience functions of Buffer
.
This is done to make it as easy as possible to replace Buffer
with Vector.
If a stable
declaration is not required then the package also provides a class version of Vector
.
This can be used as a drop-in replacement for Buffer
as it provides exactly the same interface.
As with Buffer
, the user can benefit from the convenient dot-notation for the class methods.
You need mops
installed. In your project directory run:
mops add vector
In the Motoko source file import the package as one of:
import Vec "mo:vector";
import Vec "mo:vector/Class";
for the static version or the class version, respectively.
import Vector "mo:vector";
let v = Vector.new<Nat>();
Vector.add(v, 1);
Vector.add(v, 2);
Vector.add(v, 3);
assert(Vector.get(v, 0) == 1);
assert(Vector.get(v, 1) == 2);
assert(Vector.get(v, 2) == 3);
Vector.size(v);
Executable version of above example
import Vector "mo:vector/Class";
let v = Vector.Vector<Nat>();
v.add(1);
v.add(2);
v.add(3);
assert(v.get(0) == 1);
assert(v.get(1) == 2);
assert(v.get(2) == 3);
v.size();
Executable version of above example
For tests run:
git clone git@github.com:research-ag/vector.git
mops install
mops test
We extensively benchmarked Vector
against Buffer
and the Motoko-native Array
, where applicable.
Each line in the follwing tables below is one benchmark and corresponds to the given function name.
The benchmarking code can be found here: canister-profiling
The benchmarks were run with dfx 0.20.1 and moc 0.11.2.
This table shows the number of wasm instruction for the given function execution.
For some functions the number of instructions is expected to be independent of the size of the vector,
e.g. init
, get
, getOpt
, put
, size
, clear
, isEmpty
.
However, even in those case we run the function N times and take the average because there may be marginal differences in cost based on the concrete integer value of the index being used.
For some functions the function is run only once for a vector of size N
because a single call iterates through the whole vector,
e.g. addMany
, clone
, indexOf
, firstIndexWith
, lastIndexOf
, lastIndexWith
, forAll
, forSome
, forNone
, iterate
, iterateRev
, vals
, valsRev
, items
, itemsRev
, keys
, iterateItems
, iterateItemsRev
, addFromIter
, toArray
, fromArray
, toVarArray
, fromVarArray
, contains
, max
, min
, equal
, compare
, toText
, foldLeft
, foldRight
, reverse
, reversed
.
The functions add
and removeLast
have sporadic worst-case behavior when the data structure has to grow.
They are therefore run N times and the result is averaged to obtain an amortized cost per call.
N = 100,000
value data type: Nat
method | vector | vector class | buffer | array |
---|---|---|---|---|
init | 15 | 15 | 14 | 14 |
addMany | 17 | 17 | - | - |
clone | 188 | 188 | 298 | - |
add | 336 | 378 | 552 | - |
get | 205 | 247 | 136 | 72 |
getOpt | 261 | 303 | 149 | - |
put | 266 | 309 | 152 | 82 |
size | 183 | 224 | 101 | 69 |
removeLast | 315 | 356 | 395 | - |
indexOf | 182 | 182 | 172 | 56 |
firstIndexWith | 163 | 163 | - | - |
lastIndexOf | 222 | 222 | 179 | - |
lastIndexWith | 203 | 203 | - | - |
forAll | 175 | 175 | 157 | - |
forSome | 163 | 163 | 162 | - |
forNone | 163 | 163 | 162 | - |
iterate | 106 | 106 | 140 | - |
iterateRev | 133 | 133 | - | - |
vals | 156 | 156 | 127 | 20 |
valsRev | 163 | 163 | - | - |
items | 266 | 266 | - | - |
itemsRev | 293 | 293 | - | - |
keys | 105 | 105 | - | - |
iterateItems | 142 | 142 | - | - |
iterateItemsRev | 177 | 177 | - | - |
addFromIter | 406 | 406 | 354 | - |
toArray | 158 | 158 | 117 | - |
fromArray | 163 | 163 | 190 | - |
toVarArray | 226 | 226 | 170 | 114 |
fromVarArray | 163 | 163 | 190 | 64 |
clear | 139 | 180 | 329 | - |
contains | 182 | 182 | 163 | 56 |
max | 174 | 174 | 191 | 57 |
min | 174 | 174 | 197 | 57 |
equal | 356 | 356 | 243 | 133 |
compare | 397 | 397 | 284 | 133 |
toText | 455 | 455 | 398 | 0 |
foldLeft | 163 | 163 | 176 | 69 |
foldRight | 190 | 190 | 192 | 135 |
reverse | 426 | 426 | 243 | 145 |
reversed | 412 | 412 | 243 | 145 |
isEmpty | 116 | 156 | 120 | 88 |
Note:
add
is the function that can grow the data structure. It performs better in amortized terms than forBuffer
because the growth events are cheaper.Vector
only allocates new data blocks, it does not re-allocate and copy old data blocks.get
,put
andgetOpt
are the random access functions.Vector
is a 2-dimensional array where (only) the second dimension has option-values,Buffer
is a 1-dimensional array with option-values andArray
is a 1-dimensional array with non-option-values. Hence, the expected access time forVector
is expected to be roughly the sum of access times forBuffer
andArray
. This is correctly reflected in the numbers.- All functions that iterate through the data structure are optimized in a way that they don't use random access. This is the reason that they are generally only slighly (0-35%) more costly than
Buffer
. In some case the function can be cheaper than forBuffer
(iterate
,max
,min
).
This table shows the heap allocation (persistent and garbage) for the given function execution. The results are for a data structure of size N.
The memory size is generally shown in bytes for a single function execution.
In some cases, the value depends on the size N of the Vector
,
e.g. init
, addMany
, clone
, etc.
In cases when there is an amortized cost such as add
, removeLast
then the function is executed N times
so that one can get an idea of the average.
N = 100,000
value data type: Nat
method | vector | vector class | buffer | array |
---|---|---|---|---|
init | 408688 | 409076 | 400504 | 400008 |
addMany | 408640 | 408640 | - | - |
clone | 425032 | 425420 | 553568 | - |
add | 416060 | 416060 | 1659216 | - |
get | 0 | 0 | 0 | 0 |
getOpt | 0 | 0 | 0 | - |
put | 0 | 0 | 0 | 0 |
size | 0 | 0 | 0 | 0 |
removeLast | 7404 | 7404 | 553112 | - |
indexOf | 28 | 28 | 0 | 0 |
firstIndexWith | 8 | 8 | - | - |
lastIndexOf | 20 | 20 | 0 | - |
lastIndexWith | 0 | 0 | - | - |
forAll | 24 | 24 | 48 | - |
forSome | 8 | 8 | 48 | - |
forNone | 8 | 8 | 48 | - |
iterate | 8 | 8 | 48 | - |
iterateRev | 0 | 0 | - | - |
vals | 172 | 172 | 48 | 0 |
valsRev | 68 | 68 | - | - |
items | 1600104 | 1600104 | - | - |
itemsRev | 1600080 | 1600080 | - | - |
keys | 44 | 44 | - | - |
iterateItems | 8 | 8 | - | - |
iterateItemsRev | 0 | 0 | - | - |
addFromIter | 416060 | 416060 | 1200008 | - |
toArray | 400180 | 400180 | 400024 | - |
fromArray | 408716 | 409104 | 600504 | - |
toVarArray | 400180 | 400180 | 400008 | 400008 |
fromVarArray | 408716 | 409104 | 600504 | 400024 |
clear | 20 | 20 | 40 | - |
contains | 28 | 28 | 48 | 0 |
max | 36 | 36 | 48 | 0 |
min | 36 | 36 | 48 | 0 |
equal | 344 | 344 | 0 | 0 |
compare | 344 | 344 | 0 | 0 |
toText | 3200164 | 3200164 | 3199992 | 296 |
foldLeft | 36 | 36 | 48 | 0 |
foldRight | 28 | 28 | 0 | 0 |
reverse | 0 | 0 | 0 | 400028 |
reversed | 416144 | 416532 | 0 | 400028 |
isEmpty | 0 | 0 | 0 | 0 |
Note:
-
init
andaddMany
create a data structure of size N. Here we see the persistent$\sqrt{N}$ memory overhead forVector
relative toArray
. -
add
shows the garbage creation ofBuffer
due to copying of the entire data block during growth events.Vector
copies only its index block which is in the order of$\sqrt{N}$ . -
removeLast
shows the same effects asadd
but for shrink events. -
items
produces a large amount of garbage because the iterator produces tupels (unlike vals which produces single Nat values in this example). If that is a problem than theiterateItems
function may provide a better alternative for the use case.
The data structure is based on the paper Resizable Arrays in Optimal Time and Space by Brodnik, Carlsson, Demaine, Munro and Sedgewick (1999).
The vector elements are stored in so-called data blocks and the whole data structure consists of a sequence of data blocks of increasing size. Hence it is in fact a two-dimensional array (but not a "square" one).
The trick lies in the selection of the sizes of the data blocks. They are chosen such that the conversion of the externally used single index to the internally used index pair can be cheaply done by bit shifts.
The data block sizes can be better understood when thinking of the data blocks being arranged in "super blocks".
Super blocks are merely a virtual concept and have no manifestation in the implementation.
The capacity of a super block is always a
Hence, the sequence of data block sizes look like this:
where the additional white space indicates super block boundaries.
Each data block is a mutable array of type [var ?X]
where X
is the element type.
The data blocks themselves are stored in the mutable array called data_blocks
.
Hence data_blocks
has type [var [var ?X]]
.
The present implementation differs from the article in that the data block indices are shifted by
Besides the data_blocks
array, the Vector
type constains the index pair i_block
, i_element
which means the next position that should be written by an add
operation:
data_blocks[i_block][i_element]
.
We do not store any more information to reduce memory.
But we also do not store less any information (such as only the total size in a single variable)
as to not slow down access.
When growing we resize data_blocks
(the outer array) so that it can store exactly one next super block. But unused data blocks in the last super block are not allocated, i.e. set to the empty array.
When shrinking we keep space in data_blocks
for two additional super blocks. But unused data blocks in the last two super blocks are deallocated, i.e. set to the empty array.
MR Research AG, 2023-2024
Andrii Stepanov with contributions from Timo Hanke, Andy Gura and react0r-com.
Apache-2.0