OpenZeppelin/openzeppelin-contracts

Add an Enumerable Set data structure

Closed this issue · 24 comments

The OwnerManager contract from Gnosis safe is basically an implementation for a LinkedList with O(1) check for contains(item). It's a mapping with the same type for keys and values (addresses in this case), where each item points to the next one in the mapping.

Could be interesting to abstract this and build it into OZ. Note that, given the lack of generics in Solidity, we may need to provide the same contract for different data types.

  • 📈 This is a feature request.

I think we should start by providing the linked list for addresses only (likewise for other data structures). It should cover most use cases.

I think this is a pretty cool idea. Has anyone worked on this yet?

Hi guys,

I did an implementation for Singly and Doubly Linked Lists, which I would be happy to transfer to your repo. These contracts have been thoroughly tested:

https://github.com/HQ20/contracts/tree/dev/contracts/lists

I have also an implementation for an Ordered Doubly Linked list in the same directory, but one of the methods is O(N) so it might not be safe enough for normal use.

I also have an implementation of a Linked List using OOP principles, with each element in the list being a contract. Nice and clear implementation, with horrid gas costs. I can point you to it as well if you are interested.

I'd love to hear your feedback, thanks.

Interesting, thanks @albertocuestacanada! We've seen many projects roll out their own collections, sometimes with bad results (@ElOpio mind chiming in on this?).

About your specific implementation, we'd probably prefer to have the collections be a library that operates on some custom struct, like Roles does, letting users have multiple collections in the same contract.

Thank you @albertocuestacanada!

We've defined clearer requirements for this feature for whoever wants to open a pull request. The focus of the issue is on building an enumerable set, rather than a generic linked list. The linked list is simply the way the set should be implemented.

  • Since there are no generics we should first build an enumerable address set, which should cover most uses cases.
  • We want this to be built as a library with a struct so that a contract can have multiple sets.
  • The available operations should be: append(addr) (insert at the end of the list), remove(addr), contains(addr), and enumerate (returning all the elements as an array).

An enumerable address set implemented as a library was not too far from what I've got already. It's drafted already (I'll need to write tests for two functions).

Do I need permissions to create a branch in your repo and push the code?

Cool! Fork this repo and create a branch in the fork, then open a PR from there.

Apologies for missing that critical line in CONTRIBUTING.md @frangio

Here is the draft of Enumerable.sol.

I left it in drafts, not sure where would you prefer to put it.

Check if it is what you are after. If so I'll write the tests and have a think about performance.

It was developed from a Doubly Linked List, but since you need less features there is probably stuff to optimize.

@nventuro yes, we have seen many projects copying or reimplementing their own collections. It would be nice to start a do-group to list all of these, maybe in a wiki topic in our forum, to analyze the different approaches and maybe even get people working together. In the ideal case, this would be a push to get generics into the language.

About the road forward, I agree with @frangio's approach to start with a set of addresses. This is immediately useful to many use cases, and will teach us about limitations and possible ways to generalize to other types and underlying structures.

@albertocuestacanada Is it possible to use addresses as ids? I was imagining something much simpler than what you shared, but I don't know if there are reasons why it can't be as simple as a mapping (address => address) (plus head and tail), as opposed to also having ids and Objects.

Hi @frangio, as long as you use a linked list implementation I don't think that you can use addresses as ids in a meaningful way.

addresses as ids

In a linked list you need to record at least two data points per element: the data that needs to be stored and some way of retrieving the next element. This usually points to using a struct and mapping like these:

struct Element {
    address data;
    uint256 next
}

mapping (uint256 => Element)

next is an identifier to retrieve the next struct from the mapping, since we can't have recursive structs. Using an address for ids instead of an uint256 would mean that when someone appends a new element in the list they need to come up with an unique address as an id, or we need to have the contract generate the address used as an id randomly. Neither case is very usable.

I'm using uint256 because I need ids and to generate them the easiest is to use a counter (I can reuse Counters from your library, btw).

complex implementation

Using a doubly linked list for an enumerable set might be overkill, since you don't need arbitrary insertion.

For the Enumerable Set that you requested, there are two implementations that are simpler.

  1. If you can live with some fragmentation you can use this implementation. It only needs to implement an enumerate function out of repeated calls to next.

  2. If you don't need to keep the elements in the insertion order, you can use the this implementation again, but this time also modify the remove function to do something like this and avoid fragmentation.

The implementation also becomes more complex from the requirement of coding it as a library.

next steps

I think that we need to think a bit on the requirements:

  • Do we just want an enumerable set, where the elements might be returned in a different order each time we enumerate them?. contains and enumerate will be O(N) where N is the number of elements in the set.
  • Do we want an enumerable set that respects the insertion order? Then enumerate and contains are O(N*) where N* is potentially the number of elements that have ever been added to the set.
  • Do we want an enumerable set that respects the insertion order, and where all functions are O(1)? Then we use a Doubly Linked List, we force the user to call next instead of implementing enumerate, and implement an index for contains using a mapping.

@ElOpio, if you give me some time I'm writing an article comparing the different implementations, it might be useful. I've got already some of the work done.

@frangio, also, what do you guys use to deploy contracts and link artifacts in this repo? There is nothing in the migrations or .openzeppelin folders.

In a linked list you need to record at least two data points per element: the data that needs to be stored and some way of retrieving the next element. This usually points to using a struct and mapping like these:

struct Element {
   address data;
   uint256 next
}

mapping (uint256 => Element)

next is an identifier to retrieve the next struct from the mapping, since we can't have recursive structs. Using an address for ids instead of an uint256 would mean that when someone appends a new element in the list they need to come up with an unique address as an id, or we need to have the contract generate the address used as an id randomly. Neither case is very usable.

I'm using uint256 because I need ids and to generate them the easiest is to use a counter (I can reuse Counters from your library, btw).

I think @frangio wasn't proposing a regular Node { data, next } structure where next is of type address, but instead have next itself be data. This fits nicely with the idea of the set: each address can only be stored once. Additionally, you'd get O(1) contains, since instead of searching through the whole list, you can access it's entry directly. enumerate and friends will obviously continue to be O(n).

In this sense, what we aren't building isn't a list, but rather an enumerable set: the list is simply the tool we use to get ordering. This is similar to what you mention here:

Do we want an enumerable set that respects the insertion order, and where all functions are O(1)? Then we use a Doubly Linked List, we force the user to call next instead of implementing enumerate, and implement an index for contains using a mapping.

Thinking out loud, we may need a sentinel value to handle the tail element, since a value would be considered to be in set if its entry is not empty, but this value is also next.


Regarding ordering, as far as I know the term enumerable implies sorting, so I'd say yes.

@frangio, also, what do you guys use to deploy contracts and link artifacts in this repo? There is nothing in the migrations or .openzeppelin folders.

I'm not sure what you mean by deployments: all of OpenZeppelin Contracts libraries have only internal functions, which means they don't need to be deployed and linked. Their code is instead inlined when used.

Ok, I understand how you want to implement the EnumerableSet with mapping(address => address). That just means that all values in the set are unique, so we can have [a, b, c] but not [a, a, b].

We can either have a sentinel value for tail, or specify that no set can contain address(0).

contains and append would be O(1), enumerate and remove O(N).

An insertAfter function can also be implemented, with O(1). You don't need to restrict yourself to appending at the end.

Finally, you can have remove to be O(1) if you have two mappings:

mapping(address => address) forward
mapping(address => address) backward

It would double the gas cost of append (and insert if implemented), but the behaviour would be more predictable. Implementation will also be easier.

My only question at this point would be: Do we use a tail variable, or forbid adding address(0) to the set?

I can do this next week, no problem.

It took me a bit less to do than expected: Library, mock and tests.

I've got to move the contract to a fork of your repo, and convert the tests to your framework, but for now I've done it where I'm most comfortable.

What do you guys think?

It looks good! It's definitely what I had in mind. Keep in mind that library functions should be internal, and the helpers used for implementation should be private.

I have a few items I'd like to discuss though.

@nventuro and I have an intuition that the library shouldn't emit any built-in events. Events should be left up to the user of the library. What do you think?

I also don't like that enumerate requires iterating two times. It feels like we could also keep track of the length to avoid that extra iteration. I'm not sure what is the better tradeoff.

Another thing that I think is worth discussing a bit more is whether to make it doubly linked or not. Something interesting about the Gnosis Safe posted in OP is that it is singly linked and remove has an argument for specifying the previous element so that you can do the removal in O(1) without iterating from the head. The same API can be generalized to providing a "hint" that can be any element that is before in the list, and in the worst case you can provide the head and remove in O(n). I find this part of the design space very interesting and uniquely suited to blockchain because the magnitude of the n will be insignificant when seen off-chain. The feasibility of this approach depends on the off-chain components being able to find the previous element efficiently though, and I'm not sure about that. Events sound like the ideal way to do this but on the other hand I'm proposing to remove the built-in events... I'm interested in hearing your thoughts on this. Particularly @spalladino, you may be able to provide a more full-stack view.

Lastly, this one is a bit more radical but there is another pattern for implementing enumerable sets which is using an array address[] and a mapping address => uint where you store the index for each address. I haven't had time to compare these two but it seems to me like they provide the same functionality, and this one may have better characteristics.

@frangio, I've corrected the visibility, and also managed to get rid of everything related to the head and tail variables.

Now next[address(0)] points to the head, and prev[address(0)] points to the tail, making the whole thing quite elegant.

enumerate

On enumerate requiring two loops there is indeed a tradeoff. If you keep track of the length each append call will store an extra word. Currently there append stores 3 words, so that would be a 33% more gas roughly if we keep a length counter, remove would cost about twice the gas with the counter.

Apart from that, append and remove are transactional and enumerate is a view. If enumerate is going to be mostly called from outside the contract I wouldn't fret about it looping twice.

On this tradeoff I have no opinion, it depends on how you envision enumerate being used.

double or single linked
If this is made single linked then I'll need to keep a tail variable (to know where to append), so an append operation would store two words instead of the current three. The performance of remove with hints wouldn't be too bad, but usability of that option would be poor. On this tradeoff I would vote against single linked.

address[] and address => uint
I assume here that address[] is where the data is stored, and address => uint is the order in which the elements should be placed in the enumerable array. contains and append are O(1), remove is O(N) unless you keep another mapping to tell you where in the address[] array you can find each specific address. I think that on remove you could do this to avoid fragmentation, in which case enumerate would be the same O(N).

Up to here, it seems to me that it has the same performance and features as the current implementation.

A slightly better implementation is using the address[] for the items, as before, but using an address => uint[] for the indexes. That way you wouldn't be restricted to unique items in your set. I can't think of a downside to this (except code complexity).

If you want, I can do an implementation of this pattern as well. It's fun 🤓

Great analysis, thank you. I agree that the consideration that enumerate is going to be called off-chain is an important one, and looping twice is not a big deal in that context.

We should be clear that for this feature we are only interested in storing unique items, and we are not interested in their ordering in the array. In the alternative implementation, address => uint maps each address to the index where it can be found in the address[] array. This results in O(1) remove by replacing the removed element with the last one and updating its index in the mapping. Based on this, do you still think the linked list is preferable?

@frangio, you should have told me that you were just interested in a canonical implementation of a Set. We would have got here a lot faster :)

For an unordered data set using a linked list makes no sense. The approach that you mention: storing values in an array, infilling on remove, and using an index to have O(1) for contains seems the right approach.

Besides, it's a very simple implementation, which is always good.

Contract, mock and tests..

Finally I managed to get the tests running on the fork of your repo.

Contract, mock and tests.

Where should this go? A types folder inside contracts?

Awesome. I would say let's put it in utils.

Done.

I've got a linting error: Expected an assignment or function call and instead saw an expression
On the last line of each test, like this one: expect(await this.set.testContains(a)).to.be.false;

I would think that expect(...) is a function call 🤔

The issue is to.be.false: that's one of the reasons why we prefer to.equal(false).

@frangio do you think its time to graduate this issue into a PR?

Thanks @nventuro, linting fixed now 👍

Yes @albertocuestacanada please submit a PR.