Add a changes trie

Question

Add a changes trie

Opened this issue a year ago · 4 comments

This issue is a "pre-RFC". Writing a complete RFC would be time consuming (especially because it requires some technical knowledge about BEEFY that I'm currently lacking), so I would like to gather some feedback first.
The idea of making the changes trie asynchronous (explanations below) comes from @gavofyork.

Context

It is currently notoriously difficult to obtain for example the history of the balance of an account.
The only way to do this in a trustless way currently is to download each block of the entire chain one by one, and check if the balance has been modified by this block. Doing this is an extremely expensive operation that would likely take multiple days.

One elegant way to solve this problem is to store somewhere (explanations below) the history of the changes of each storage key. Doing this is very elegant as it automatically contains the history of everything that happened on the chain, and doesn't need any case-by-case handling by the various pallets.

In order to make it possible for light clients to access this history in a trustless way, the proposed method is to organize this history into a trie whose root hash is then made trustless. This trie is called the changes trie.

History

Substrate used to have a changes trie, but it was removed due to being too heavy on I/O operations. In other words, this was slowing down blocks production and verification too much.

In order to solve this problem, this RFC proposes to make validators construct the changes trie and then vote for it through the exact same mechanism as BEEFY votes for an MMR root.

In order to make it possible for validators to vote on a single changes trie root, this changes trie need to cover the entire history of the chain, rather than only one block like was the case for the previous changes trie.

Details

Here is a list of the changes:

At block 0, the changes trie is filled with the same entries as the state trie, except that keys are modified to be concat(":current", key_of_state_trie) and that values are set to 0 (the block number where this key was last modified).
For each block between the BEEFY head and the BEEFY voting target, validators do this:
- For each key of the state that has been newly added, add a changes trie entry to concat(":current", key_of_state_trie) and set its value to block_number.
- For each key of the state that has been modified, set the changes trie entry at concat(block_number, key_of_state_trie) to the value at concat(":current", key_of_state_trie), then set the changes trie entry at concat(":current", key_of_state_trie) to block_number.
- For each key of the state that has been removed, remove the changes trie entry at concat(":current", key_of_state_trie).
When validators perform their BEEFY voting, they also exchange the changes trie root hash of the block that they are voting for.
Once a block has been "BEEFY-finalized", in order to save space, validators can remove from their database all entries that start with block_number and keep only the key and hash of the subtree that they have removed. Archive nodes (as defined in #59) do not remove the subtree from their database.

In order words, the changes trie would contain a subtree of prefix :current, and one subtree for each block number. The subtree of prefix :current contains the latest block where each key of the state has last been modified, and the subtree of each block number contains, for each key, the previous block number where it has been modified.
This scheme makes it possible to progressively iterate down block numbers, looking up only blocks where the entry has effectively been modified.

The exact keys and encoding of values are all details that aren't important at the moment. The block number could be encoded in a way to make it possible to strip chunks of block numbers from the database entire, for example you could replace the changes of all blocks between 0 and 100000 with a single hash.

Drawbacks:

It's not possible for example to obtain the history of the balance of an account that has been removed. Maybe we could add, within the :current subtree and at various branches, the block number where an entry has last been removed.

Unresolved questions:

If the runtime modifies a value to the same value as it already had (i.e. does something like storage_set(key, storage_get(key))), do we update the changes trie or not?

As I'm not very familiar with BEEFY, I hope that what is suggested is possible.

Not covered by this issue is the fact that we would add a networking protocol to make it possible to query the content of the changes trie from archive nodes through the peer-to-peer network.

Answer 1 · 2024-01-18T14:30:26.000Z

Ideally, we'll move accounts off the relay chain eventually, so then overall validators would no longer have special information about accounts changes, only about when parachains made blocks. It's maybe faster to querty parachain nodes to find out when the parachain made blocks.

Instead, you could've each account contain the tuple (my_parent,previous_parent) so that anytime you touch an account you set previous_parent = my_parent and set my_parent to be the block hash of the block upon which you're building. This still works on parachains. I think *_parents could be a block height, not a block hash.

It's also possible to update the account without touching these for specific common events, like rewards, but then instead users track their rewards blocks by tracking the validators they nominate in a similar manor.

Answer 2 · 2024-01-18T15:40:04.000Z

Anyways, there are other reasons to have off-chain database maintained by collators but never tracked even by the parachain, or maybe tracked only by collators but never validated:

As an example, we know the current trie does many dumb things, like storing data in internal nodes, radix >2, etc. and stuff like Cosmos' somewhat flat trie is much better. If you've some new storage type without non-membership proofs, then accounts could be stored at random leaves in a flat depth binary merkle tree. We then have collators maintain a seperate non-merklized data structure of which account lives at which index. This could be more efficent than anything used by other chains.

As a similar optimization, collators could track account changes, either like this RFC discusses, or by storing history in this off-chain non-merklized data structure, or by having a seperate merklaized data structure for these (my_parent,previous_parent) entries, but which only collators track, not validators.

In what threat model do we prioritize users locating their account history quickly? It's a liveness concern so yes we could definitely leave this up to collators, but this requires off-chain data structures maintained by collators in parallel to the parablocks. It's does not require relay chain changes, but does require lite client changes.

Answer 3 · 2024-01-19T19:47:13.000Z

So you'd only be able to query changes on and archive node (since full nodes would prune almost all changes up to finalisation)?

Answer 4 · 2024-01-22T08:24:48.000Z

So you'd only be able to query changes on and archive node (since full nodes would prune almost all changes up to finalisation)?

The changes trie in itself is not very useful if you then can't read the storage, so to me it makes sense that only archive nodes keep the full changes trie.

In #59 I suggest that full nodes should be able to serve the storage of the finalized block minus 16.
In my opinion something similar should be done for the changes trie.

After a quick discussion with @andresilva, it might be better to put the changes trie root in the MMR itself as a field of the leaves.