Backward compatibility of the Log
vvp opened this issue · 5 comments
As OrbitDB version 0.20.0 (orbitdb/orbitdb#524) is getting nearer and work on IPLD support (#200) has started, it would be a good time to discuss the backward compatibility of the log. Currently there is not much:
- A common way of dealing with backward compatibility is to add versioning information to any data structures that live longer than software versions. However, logs and their entries contain no usable versioning information: the entry struct has a `v` field, but it cannot be used to differentiate between incompatible versions because it's currently always 0.
- There are no tools, procedures, or mechanisms in place to resolve the structural and semantic incompatibilities between two logs. Therefore, to upgrade to a backward-incompatible ipfs-log version, users have no choice but to implement the incompatibility resolution themselves, which will be either a lot of work or simply impossible.
For example, in the next release there's a new `identity` field in entry structures. The current version expects it to be there when entries are loaded from IPFS, and the access controller will actually fail if there's no `identity` information in entries to append. Log entries created with previous versions will not have this information. Fortunately, this check is done only on new items appended/joined into the log, so appending new entries to old logs will still work after a version upgrade.
Some design aspects that I see:
- How to define the version information itself? A monotonically increasing number, semver/calver, the ipfs-log version, or something else? I think we just need to differentiate between two log versions if they have structural or semantic differences. Many database migration schemes use partially ordered versioning, which is OK too.
- Version the log and its entries, or just the log? Versioning just the log would probably have lower overhead (no redundant versioning in entries, fail-fast version checks in `LogIO.fromMultihash()`), whereas versioning the entries too would allow joining logs with different versions together and be more flexible in backward compatibility. A single-version log would probably need an internal, version-specific `logId`, which would then have consequences for entries' log references too.
- Should there be support for multiple versions at the code level, or should older log versions be required to migrate to the single code-supported version first? Supporting multiple log/entry versions can make development quite troublesome and error-prone, whereas requiring migrations will make the upgrade process more involved (especially with larger logs).
- Could we use the upgrade/migration mechanism to help users with payload versioning as well?
Any thoughts, opinions?
I'll kick off the discussion with a proposal that we use a monotonically increasing `version` field inside of the entries themselves, the absence of which is to be treated like the value 1. The field's explicit value will start at 2.
This has the benefit of freeing the entry versions from having to be in lock-step with the package version, and gives us the added benefit of being able to join logs of different versions. If possible, it's best to leave the old entries where they are instead of recreating/duplicating them in a migration.
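As a sketch of that rule (a hypothetical helper, not part of ipfs-log today):

```js
// Hypothetical helper implementing the proposed rule: a missing `version`
// field is treated as 1, and explicit values start at 2.
const entryVersion = (entry) =>
  entry.version === undefined ? 1 : entry.version

// An old entry, written before explicit versioning:
//   entryVersion({ payload: 'a' })             → 1
// A new entry with an explicit version:
//   entryVersion({ payload: 'b', version: 2 }) → 2
```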
Entries without identities: leave as public?
Lots of great thoughts here, thank you @vvp and @aphelionz!
> monotonically increasing version field inside of the entries themselves
Agreed. We have this as the `v` field now, as @vvp mentioned, which is set to 0 atm. For starters, we should increase the version number to 1 :)
> Entries without identities: leave as public?
Old versions also have a signature, under the `.key` field in the data structure, which maps to `identity.publicKey` in the new version.
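That mapping suggests a simple upward transform for old entries; a sketch (the helper name and entry fields are illustrative assumptions):

```js
// Hypothetical "up" transform: lift an old-style entry, which stored the
// public key under `key`, into the new shape where it lives at
// `identity.publicKey`.
function upgradeEntryIdentity (oldEntry) {
  const { key, ...rest } = oldEntry
  return {
    ...rest,
    identity: { publicKey: key } // old `.key` maps to `identity.publicKey`
  }
}
```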
> Should there be support for multiple versions on code level, or require that older log versions need to be migrated to the single code-supported version first? Supporting multiple log/entry versions can make the development quite troublesome and error-prone, whereas requiring migrations will make the upgrading process more involved (especially with larger logs).
This is very true. I don't think we can "migrate" the logs in a way that the actual entries are converted to the new structure, due to the signatures in each entry. That, I believe, leaves us with the second option of supporting multiple versions. However, as you say @vvp, this can make the code quite complex and highly error-prone, so it seems to me that the question is:
Do we want to or need to support multiple versions? If not, what are the consequences for users? If yes, what are the consequences for development and for maintainers (e.g. do we commit to supporting all versions of logs from today all the way into the far future)?
Is there a way we could provide migration tools such that the end-user initiates and authorizes the migration (i.e. they re-sign all converted/transformed entries) instead of developers building on/with OrbitDB?
> This is very true. I don't think we can "migrate" the logs in a way that the actual entries will be converted to the new structure due to the signatures in each entry. Which, I believe, leaves us with the second option of supporting multiple versions.
I've been thinking about the same thing but in https://github.com/peer-base/peer-base land and this is the way to go. There's always the latest canonical version of the data-structure and we must convert old versions to the canonical version when reading. This means we must tag those data structures with the versions and have code to migrate to the latest version incrementally.
Also, I would think that embracing conventional commits would improve the visibility of changes to developers. Many projects in the IPFS land already use them. You may check how to quickly set up the necessary tooling on some repos, for instance, this one. Basically:
- Set up husky to lint commit messages so that they follow https://www.conventionalcommits.org/en/.
- Set up `npm run release` to use `standard-version` so that it automatically bumps the version and updates the CHANGELOG.md based on the commits.
I made a comment in @satazor's PR that begins to address this: https://github.com/orbitdb/ipfs-log/pull/213/files#r244635479
Reading back through these comments, I believe we should increment the version number in the `v` field from `0` to `1` as well.
I would like to make a more formal proposal based on the discussion we had on #213.
Data-structures
It's normal for the data structures of `ipfs-log` to evolve over time. This happened once when we introduced IPLD links support, and it will eventually happen again in the future.
All the code that interacts with those data-structures should always assume that they are in the latest version. This makes it easy to reason about the code because there's only one shape of the data-structures: the most recent one. Instead of doing this in an ad-hoc manner, we should come up with a scheme that would allow us to transform those data-structures from older to newer versions and vice-versa. These are the scenarios to take into consideration:
- When reading a log or an entry, we might be reading an older version. In this scenario, we must transform the entry or log "upwards" to the most recent version.
- When reading a log or an entry, we might be reading a newer version. In this scenario, we can't transform the entry and we should error out.
- When writing a log or an entry, we might need to write it in an older version to keep the same CID. In this scenario, we must transform the entry or log "downwards" to its original version.
That said, I propose tagging all the data structures with a `v` property that contains their version. We already have that set up for entries, but not for logs.
Assuming that we now have a consistent way to identify the version of a data structure, we may have a versioning pipeline based on the following scheme:
```js
const scheme = {
  versions: [
    {
      version: 0,
      up (data) {},
      down (data) {},
      codec: { name: 'dag-pb-v0' }
    },
    {
      version: 1,
      up (data) {},
      down (data) {},
      codec: { name: 'dag-cbor', ipldLinks: ['next'] }
    }
    // more in the future...
  ],
  codecs: {
    'dag-pb-v0': {
      matches (cid, dagNode) {},
      fromDagNode (dagNode) {},
      toDagNode (data, ipldLinks) {}
    },
    'dag-cbor': {
      matches (cid, dagNode) {},
      fromDagNode (dagNode) {},
      toDagNode (data, ipldLinks) {}
    }
    // more in the future...
  }
}
```
...where:
- `scheme.versions[].version`: the version number of the version entry
- `scheme.versions[].up`: a function that receives `data` and transforms it to the next version
- `scheme.versions[].down`: a function that receives `data` and transforms it to the previous version
- `scheme.codecs[].matches`: returns true if `dagNode` is of the given codec entry
- `scheme.codecs[].fromDagNode`: retrieves the underlying `data` of the `dagNode`
- `scheme.codecs[].toDagNode`: creates a `dagNode` for the `data` to be stored, converting any `ipldLinks` to IPLD links
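To make the shape concrete, here is what one version entry could look like; the transforms are illustrative assumptions, based on the `.key` to `identity.publicKey` mapping mentioned earlier in the thread:

```js
// Illustrative version entry: `up` adds an `identity` field derived from
// the old `key`, and `down` restores the old shape. Field names follow
// the scheme above; the transform bodies are assumptions.
const versionEntry = {
  version: 1,
  up (data) {
    const { key, ...rest } = data
    return { ...rest, v: 1, identity: { publicKey: key } }
  },
  down (data) {
    const { identity, ...rest } = data
    return { ...rest, v: 0, key: identity.publicKey }
  },
  codec: { name: 'dag-cbor', ipldLinks: ['next'] }
}

const oldData = { v: 0, payload: 'x', key: '04ab', next: [] }
const roundTripped = versionEntry.down(versionEntry.up(oldData))
// roundTripped deep-equals oldData
```

Keeping `up` and `down` as exact inverses is what makes the "write back in the original version to keep the same CID" scenario possible.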
A versioning pipeline based on the scheme would have the following API:
`verPipeline.read(scheme, dagNode): data`

Reads the underlying `data` of the `dagNode`.
- Find the codec entry by calling `scheme.codecs[].matches` until one returns `true`. If none matches, error out.
- Retrieve the `data` stored in the `dagNode` by calling `fromDagNode` on the codec entry that matched.
- Find the version entry that matches `data.v` from `scheme.versions[]`. If none is found, error out.
- Transform any IPLD links into regular strings, as specified in the `codec.ipldLinks` of the version entry.
- Run every `up` function, starting from `data.v` up to the most recent version.
- Tag `data` with its original version by defining `data.ov` as non-enumerable (`ov` stands for original version).
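A minimal sketch of those read steps, assuming the scheme shape proposed above (the function name is hypothetical, the CID is passed through so `matches` can use it, and IPLD-link conversion is omitted for brevity):

```js
// Sketch of verPipeline.read over the proposed scheme structure.
function read (scheme, cid, dagNode) {
  // Find the codec entry whose `matches` returns true
  const codec = Object.values(scheme.codecs).find((c) => c.matches(cid, dagNode))
  if (!codec) throw new Error('No codec matches this dagNode')

  // Retrieve the underlying data
  let data = codec.fromDagNode(dagNode)

  // Find the version entry that matches data.v
  const startIdx = scheme.versions.findIndex((v) => v.version === data.v)
  if (startIdx === -1) throw new Error(`Unknown version: ${data.v}`)

  const originalVersion = data.v

  // Run every `up` function, from data.v up to the most recent version
  for (let i = startIdx; i < scheme.versions.length - 1; i++) {
    data = scheme.versions[i].up(data)
  }

  // Tag data with its original version as a non-enumerable `ov` property
  Object.defineProperty(data, 'ov', { value: originalVersion, enumerable: false })
  return data
}
```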
`verPipeline.write(scheme, data): dagNode`

Creates a `dagNode` for the `data`, according to its version.
- Find the version entry that matches `data.v` from `scheme.versions[]`. If none matches, error out.
- Find the version entry that matches `data.ov` from `scheme.versions[]`. If none matches, error out.
- Run every `down` function, starting from the version entry corresponding to `data.v` down to `data.ov`.
- Find the `scheme.codecs[]` entry that matches the `codec.name` property of the version entry corresponding to `data.ov`.
- Call `toDagNode` from the codec entry with the correct `ipldLinks`, based on `codec.ipldLinks` of the version entry corresponding to `data.ov`.
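And the corresponding write sketch, under the same assumptions (hypothetical function name, scheme shape as proposed above); it writes `data` back in its original version so the CID stays stable:

```js
// Sketch of verPipeline.write over the proposed scheme structure.
function write (scheme, data) {
  const findIdx = (v) => scheme.versions.findIndex((e) => e.version === v)

  const fromIdx = findIdx(data.v)
  if (fromIdx === -1) throw new Error(`Unknown version: ${data.v}`)

  // Fall back to data.v when no original version was tagged on read
  const targetVersion = data.ov !== undefined ? data.ov : data.v
  const toIdx = findIdx(targetVersion)
  if (toIdx === -1) throw new Error(`Unknown original version: ${targetVersion}`)

  // Run every `down` function, from data.v down to data.ov
  let out = data
  for (let i = fromIdx; i > toIdx; i--) {
    out = scheme.versions[i].down(out)
  }

  // Encode with the codec of the version entry corresponding to data.ov
  const versionEntry = scheme.versions[toIdx]
  const codec = scheme.codecs[versionEntry.codec.name]
  return codec.toDagNode(out, versionEntry.codec.ipldLinks)
}
```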
Public API
Changes to the public API are not as problematic as changes to the data structures.
Backwards compatibility normally comes at the cost of code complexity. Having said that, choosing to maintain backwards compatibility is a per-situation decision.
Nevertheless, a breaking change should always translate to a new major version of the module. Moreover, all changes (fixes, features, breaking changes) should be easily visible to users of `ipfs-log`. This is usually made possible via a changelog, which can be automated with the right tools. I propose the following:
- Strive for having one commit per PR. We can do this by using the `Squash` button instead of the regular `Merge` when merging a PR.
- Embrace Conventional Commits so that tools can infer the type of each commit.
- Set up commitlint to ensure commits are valid.
- Use `standard-version` to create new releases. This tool will automatically bump the version of `ipfs-log` based on the commits made since the last release (breaking changes: major, feat: minor, fix: patch) and generate the `CHANGELOG.md` file for us automatically.
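For reference, a minimal sketch of what that tooling setup might look like; the package names (husky, commitlint, standard-version) are real, but the exact scripts, hooks, and version ranges shown are assumptions:

```js
// commitlint.config.js — have commitlint enforce Conventional Commits
module.exports = {
  extends: ['@commitlint/config-conventional']
}

// package.json (fragment, illustrative) — hook commitlint into husky and
// expose a `release` script that runs standard-version:
//
// {
//   "scripts": {
//     "release": "standard-version"
//   },
//   "husky": {
//     "hooks": {
//       "commit-msg": "commitlint -E HUSKY_GIT_PARAMS"
//     }
//   },
//   "devDependencies": {
//     "@commitlint/cli": "^8.0.0",
//     "@commitlint/config-conventional": "^8.0.0",
//     "husky": "^3.0.0",
//     "standard-version": "^7.0.0"
//   }
// }
```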
Let me know your thoughts!