ref-fvm version philosophy
jennijuju opened this issue · 23 comments
Starting a discussion for the ref-fvm version philosophy.
Things to consider:
- Do we associate FVM versions with network upgrade versions?
- What approach could support multiple FVM versions in the future?
- What should the FVM version be for the network v16 (M1) FVM?
There are roughly two approaches here:
- One FVM release per network version, with client importing all and selecting which version of the FVM to instantiate depending on the network version.
- Pros: theoretically cleaner; every network version is self contained.
- Cons: integration through the FFI, etc., is going to be quite hard; higher complexity; binary bloat.
- Single FVM import, with support for network versions added through version guards (like today).
- Pros: much simpler; easier to reason about; easier to develop / manage the codebase; smaller binary sizes; no FFI drama.
- Cons: less clean; no isolation/self-containment.
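For concreteness, a version guard in approach (2) is just a conditional on the network version; a minimal sketch, where the type, the syscall, and the gas numbers are all hypothetical and not ref-fvm's actual API:

```rust
// Hypothetical sketch of a network-version guard (approach 2):
// one codebase, behaviour switched at runtime. Numbers are made up.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
pub struct NetworkVersion(pub u32);

pub const NV16: NetworkVersion = NetworkVersion(16);
pub const NV17: NetworkVersion = NetworkVersion(17);

/// Price a hypothetical syscall differently before and after a
/// gas change that (in this sketch) ships in nv17.
pub fn syscall_gas(nv: NetworkVersion) -> u64 {
    if nv >= NV17 {
        150 // new pricing from nv17 onwards
    } else {
        100 // legacy pricing for nv16 and earlier
    }
}
```

The cost of this approach is that such guards accumulate in the codebase over time, which is the trade-off discussed below.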
I am more inclined toward (2), and it's going to take quite a bit of effort to convince me that (1) is the better approach.
@Stebalien Please add your input here, and let's aim to close this issue with a final decision.
First, because I'm pedantic and want us to be on the same page:
Multiple FVMs:
- Bloat: This isn't correct. It'll all be in a single crate so the only "bloat" will be from the FVM code itself, not the dependencies. The FVM logic itself is pretty tiny and many of the types actually live in "shared".
- Integration in the FFI: this is pretty easy. We'll need a small abstraction layer, but that comes down to: a constructor, an "apply message" method, and a "flush" method. We'd also likely want to break out the "externs" interface into a new crate, but that's not difficult.
- High complexity: I'm not sure how not having to support old network versions in the FVM will lead to higher complexity.
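The small abstraction layer mentioned above (a constructor, an "apply message" method, and a "flush" method) could boil down to a three-method surface; a hedged sketch with entirely hypothetical types, not the actual FFI interface:

```rust
// Hedged sketch of the FFI abstraction layer: a constructor plus
// "apply message" and "flush". All types here are hypothetical.
pub struct Message {
    pub method: u64,
}

pub struct Receipt {
    pub exit_code: u32,
}

pub trait Machine {
    fn apply_message(&mut self, msg: &Message) -> Receipt;
    /// Flush pending state and return the new state root (bytes here).
    fn flush(&mut self) -> Vec<u8>;
}

/// One concrete FVM version hidden behind the abstraction.
pub struct FvmV1 {
    state_root: Vec<u8>,
}

impl FvmV1 {
    pub fn new(state_root: Vec<u8>) -> Self {
        Self { state_root }
    }
}

impl Machine for FvmV1 {
    fn apply_message(&mut self, _msg: &Message) -> Receipt {
        // A real implementation would execute the message.
        Receipt { exit_code: 0 }
    }
    fn flush(&mut self) -> Vec<u8> {
        self.state_root.clone()
    }
}
```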
Single FVM:
- Simpler: See my previous comment.
- Easier to reason about: Multiple FVM versions means less complexity in each version, which should be easier to reason about. Unless you mean "easier to reason about what changed", in which case I agree.
- Easier to develop/manage: I agree.
So really, the question is development. If we have multiple versions:
- Backporting fixes becomes annoying.
- The "new" version won't be stress-tested on the network until the network upgrade.
- It can get harder to see what was "introduced" in the new network version.
But I think I have a reasonably balanced solution:
- Develop the "next" version with version conditionals.
- After the network upgrade, release a new FVM version that doesn't support previous network versions.
This has some advantages:
- We don't need to support "old" network versions in master.
- We can punt the FFI abstraction until NV18 (because we'll build NV17 in the same FVM version as NV16).
- Network upgrades won't immediately switch to a "fresh" FVM.
Effectively this means:
FVM v1.0 will support NV16.
FVM v1.1 will support NV16 and NV17.
FVM v1.2 will support NV17.
FVM v1.3 will support NV17 and NV18.
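Under this sliding scheme, a client tracking historical state would pick the FVM import per network version; an illustrative lookup that hard-codes the versions listed above (a hypothetical helper, not an actual crate API):

```rust
/// Hypothetical helper hard-coding the sliding scheme above:
/// which FVM v1.x minors support a given network version.
pub fn supported_fvm_minors(nv: u32) -> &'static [u32] {
    match nv {
        16 => &[0, 1],    // v1.0 and v1.1
        17 => &[1, 2, 3], // v1.1, v1.2, and v1.3
        18 => &[3],       // v1.3
        _ => &[],
    }
}
```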
Now that we're embarked on M2 development, I'd like to revisit this discussion and come to a consensus. I'd like to hear from client implementors about the "sliding versions" approach, as they are the main stakeholder that will need to reason about ref-fvm versions.
Tagging: @filecoin-project/lotus-maintainers @filecoin-project/forest @filecoin-project/venus
- FVM 1.0 has shipped (well, it's in RC) with nv15 and nv16 support. nv15 support only exists for testing purposes and we can assume it's a one-off breakage of the above model due to the big milestone, the legacy VM => FVM migration. To clean this up, we could:
- a) Retain nv15 only in RCs, and drop nv15 before we ship the final v1.0.0; however, this requires a change in code between the last RC and final, which is risky and not something I'd advise.
- b) Ship 1.0.0 final with nv15 and nv16, and immediately follow up with a 1.0.1 that removes nv15 support, to be adopted by clients in their next bugfix release.
- If a client wants to maintain historical sync ability (e.g. Lotus), it will need to import every even minor ref-fvm version (1.0, 1.2, 1.4) and the FVM selector abstraction described above. This is true, except for upgrade releases which would import even minor ref-fvm versions except for the latest, which would be an odd ref-fvm version (1.0, 1.2, 1.3).
@Stebalien Separately from the above, I wanted to talk specifically about version number usage. The model you propose implies staying on v1 forever, but I'm pretty sure we'll introduce breaking changes in M2 and beyond. So staying on v1 would be incorrect. There are three approaches possible IMO:
- Staying on v1 forever, disregarding breaking changes as events that would bump the major version. I strongly oppose this.
- Adopting semver, breaking changes bump major version. We'd end up with an uneven version history (some NVs will break APIs, others won't), and we'd need a tracking table to reason about version alignment.
- This adds cognitive overhead.
- We don't expect public API breakages outside of network version upgrades, so what's primarily semantically relevant to us is network version alignment.
- Every network version gets a major bump. Not every NV will break interfaces, but this establishes a predictable and straightforward pattern.
- The N.0.0 version could be the one that supports the before and after network versions.
- The N.1.0 version would drop support for the before network version.
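The numbering rule in this option can be captured mechanically; a hypothetical illustration, assuming the anchor nv16 => v1 (so nv17 => v2, and so on), with N.0.0 retaining the prior NV and N.1.0 dropping it:

```rust
/// Hypothetical mapping under "major bump per network version",
/// anchored at nv16 => v1 (so nv17 => v2, nv18 => v3, ...).
pub fn major_for_nv(nv: u32) -> u32 {
    nv - 15
}

/// Which network versions a release supports in this scheme:
/// N.0.0 keeps the previous NV for canary testing; N.1.0 drops it.
pub fn supported_nvs(major: u32, minor: u32) -> Vec<u32> {
    let nv = major + 15;
    if minor == 0 && major > 1 {
        vec![nv - 1, nv]
    } else {
        vec![nv]
    }
}
```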
I am not a client implementer so my suggestions should carry relatively little weight. I think that Raul's original suggestion of single FVM import with support for multiple versions has the best potential to not be a pain. Manage the versioning of behaviour internally in the code for FVM, and maintain as much historical support as reasonable through methods within the code, rather than external to the code in repository branching and versioning. I wouldn't weigh bundle size very highly yet, but maybe a stripped-down build could drop historical support in the future.
I would not couple the versions of FVM to network upgrades very tightly.
- There will be many reasons to release new (major) versions of the VM that don't change its consensus-critical behaviour: new client APIs, better inspection facilities, optimisations, debugging tools, etc. You don't want to hold them back for network releases. Clients should be able to upgrade to the latest hotness at any point (and I would try to extend this to full archive clients).
- Necessary changes to behaviour for network upgrade will require new FVM releases. But such release could come well before the network "turns on" the new behaviour via network version upgrade.
I would not couple the versions of FVM to network upgrades very tightly.
I really think it depends on the release, but I expect that M2 will have some significant changes (especially around how accounts are handled). I expect that most network upgrades won't involve versioning the FVM itself as backwards compatibility will become more and more important as we get user-defined actors.
In terms of abstracting over multiple versions, I was assuming we'd do this at the FFI layer, but we can always add a new "abstraction" crate that can pull in older versions of the FVM. That should be significantly cleaner than a bunch of ifdefs.
@anorth Now that we have a better line of sight into the kinds and magnitudes of code changes that will go into M2, as well as future features, I'm withdrawing my skepticism around versioning ref-fvm and having multiple version imports to support various NVs. It will lead to cleaner code, and a "flip the page" mental model as we ship new ref-fvm releases supporting new NVs.
I fear that the build up from supporting every network version in the past will rapidly muddy the codebase. There are three kinds of changes that can go in network upgrades. We're OK with the first, kinda OK with the second one, and definitely not good with the third:
- Backwards compatible changes. For example, reachability analysis, refcounting/linkstore, etc. These don't require network guards because the new logic should be a noop in terms of consensus with the previous logic.
- Easily gateable/configurable changes, e.g. gas, the set of wired syscalls, etc. These would be simple ifs.
- More fundamental/invasive changes, e.g. account abstraction model, code upgrades, etc. If we have to selectively enable these features, the code would get very unreadable.
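As an example of the second category, gating the set of wired syscalls on the network version really is a one-line `if`; a sketch in which every syscall name is made up:

```rust
// Hypothetical example of an "easily gateable" change: the set of
// wired syscalls depends on the network version. Names are made up.
pub fn wired_syscalls(nv: u32) -> Vec<&'static str> {
    let mut syscalls = vec!["send", "get_root", "block_open"];
    if nv >= 18 {
        // Sketch: a syscall introduced in a later network version.
        syscalls.push("new_fancy_syscall");
    }
    syscalls
}
```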
@Stebalien and I caught up today and we're aligned on this approach.
- Every network version is a new major version of ref-fvm, with a "flip the page" mental model.
- nv16 => ref-fvm v1, nv17 => ref-fvm v2, nv18 => ref-fvm v3...
- What do we do with NVs that don't require ref-fvm changes? Do we strive for uniformity (and re-release with no changes), or do we omit the bump? I prefer the latter; the relationship between ref-fvm versions and NVs would be one-to-many.
- Code changes can safely assume that they're targeting only that network version onwards. Therefore, they don't need to switch behaviours depending on network version.
- However, every new version will strive to maintain backwards compatibility with the previous version, as this facilitates canary testing on mainnet. This may not always be feasible/sustainable. If during development the effort/value tradeoff is no longer justified, we will drop compatibility for that release.
- Backwards compatibility would only exist during alpha, beta, pre, and rc releases (as it's only useful for testing).
- For the final release, all backwards compatibility will be dropped.
- When performing upstream dependency upgrades, we may backport them to previous ref-fvm versions as this will keep the dep graph small and the resulting filecoin-ffi library lighter.
- On Monday, after #605 is merged, we will create the `release/v1` branch and master will switch to v2.
- We can have a selector crate that selects the right fvm dependency given a NV or an epoch (assuming mainnet, always).
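The selector crate could reduce to a single match from network version to ref-fvm major, following the nv16 => v1, nv17 => v2, nv18 => v3 mapping above; a hypothetical sketch (an epoch-based selector would just wrap this with an epoch-to-NV lookup):

```rust
/// Hypothetical selector: which ref-fvm major to instantiate for a
/// given network version (mainnet assumed), per the mapping above.
#[derive(Debug, PartialEq)]
pub enum FvmVersion {
    V1, // nv16
    V2, // nv17
    V3, // nv18
}

pub fn select(nv: u32) -> Option<FvmVersion> {
    match nv {
        16 => Some(FvmVersion::V1),
        17 => Some(FvmVersion::V2),
        18 => Some(FvmVersion::V3),
        _ => None,
    }
}
```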
Code changes can safely assume that they're targeting only that network version onwards
What if it's a bug fix that's not consensus breaking? Are we going to use a minor version for that?
A question: what will happen in the following situation?
- master version is `v2-dev`, network version 88
- feature A, introduced by FIP-1001, gets implemented and merged into `v2-dev`
- feature B, independent of feature A and introduced by FIP-1002, gets implemented and merged into `v2-dev`
- core devs decide to finalize only FIP-1001 in nv89

How will release `v2` be cut?
Every network version is a new major version of ref-fvm, with a "flip the page" mental model.
For M2, yes. But not necessarily every network version. We're going to have a lot of network versions that don't require large changes to the FVM. Really, I expect most upgrades to require much smaller changes like adding syscalls.
core devs decide to finalize only FIP-1001 in nv89. how will release v2 be cut?
It really depends on the change. IMO, we'd want to add some version guards on the FIP-1001 feature. I think our primary concern is accidentally introducing consensus-breaking changes as we refactor large portions of the FVM. If the code is already there (even if it's behind an if statement), there's a lot less to be concerned about.
If the change is really invasive, we might just have to revert it.
core devs decide to finalize only FIP-1001 in nv89. how will release v2 be cut?
The decision of what new things get finalized should probably happen around the `beta` and `pre` (maybe `rc`?) stages. Managing these release stages (I'm guessing with different branches, or some CI machinery) seems like it may become cumbersome, but it appears to work nicely for rust-lang.
Every network version is a new major version of ref-fvm, with a "flip the page" mental model.
For M2, yes. But not necessarily every network version. We're going to have a lot of network versions that don't require large changes to the FVM. Really, I expect most upgrades to require much smaller changes like adding syscalls.
Agree, but the point I'm trying to drive home is that regardless of the scope/impact of the change, the mental model is: "this new version doesn't need to preserve backwards compatibility."
core devs decide to finalize only FIP-1001 in nv89. how will release v2 be cut?
The decision of what new things get finalized should probably happen around the `beta` and `pre` (maybe `rc`?) stages. Managing these release stages (I'm guessing with different branches, or some CI machinery) seems like it may become cumbersome, but it appears to work nicely for rust-lang.
So we only cherry-pick the needed PRs when it's ship time? If that's the case, SGTM!
This discussion resulted in a successful outcome above. Closing.
@lemmih from the Forest team has raised that currently different versions of the `fvm` crate are published to support different Filecoin network versions: for example, `fvm@2.0.0` for nv16/nv17 and `fvm@3.0.0` for nv18.
"Such a scheme is problematic because version 2 of the fvm crate cannot be extended with breaking changes. Its dependencies will become dated and security patches will not be back-ported. As such, version 2 will for all practical purposes, be unmaintained. From Forest's perspective, this is a complete show-stopper for us."
Steb pointed out that currently such patches are being released as minor versions. Not only is that bad semantic versioning; more importantly, it means minor versions may introduce breaking changes that will break downstream users. If such behaviour is expected, there should be a prominent warning to users in the README.
But in general, that's not good engineering practice. Both David and Steb suggested that we should respect semantic versioning more strictly and introduce a new crate name for each version of the fvm, e.g. `fvm_v1` for nv16/nv17 and `fvm_v2` for nv18. Each of those crates could then be maintained.
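With per-version crate names (or Cargo's dependency renaming via the `package` key), a client could depend on several FVM versions side by side; a hypothetical Cargo.toml fragment with illustrative version numbers:

```toml
# Hypothetical Cargo.toml fragment: import two FVM versions side by
# side using Cargo's dependency renaming (versions are illustrative).
[dependencies]
fvm2 = { package = "fvm", version = "2" }
fvm3 = { package = "fvm", version = "3" }
```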
From the filecoin-ffi side I have some concerns around scope expansion of the filecoin-ffi project with this fvm release strategy. For fil-crypto team, we are using filecoin-ffi to export Go bindings for our API crate, filecoin-proofs-api. The filecoin-proofs-api rust crate handles which APIs are available for which network versions (groth16, halo2, testudo, vector commits, etc).
For FVM in filecoin-ffi, it is also exporting a Golang interface, but in addition to that it is including a translation layer to handle differences between the different fvm crates (fvm2, fvm3). I would suggest moving the translation layer into its own wrapper/API Rust crate to make sure the various FVM crates play nice with each other and dependencies are resolved at that layer. This would make the filecoin-ffi build much simpler, and also help out the Rust-based Filecoin clients which do not use filecoin-ffi but may need to build with multiple FVM versions and would otherwise have to deal with resolving those dependency conflicts in their own projects.
But in general, that's not good engineering practice. Both David and Steb suggested that we should respect semantic versioning more strictly and introduce a new crate name for each version of the fvm, e.g. `fvm_v1` for nv16/nv17 and `fvm_v2` for nv18. Each of those crates could then be maintained.
What about supporting all previous network versions in one fvm release and handling the logic internally in the crate? That way you would be able to use whatever network version you desire for your project but still always have the latest code.
We had the same discussion around a year ago but I cannot recall where, maybe on slack...
I'm strongly in favor of making the `fvm` backwards compatible with older network versions. If accidental changes to the logic of older network versions are a concern, we at ChainSafe would gladly set up the infrastructure for running exhaustive compliance tests: for each new release of the `fvm`, we could re-validate every block of the mainnet blockchain (starting at nv16, when the FVM became mandatory). This would allow the code to be actively maintained while guaranteeing that the logic of the past stays unchanged.
It's not about accidentally breaking things; it's about the complexity of supporting multiple versions. We've shipped the existing versions of the FVM knowing that we'll need to change things in the future. Future FVM versions will:
- Add additional validation logic not present in previous versions.
- Completely change many of the system calls.
- Potentially change the entire syscall ABI.
- Change how/when we charge gas.
Making these changes depend on a network version internally would be difficult, to put it mildly. We're not making minor tweaks and adjustments, we're making very large changes to how everything works under the covers.
Ideally, we'd have a shim inside the FVM itself to mux between different FVM versions.
Also note: once we have multiple FVM crates, this should become much easier. We can then make backwards incompatible changes to prior FVM crates such that the APIs use common shared types. That will let us write a simple abstraction to multiplex between multiple VM versions.
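Such a shim could be a plain enum that multiplexes over the versioned crates once their APIs share common types; a hypothetical sketch in which the machine types are stand-ins, not the real fvm2/fvm3 APIs:

```rust
// Hypothetical multiplexer over two versioned FVM crates, assuming
// their APIs have been adjusted to use common shared types.
pub struct SharedMessage {
    pub method: u64,
}

pub struct SharedReceipt {
    pub exit_code: u32,
}

// Stand-ins for machines from the fvm2 / fvm3 crates.
pub struct MachineV2;
pub struct MachineV3;

pub enum AnyMachine {
    V2(MachineV2),
    V3(MachineV3),
}

impl AnyMachine {
    pub fn fvm_major(&self) -> u32 {
        match self {
            AnyMachine::V2(_) => 2,
            AnyMachine::V3(_) => 3,
        }
    }

    pub fn apply_message(&mut self, _msg: &SharedMessage) -> SharedReceipt {
        // In reality each arm would call into its own crate's machine.
        match self {
            AnyMachine::V2(_) => SharedReceipt { exit_code: 0 },
            AnyMachine::V3(_) => SharedReceipt { exit_code: 0 },
        }
    }
}
```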
I think it's a philosophical discussion. In general, blockchains by nature want to be backwards compatible, as the history is, to some degree, always important in the present. I have noticed that Filecoin works differently here, but traditionally you want to maintain that extra logic to ensure future compatibility with previous protocol versions. Outsourcing that logic into stand-alone crates basically results in letting this part of the protocol die, as nobody wants to maintain marginally used crates. You could argue the same is true for having everything in one crate; however, that's the better design decision architecture- and maintenance-wise (in my opinion).
@q9f I understand the point you're raising, but neither ref-fvm nor this versioning policy limits or conditions historical support in any manner, as far as I can tell. Clients are free to retain support for historical network versions in their codebases, and that support would just use the corresponding ref-fvm major version.
Definitely agree that the ref-fvm selector should live in a separate repo or ref-fvm (I prefer the latter), so that Forest can reuse it.