uber/neuropod

Things to figure out before a major release

VivekPanyam opened this issue · 0 comments

Neuropod's release cadence has been a bit odd in that we've released a bunch of RCs since the initial public release, but haven't put out a new "stable" release (although the RCs are generally pretty stable and are used in prod).

There are a few things blocking a stable release:

  • ABI compatibility strategy (along with the ability to test and enforce it in CI)
  • Versioning strategy and the guarantees we're willing to stick to (e.g. a backend will work with all Neuropod libraries within a specific {MAJOR/MINOR}1 version, there will be a release with breaking changes at most once a quarter, etc)

1 This depends on if we want to follow semver or not. For example, PyTorch does a minor release every quarter (~90 days), but these minor releases contain backwards incompatible changes (which goes against semver rules): https://github.com/pytorch/pytorch/releases

Also see #539 (comment) and #552 (comment)

I think we need to intentionally decide on what to commit to before actually making a major release so feel free to leave thoughts in the comments.

Proposal

One approach is to do major releases with breaking changes at most quarterly and minor releases as necessary (e.g. when a new backend version is released).

  • No breaking changes in the core library public interface within a major release
  • Backend ABI compatibility within a major release

The benefits of backend ABI compatibility is that users can upgrade the core library (within a major version) without having to redownload all the backends they are using (this could be several GB). Currently (except for the last RC), every upgrade of the core Neuropod library required redownloading all the backends for the new version. This is especially impractical if backends upgrades are not done by the same team upgrading the core library.

The part I anticipate being difficult is ensuring we don't accidentally break backend ABI compatibility (forwards or backwards) within a major version. There are tools to help with this, but the situation within Neuropod is a little tricky. There are dependencies from (Neuropod core -> backend) and potential dependencies going the other way (backends -> Neuropod core).

One way to deal with this is to maintain ABI compatibility for both the core library and the backends, but that may be suboptimal. There are standard ABI compatibility checker tools, but we make extensive use of templates in the Neuropod public interface so I'm not sure how well those will work.

There are other solutions, but they require being careful about structuring includes, hiding symbols in libraries, maintaining field ordering in certain data structures, etc.

Most importantly, I think we need a robust way to check that we're not breaking anything in CI.

Two possible paths forward:

Option 1

  • Restructure files within Neuropod to clearly reflect what is part of the public interface and what is not. Add restrictions within Bazel so we don't accidentally depend on internal headers from public ones
  • Enforce that backends cannot depend on any external Neuropod symbols (and must compile them into the backend library itself as hidden symbols).
    • Note: this fixes some problems, but then adds implicit dependencies on us not changing internal field ordering, etc for types compiled into both the core library and the backends.
  • Check ABI compatibility on the backends in CI.

Option 2

  • Restructure files within Neuropod to clearly reflect what is part of the public interface and what is not. Add restrictions within Bazel so we don't accidentally depend on internal headers from public ones
  • Enforce that backends can only depend on the public Neuropod core interface
  • Check ABI compatibility for the public interface for both the core library (backwards only) and the backends (forwards and backwards). The tools we use need to be able to handle templated classes well (e.g.
    // A TypedNeuropodTensor is a NeuropodTensor of a specific type.
    // Backends should extend this class directly for their tensor implementations
    template <typename T>
    class TypedNeuropodTensor : public NeuropodTensor
    )

Feel free to leave any thoughts below. I think the major requirement for the solution we pick is that we need to be able to programmatically test and enforce it in CI.

If we can use an ABI compatibility checker that handles templates well, I'd prefer option 2 above as it seems more robust.