instance specs could be improved
Closed this issue · 2 comments
The problem(s)
The instance spec types in the propolis_api_types
crate are an important conceptual part of the live migration protocol, but there are several things about the way I implemented them that leave things to be desired:
- Well-formed specs are not necessarily valid
It's easy to construct an instance spec that type-checks but can't be used to start an instance. For example, it's trivial to have a spec with a storage device with no valid backend.
- Specs have a complex hierarchy of component types
Instance specs are composed of device specs and backend specs, each of which are further divided into their own fields and sub-collections. This makes adding new components while respecting the spec versioning rules something of a hassle (and the extra structure doesn't even guarantee the specs are valid!).
- Not all components have names
Live migration assumes that all (migratory) VM components have names. This allows Propolis to send a source component's saved state to the correct counterpart on the target. While many spec components are stored in a HashMap<String, Component>
, some are not, so Propolis has to generate names for them, which then have to be kept consistent between Propolis versions.
- There are three different spec builders
One lives in propolis_api_types
. propolis-server
then wraps that in its own builder, which can also convert config TOML entries and InstanceEnsureRequest
components to spec components. Yet a third builder lives in propolis-client
; it's basically the same as the builder in propolis_api_types
, except that it operates on Progenitor-generated types instead of the raw types.
propolis_api_types
contains a lot more than just types
For a crate that's ostensibly just a bag of types, this crate has a lot of code, most of which is only ever used by Propolis servers. Other Propolis components that want Just The Types, Please shouldn't need to pull in all the server-only methods.
- Some component types are conditionally compiled
The SoftNPU component definitions in propolis_api_types
are behind Falcon feature guards. This forces us to maintain two separate OpenAPI documents, one with Falcon enabled and one without, which can lead to annoying rake-steps in CI if one forgets to check the build for one of the two variants. (In fact, as I write this, PHD builds are broken when Falcon is enabled because of a missing patch
directive in the Falcon client generator...)
Proposed solutions
I propose the following changes to fix these problems:
- Create an internal
Spec
type for propolis-server
This type is the VM component manifest type that Propolis server uses internally (in particular during VM setup). This spec isn't Serialize
, isn't part of the API, and isn't used in the live migration protocol. Because this spec can be structured however we like without concern for versioning it, it can (and will) be structured so that illegal states are unrepresentable.
- Flatten
propolis_api_types::v0::InstanceSpecV0
Instead of a complex hierarchy of component types, the "on the wire" spec used in the API and LM protocol is just a mainboard and a HashMap<String, Component>
. Adding a new component type is very easy: just define a struct for it and add it to enum Component
. Because all components are now in a map, they all have names.
- Define conversions between the two spec types in propolis-server
propolis-server implements From<InternalSpec> for VersionedInstanceSpec>
and TryFrom<VersionedInstanceSpec> for InternalSpec
and converts to/from its internal type at the API and LM protocol boundaries. The TryFrom
impl gives us a "parse, don't validate" discipline for incoming API specs.
- Just one builder in propolis-server
propolis-server understands how to convert config TOML elements and InstanceEnsureRequest
fields into spec elements and feeds them into a single InternalSpec
builder. The TryFrom<VersionedInstanceSpec>
impl also operates in terms of this builder. The propolis_api_types
builder and propolis_client
builders are removed. The only user of the latter (PHD) builds API specs simply by adding components to a new V0 spec's map. (This plan assumes there won't eventually be other Propolis clients who want to initialize VMs using raw specs. I had originally planned for the control plane to do this, but have changed my mind and now think it should continue using the existing instance-ensure API. See the headnote to RFD 344 for more details.)
- Move migration compat checks into propolis-server
The migration compatibility traits that live in propolis_api_types
are only useful in the LM protocol. They should live in propolis-server instead (which can then implement them in terms of its internal spec representation, if it's convenient to do so).
- Remove
#[cfg(feature)]
guards from API types
Instead of compiling components out and having different clients built with different feature sets, there's just one client that defines all possible component types. If a server doesn't support a component a client has asked for, it rejects the spec (once again made easier by having an internal spec type to parse into--if a feature is compiled out of the server, then the internal type won't have fields for its components).
I have drafted a set of changes that achieves these goals and am pretty happy with how it looks; indeed, I've mostly filed this issue to have a wall of text to link the PRs to. I'll start putting those up for review presently.
Doing this now lays the groundwork for adding CPUID information to instance specs and initializing migration targets from sources' specifications. (It's also much easier to correct foundational mistakes like these while the API and migration protocol can still change without strictly requiring new versions of everything.)
All of the recommendations described here seem good to me!
Another improvement for the pile: propolis_api_types
contains some non-default Debug impls for some of the component types, such as the blob storage backend1 and Crucible backend. Over in propolis-client
, however, we're using the Progenitor patch
directive to add Debug impls to several top-level instance spec types, which (I believe) causes Progenitor to derive Debug for all of the possible component types. The upshot of this is that we can't manually derive Debug for the generated types in the client, which can cause their output to be... verbose2. It would be nice to fix this; the simplest thing is likely to be to switch to a Progenitor replace
directive once the final component type structure is in place.
Footnotes
-
base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" ↩
-
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ↩