oxidecomputer/helios

split and regularise feature setup

Opened this issue · 4 comments

As time has gone on, the initial set of features has expanded, and we've also learned more about use cases and needed tooling. We now have, at least, the following set of features one may select at build time:

  • repo_redist
  • repo_url
  • omicron1
  • opte
  • optever
  • tofino
  • mfg
  • stlouis
  • compliance
  • stress

We now know that we have 3 or 4 distinct use cases:

  • recovery/installinator/factory programming, which can either be collapsed into one use case or, temporarily, keep factory programming separate
  • benchtop, where the "old school" tooling remains far and away the most useful way to operate
  • in-rack, without control plane software
  • production

We would expect that in-rack without control plane software (where compliance-pilot takes its place) will go away once upstack software matures, though that use case may also remain relevant longer for benchtop setups with Sidecars where there is a desire to more completely emulate the in-rack environment e.g. for Hubris CI.

Unfortunately, compliance-pilot, while an excellent bootstrapping tool for in-rack use, is not really suited to downstack benchtop investigations. There we still want to be able to set up arbitrary tooling in /data; putting dumps in /tmp, for example, is not really suitable. IOW, we had it mostly right the first time for this use case, even though this is emphatically unsuitable for anything involving control plane software (witness the sn21 debacle). So we should keep this and coalesce it into a "benchtop" feature set that is mutually exclusive with others.

omicron1, opte, and optever all seem to be essential to the production image and could be in the default feature set (removed for benchtop and bench-CI).

The stlouis feature currently controls only two things: inclusion of T6 firmware, and when building with the compliance feature, inclusion of an obsolete t6-mission service. This feature should be deleted, with inclusion of T6 firmware made universal and the obsolete service removed.

The tofino feature should be universal, as everything needed for it has landed in stlouis.

The stress feature can be rolled into the benchtop feature.
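The layout proposed above can be read as a small resolution rule. The following Python sketch is illustrative only and is not how helios-build handles features: the names `UNIVERSAL`, `PRODUCTION_DEFAULT`, `BENCHTOP`, `EXCLUSIVE_WITH_BENCHTOP`, and `resolve_features` are hypothetical, and the exact set of features that benchtop excludes is an assumption.

```python
# Hypothetical model of the proposed feature layout; not helios-build code.
UNIVERSAL = frozenset({"tofino"})  # proposed to be always on
PRODUCTION_DEFAULT = frozenset({"omicron1", "opte", "optever"})
BENCHTOP = frozenset({"stress"})  # stress is rolled into benchtop

# Assumption: benchtop excludes compliance and the production defaults.
EXCLUSIVE_WITH_BENCHTOP = frozenset({"compliance"}) | PRODUCTION_DEFAULT


def resolve_features(benchtop=False, extra=()):
    """Return the effective feature set for an image build."""
    extra = frozenset(extra)
    if benchtop:
        conflicts = extra & EXCLUSIVE_WITH_BENCHTOP
        if conflicts:
            raise ValueError(
                f"benchtop cannot be combined with: {sorted(conflicts)}"
            )
        return UNIVERSAL | BENCHTOP | extra
    return UNIVERSAL | PRODUCTION_DEFAULT | extra
```

Under these assumptions, `resolve_features()` yields the production set, `resolve_features(benchtop=True)` yields only `tofino` and `stress`, and asking for benchtop together with `compliance` is rejected rather than silently merged.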

There is undoubtedly scope for further discussion and work here to best accommodate the needs of various teams. The above probably represents primarily two points of view: someone who wants the old benchtop environment because what one gets when building with compliance is very much not desired, and someone who is looking over the OS feature gating and looking for opportunities to remove complexity that used to be needed but is no more.

I thought I had removed the stlouis feature flag completely after Andy landed t6init -- where do you see it?

I have been looking at introducing TOML-based templates, FWIW, that would specify sets of features and give them a name.

If we had a benchtop.toml would you just want that to include:

  • the old postboot.sh that looks for a data pool, imports it, and runs the postboot script it finds within that pool
  • the files that disable even the blank root password

... and you'll deal with everything else? Would you want it to bring up IPv6 link local addresses on cxgbes and igbs if found and enable root SSH as well?

> I thought I had removed the stlouis feature flag completely after Andy landed t6init -- where do you see it?

In a change of mine that I'd had pending. You have indeed removed it; thank you!

> I have been looking at introducing TOML based templates FWIW that would specify sets of features and give them a name.

This would be a great step toward what I've described poorly here, yes.

> If we had a benchtop.toml would you just want that to include:
>
> * the old `postboot.sh` that looks for a data pool, imports it, and runs the postboot script it finds within that pool
>
> * the files that disable even the blank root password
>
> ... and you'll deal with everything else? Would you want it to bring up IPv6 link local addresses on cxgbes and igbs if found and enable root SSH as well?

This is a good question. Bringing up link-local addresses and starting sshd is fantastic if one is working on a benchtop system where the only console device available is the SWD proxy, which is ~150 baud. That said, if we're still running the local postboot script, it's necessary to start the network by hand only once which isn't too bad. The reason I think we should do none of this is that during bringup and perhaps some not-terribly-unusual benchtop use cases, even bringing up a datalink can panic or hang. I'm thinking specifically of oxidecomputer/stlouis#61 but there are plenty of other bugs that behave similarly. So my gut feeling here is that if one is using this, it's because either we're bringing up a board or we're doing debugging of some kind that just can't be done in-rack. In that situation I think it's pretty clear that one should expect very little. It might also be a good idea to log and spew a warning that the behaviour of the machine depends on local one-off state that is not part of the OS image (i.e., hi there lab machine user, be sure to set up this lab machine instead of making assumptions about what the person before you did) to help prevent sn21-type problems.
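A minimal sketch of that "do very little, but warn" behaviour follows. It only computes a plan and performs no actions; the function name `plan_postboot`, the step labels, the default pool name `data`, and the script location `/data/postboot.sh` are all hypothetical choices made for illustration.

```python
# Warning text paraphrased from the suggestion above.
WARNING = (
    "WARNING: the behaviour of this machine depends on local one-off "
    "state that is not part of the OS image"
)


def plan_postboot(importable_pools, pool="data", script="postboot.sh"):
    """Return (steps, warnings) for a benchtop boot.

    No datalinks are plumbed and sshd is not started; if the data pool
    is absent, nothing happens at all.
    """
    if pool not in importable_pools:
        return [], []
    steps = [
        ("import-pool", pool),
        ("run-script", f"/{pool}/{script}"),  # hypothetical path
    ]
    # Warn only when local state will actually influence the boot.
    return steps, [WARNING]
```

Emitting the warning only when a local script is about to run keeps the message meaningful: a machine without a data pool behaves exactly as the image dictates.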

I'm also thinking here about this being usable at a repair depot, either until we have a set of boards to plug in that can mimic being in the rack or for problems that require probing of one kind or another. That may end up being entirely different, I don't know. But one option would be to stick a diagnostic SSD in the machine that contains stuff that this mechanism would run.

MVP+1 no longer makes sense given the decision to ship something nonviable, so I've moved this to MVP. It is part of a larger set of changes needed to the build system to have each software and firmware component (e.g., stlouis, omicron) deliver independently to a dock from which something like helios-build can assemble everything into a single image. Related work includes building releases (things customers can get) entirely from source with a single shared proto area and no incremental builds. The entire set of work is required for the lowest level of build reproducibility, and therefore for viability.