oxidecomputer/propolis

PHD: guest OS adapter for Windows Server 2022/2019/2016


Provided that we start with images that have EMS enabled and an appropriate admin password set, it's not too hard to write an adapter that gets us to a Windows command prompt on these guest OSes. The bigger challenge is what to do from there. One of the design goals of PHD is to make tests as concise as possible, and that's hard to do if every test needs conditional logic to decide whether to use Unix-flavored or Windows-flavored commands. An emulation layer like Cygwin might help with this. Indeed, I've built a Windows Server 2022 image that includes Cygwin and seems to work fine with PHD, but I'm running into some difficulty with WS2019 that I'm still investigating (will detail this in an issue comment once I've dug in more).

Two hiccups with the Cygwin/Server 2019 configuration so far:

  • When running with Cygwin 3.4.9, in some cases, trying to interact with Cygwin over the serial console causes the bash process it spawns and its corresponding conhost process to get stuck in a CPU busy loop. I took a performance trace and observed a ton of context switches where cygwin1.dll was calling WriteConsoleInputW and then switching out to a conhost thread, only to be switched back in immediately by that thread. The Cygwin Git repo history shows a few changes to its main input processing thread's logic (which is the logic that calls the console APIs) over the last couple of years. I'm not sure what specifically the problem is here, but it looks like downgrading to version 3.0.7 of the Cygwin packages (including the main Cygwin glue DLL) mitigates the issue.
  • Server 2019's SAC channel appears to differ from Server 2022's in that the former tries very hard to emulate an 80x24 VT100 terminal while the latter does not. That is, while Server 2022 happily echoes the newlines PHD types into it and prints trailing spaces when setting up a Bash prompt and running commands, Server 2019 prefers to use VT100 cursor positioning commands to move the cursor around. PHD's terminal processing is currently extremely basic and doesn't cope with this at all. I need either to find a way to suppress this behavior in WS2019 (no luck so far) or to expand the VT logic in PHD to handle this mode of operation.

I've been thinking about what to do for WS2019 guests today and have an approach in mind. Here are my notes.

The general problem is that guests have (at least) two approaches to writing to the serial console:

  1. Some guests write and echo characters and control bytes "as-is" and allow the recipient to decide how to format them for display in a terminal.
  2. WS2019's serial console command prompt instead assumes its recipient is a VT100-compatible terminal and manages all of its formatting in-guest, outputting characters and control sequences so as to manage directly what gets displayed on its (presumed) 80x24 terminal target.

PHD's serial console logic assumes case (1): regular characters and LF bytes are written into a contiguous back buffer, and other command bytes and sequences are ignored. This makes it relatively straightforward to support the "wait for message on the serial console" operation that tests generally want to perform. In this operation:

  • The framework searches for the first appearance of the target string in the back buffer
  • Everything before the target is returned to the caller
  • Everything after the target is preserved
  • The target itself is consumed and not returned anywhere

This makes it relatively easy to run commands and wait for their outputs. First, the test waits for the command to be echoed, which consumes everything up to and including the command. Then it waits for the shell prompt to be echoed. The framework consumes the prompt and returns everything that was written before it; that return value can be interpreted as the command's output.
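To make the semantics above concrete, here is a minimal sketch of the back-buffer search. `BackBuffer`, `push_bytes`, and `wait_for` are hypothetical names for illustration, not PHD's actual types, and the real operation waits for more input instead of returning `None` right away.

```rust
/// Hypothetical stand-in for the contiguous back buffer used with
/// case (1) guests. Illustrative only; not PHD's actual code.
#[derive(Default)]
pub struct BackBuffer {
    contents: String,
}

impl BackBuffer {
    /// Appends printable ASCII and LF bytes and drops other control bytes.
    /// (Simplification: the printable tail of an escape sequence is not
    /// filtered out here.)
    pub fn push_bytes(&mut self, bytes: &[u8]) {
        for &b in bytes {
            if b == b'\n' || (0x20..0x7f).contains(&b) {
                self.contents.push(b as char);
            }
        }
    }

    /// Searches for the first appearance of `target`. On a match, returns
    /// everything before it, consumes the target itself, and preserves
    /// everything after it. Returns `None` if the target hasn't appeared.
    pub fn wait_for(&mut self, target: &str) -> Option<String> {
        let idx = self.contents.find(target)?;
        // Everything after the target stays in the buffer.
        let rest = self.contents.split_off(idx + target.len());
        let mut before = std::mem::replace(&mut self.contents, rest);
        // Drop the target from the returned prefix.
        before.truncate(idx);
        Some(before)
    }
}
```

Running a command then amounts to two calls: `wait_for` on the echoed command (discarding the result), followed by `wait_for` on the prompt, whose return value is the command's output.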

Case (2) guests are much more complicated to deal with. Server 2019 in particular exhibits the following challenges:

  1. The guest doesn't reliably echo spaces (ASCII 0x20) if it thinks the target terminal cell already contains one. Instead it just uses cursor positioning commands (Esc [ <v> ; <h> H sequences) to skip the blank cells.
  2. The guest doesn't output CR or LF characters; instead, if it needs to break a line, it just moves the cursor to the start of the next row.
  3. The guest handles scrollback itself by using cursor positioning commands and line clear commands (Esc [ <Ps> K) to redraw the entire terminal when it needs to scroll.

This makes it really difficult to reason about what should go into the contiguous back buffer: is the cursor moving to the next line because the guest is emulating wrapping, or is this a real newline? Is this character new output or is the guest just redrawing the terminal with existing data moved around?
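For illustration, here is a rough sketch of the smallest virtual terminal that can follow these sequences. It interprets only cursor positioning (`H`) and line clearing (`K`) and ignores wrapping, scrolling, and every other control sequence. `Screen` and its methods are hypothetical; a real implementation would more likely lean on an existing terminal-emulation crate.

```rust
const ROWS: usize = 24;
const COLS: usize = 80;

/// Hypothetical 80x24 character grid. Illustrative only; not PHD's code.
pub struct Screen {
    cells: Vec<Vec<char>>,
    row: usize,
    col: usize,
}

impl Screen {
    pub fn new() -> Self {
        Screen { cells: vec![vec![' '; COLS]; ROWS], row: 0, col: 0 }
    }

    /// Feeds raw serial bytes to the grid.
    pub fn feed(&mut self, bytes: &[u8]) {
        let mut i = 0;
        while i < bytes.len() {
            if bytes[i] == 0x1b && bytes.get(i + 1) == Some(&b'[') {
                // Esc [ <params> <final>: scan to the final byte (0x40..=0x7e).
                let start = i + 2;
                let mut end = start;
                while end < bytes.len() && !(0x40..=0x7e).contains(&bytes[end]) {
                    end += 1;
                }
                if end == bytes.len() {
                    break; // Incomplete sequence; this sketch just drops it.
                }
                let params: Vec<usize> = bytes[start..end]
                    .split(|&b| b == b';')
                    .map(|p| {
                        std::str::from_utf8(p)
                            .ok()
                            .and_then(|s| s.parse::<usize>().ok())
                            .unwrap_or(0)
                    })
                    .collect();
                self.csi(bytes[end], &params);
                i = end + 1;
            } else {
                if (0x20..0x7f).contains(&bytes[i]) {
                    self.put(bytes[i] as char);
                }
                i += 1;
            }
        }
    }

    fn csi(&mut self, final_byte: u8, params: &[usize]) {
        let p = |n: usize| params.get(n).copied().unwrap_or(0);
        match final_byte {
            // Cursor position: 1-based <v> ; <h>, where 0 or missing means 1.
            b'H' => {
                self.row = p(0).max(1).min(ROWS) - 1;
                self.col = p(1).max(1).min(COLS) - 1;
            }
            // Line clear: 0 = cursor to end, 1 = start to cursor, 2 = all.
            b'K' => {
                let (from, to) = match p(0) {
                    0 => (self.col, COLS),
                    1 => (0, (self.col + 1).min(COLS)),
                    _ => (0, COLS),
                };
                for c in &mut self.cells[self.row][from..to] {
                    *c = ' ';
                }
            }
            _ => {} // Everything else is ignored in this sketch.
        }
    }

    fn put(&mut self, c: char) {
        // No wrapping: characters past the last column are dropped.
        if self.col < COLS {
            self.cells[self.row][self.col] = c;
            self.col += 1;
        }
    }

    /// Returns one row with trailing blanks trimmed.
    pub fn row_text(&self, row: usize) -> String {
        self.cells[row].iter().collect::<String>().trim_end().to_string()
    }
}
```

Feeding it `hello`, then a cursor move to row 1, column 7, then `world` leaves `hello world` on the first row even though no space byte was ever sent. That is exactly the information a contiguous back buffer loses.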

The simplest way to address this problem would be to convince WS2019 to act more like a case (1) guest, but if this is possible I haven't yet figured out how to do it.

In lieu of a guest-based solution, I think the simplest approach to case (2) guests is to do the following:

  • Don't have a back buffer at all; instead, the only buffer is what's on the current virtual terminal
  • When issuing commands to the guest, prepend them with a shell command that clears the screen (clear for Cygwin, cls for a native command prompt), e.g. clear && nproc

In this way, searches for echoed commands can still resolve (as long as they fit on one screen, anyway). After a command is echoed and PHD sends the final LF to execute it, the screen is cleared, the command output is written to (0, 0), and the last thing on the screen is a prompt. If this output is searched for a prompt with the method above, it will yield the correct results (modulo some massaging of trailing spaces in a row of characters that I'll need to figure out).
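A rough sketch of both halves of this idea follows. `Shell`, `with_clear`, and `output_before_prompt` are hypothetical names, and trimming trailing spaces from every row and from the prompt is just one possible answer to the trailing-space massaging mentioned above, not a settled design.

```rust
/// Hypothetical shell flavors for a Windows guest.
pub enum Shell {
    Cygwin,
    Cmd,
}

/// Prefixes a command with the shell's screen-clearing command so that the
/// command's output starts at the top of an otherwise empty screen.
pub fn with_clear(shell: Shell, command: &str) -> String {
    let clear = match shell {
        Shell::Cygwin => "clear",
        Shell::Cmd => "cls",
    };
    format!("{clear} && {command}")
}

/// Treats the rows of the current screen as the only buffer: joins them
/// with newlines (trailing blanks trimmed), searches for the prompt, and
/// returns everything before it as the command's output.
pub fn output_before_prompt(rows: &[&str], prompt: &str) -> Option<String> {
    let screen = rows
        .iter()
        .map(|r| r.trim_end())
        .collect::<Vec<_>>()
        .join("\n");
    let idx = screen.find(prompt.trim_end())?;
    Some(screen[..idx].to_string())
}
```

Like the back-buffer search, this misfires if the command's own output contains the prompt string, so the prompt needs to be reasonably distinctive.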

The main difficulty with this approach is that it doesn't capture output that the guest prints but then immediately scrolls off the screen. This can happen because serial console updates are currently processed asynchronously--i.e., the buffer is updated as soon as something comes in on the websocket, regardless of what the main test thread is doing. We can change this, but in this case I don't think we have to: WS2019, at least, won't even bother writing text to the serial console if it scrolls off "too quickly"--it will just draw the final state of the terminal. So, if a command's output exceeds 24 lines (minus a new command prompt), there's no guarantee it'll show up on the console anyway.

One last wrinkle here: this only applies to WS2019's serial console command prompt. The regular SAC channel has case (1) semantics. Fortunately, we know when we can cut over from one to the other, because (a) there's a specific point in the logon sequence where this happens, and (b) cmd clears the entire terminal when it starts up.

Building all this is complicated somewhat by the brittle and confusing architecture of the PHD serial console code (who's this @gjcolombo person who wrote it, anyway?). So I think the plan here should be:

  • Refactor this code to allow different buffering modes to be plugged into a VM's serial console (probably some kind of trait-based abstraction)
  • Test with existing case (1) guests to make sure the refactored code works
  • Decide how to handle case (2) guests (can probably use a crate like vt100 or termwiz to do a lot of the heavy lifting)
  • Wire up a WS2019 guest adapter and see what happens
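As a starting point for the first bullet, the pluggable-buffer idea might look something like the sketch below. Every name here (`ConsoleBuffer`, `RawBuffer`, `SerialConsole`, `replace_buffer`) is an assumption for illustration, not an existing PHD type. The search returns immediately instead of waiting for more input, and the screen-backed implementor for case (2) guests is left out.

```rust
/// Hypothetical trait for a buffering mode. The name and methods are
/// assumptions for illustration, not PHD's API.
pub trait ConsoleBuffer {
    /// Consumes bytes as they arrive from the guest's serial console.
    fn process_bytes(&mut self, bytes: &[u8]);
    /// Returns what precedes `target` and consumes through it, if present.
    fn take_until(&mut self, target: &str) -> Option<String>;
}

/// Case (1) mode: a contiguous back buffer of printable characters and LFs.
#[derive(Default)]
pub struct RawBuffer {
    contents: String,
}

impl ConsoleBuffer for RawBuffer {
    fn process_bytes(&mut self, bytes: &[u8]) {
        self.contents.extend(
            bytes
                .iter()
                .filter(|&&b| b == b'\n' || (0x20..0x7f).contains(&b))
                .map(|&b| b as char),
        );
    }

    fn take_until(&mut self, target: &str) -> Option<String> {
        let idx = self.contents.find(target)?;
        let before = self.contents[..idx].to_string();
        // Remove the returned prefix and the target; keep the remainder.
        self.contents.drain(..idx + target.len());
        Some(before)
    }
}

/// A console that delegates to whichever buffering mode is plugged in.
pub struct SerialConsole {
    buffer: Box<dyn ConsoleBuffer>,
}

impl SerialConsole {
    pub fn new(buffer: Box<dyn ConsoleBuffer>) -> Self {
        Self { buffer }
    }

    /// Swaps the buffering mode (for example, at the cutover from the SAC
    /// channel to the command prompt) and returns the old buffer.
    pub fn replace_buffer(
        &mut self,
        buffer: Box<dyn ConsoleBuffer>,
    ) -> Box<dyn ConsoleBuffer> {
        std::mem::replace(&mut self.buffer, buffer)
    }

    pub fn on_bytes(&mut self, bytes: &[u8]) {
        self.buffer.process_bytes(bytes);
    }

    pub fn wait_for(&mut self, target: &str) -> Option<String> {
        self.buffer.take_until(target)
    }
}
```

With this shape, a guest adapter could start in the raw mode for the SAC channel and call `replace_buffer` with a screen-backed implementor at the point in the logon sequence where the command prompt takes over.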