oxidecomputer/propolis

WS2016 guest panics Propolis after disabling/re-enabling network device from the guest


Repro steps:

  • Obtain a Windows Server 2016 image with netkvm.sys from build 217 of the Fedora virtio drivers. (Note that the latest driver ISO is now virtio-win-0.1.240.iso, not 0.1.217; I will retest with the newer guest driver shortly.)
  • Boot the guest in a Propolis server.
  • Observe that there's no network connectivity. According to Device Manager (devmgmt.msc), the network driver is loaded correctly, but DHCP doesn't work properly. Running snoop on the relevant VNIC shows some DHCP activity but no IP address is ever assigned.
  • Run the network troubleshooter for this connection via Control Panel.

Expected: The guest network interface acquires an IP address and has connectivity with no additional work required.

Observed: Propolis panics with the following backtrace upon trying to run networking diagnostics:

thread 'vcpu-1' panicked at lib/propolis/src/hw/virtio/viona.rs:271:45:
not yet implemented: viona error handling
stack backtrace:
   0: rust_begin_unwind
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
   2: <propolis::hw::virtio::viona::PciVirtioViona as propolis::hw::virtio::VirtioDevice>::queue_change
   3: propolis::hw::virtio::pci::PciVirtioState::legacy_write
   4: propolis::hw::virtio::pci::<impl propolis::hw::pci::device::Device for D>::bar_rw::{{closure}}
   5: propolis::util::regmap::RegMap<ID>::process
   6: propolis::hw::pci::device::<impl propolis::hw::pci::Endpoint for D>::bar_rw
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
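
For context on the message itself: "not yet implemented: ..." is what Rust's `todo!` macro prints when it is reached, so the panic points at an unfinished error path in `queue_change` rather than a failed assertion. The sketch below is a minimal, hypothetical illustration of that pattern and of one way to avoid it by returning the error to the caller. `BackendError`, `QueueChange`, `backend_apply`, and both `queue_change_*` functions are invented for this example and are not Propolis's actual types or API. Which backend operation actually failed is not established in this thread.

```rust
/// Hypothetical stand-in for a failure reported by the host-side backend.
#[derive(Debug, PartialEq)]
struct BackendError(String);

/// Hypothetical queue state changes a guest driver can trigger.
#[derive(Debug, Clone, Copy, PartialEq)]
enum QueueChange {
    Reset,
    Address,
}

/// Hypothetical backend call. For illustration it fails on `Reset`,
/// which is an assumption and not a claim about the real failure.
fn backend_apply(change: QueueChange) -> Result<(), BackendError> {
    match change {
        QueueChange::Reset => Err(BackendError("ring reset failed".to_string())),
        QueueChange::Address => Ok(()),
    }
}

/// Panicking variant: any backend error reaches `todo!`, which panics
/// with "not yet implemented: viona error handling".
fn queue_change_panicking(change: QueueChange) {
    if backend_apply(change).is_err() {
        todo!("viona error handling");
    }
}

/// Tolerant variant: the error is handed back to the caller, which can
/// decide how to react instead of taking down the whole process.
fn queue_change_checked(change: QueueChange) -> Result<(), BackendError> {
    backend_apply(change)
}
```

Calling `queue_change_panicking(QueueChange::Reset)` unwinds the calling thread, while `queue_change_checked(QueueChange::Reset)` returns an `Err` the caller can log or act on.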

This scenario appears to work correctly with Windows Server 2022 using the appropriate driver from the same driver ISO (i.e. the 217 build, not the new 240 build).

This also happens with the 0.1.240 driver ISO.

Some data points to gather:

  • See what happens with Server 2019
  • See if Server 2022 reproduces this behavior if I also subject it to the "diagnose my connectivity" button
    • One possibility is that this happens if the guest driver is unloaded/reloaded; try this manually

The Server 2022 image that was working for me last week now has no connectivity either. That suggests operator error, possibly in the way I've set up the VNIC. I'll dig into that and then circle back to the Propolis panic.

Confirmed that WS2016's network driver works correctly when I configure the VM to use the correct VNIC. (There is a separate issue with the setup script I'm testing that I'll need to investigate.)

Leaving this open until I've rechecked the "run diagnostics"/"restart the driver" paths.

This is reproducible by disabling and re-enabling the device in Device Manager (no need to go through the entire troubleshooting flow). On Server 2019 and 2022 there are Disable-PnpDevice and Enable-PnpDevice PowerShell cmdlets that make this a little easier to do, provided you have serial console access.

With the proposed fix in place, I'm able to go through a disable/enable cycle without a Propolis panic (and with a still-working device afterwards):

PS C:\Windows\system32> get-pnpdevice -class net | Select-Object -property instanceid

InstanceId
----------
ROOT\KDNIC\0000
PCI\VEN_8086&DEV_10D3&SUBSYS_00008086&REV_00\3&13C0B0C5&0&18
PCI\VEN_1AF4&DEV_1000&SUBSYS_00011AF4&REV_00\3&267A616A&0&40


PS C:\Windows\system32> disable-pnpdevice -instanceid 'PCI\VEN_1AF4&DEV_1000&SUBSYS_00011AF4&REV_00\3&267A616A&0&40'

Confirm
Are you sure you want to perform this action?
Performing the operation "Disable" on target "Win32_PnPEntity: Red Hat VirtIO Ethernet Adapter (DeviceID = "PCI\VEN_1AF4&DEV_1000&SUBSYS_00011AF4&R...)".      
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): y
PS C:\Windows\system32> enable-pnpdevice -instanceid 'PCI\VEN_1AF4&DEV_1000&SUBSYS_00011AF4&REV_00\3&267A616A&0&40'

Confirm
Are you sure you want to perform this action?
Performing the operation "Enable" on target "Win32_PnPEntity: Red Hat VirtIO Ethernet Adapter (DeviceID = "PCI\VEN_1AF4&DEV_1000&SUBSYS_00011AF4&R...)".       
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): y
PS C:\Windows\system32>