oxidecomputer/propolis

Guest reboot got stuck at unmounting /boot/efi

Closed this issue · 8 comments

I ran into this issue with an Ubuntu instance, both when rebooting via the API and with an in-guest reboot command.

From the serial console, I could see that the guest stopped making progress at this point after I ran sudo reboot:

[  OK  ] Stopped Update UTMP about System Boot/Shutdown.
[  OK  ] Stopped Network Time Synchronization.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
[  OK  ] Stopped target Mounted snaps.
         Unmounting /boot/efi...
         Unmounting /run/snapd/ns/lxd.mnt...

With the instance reboot API, it got stuck during startup instead:

[  OK  ] Started Session 1 of User ubuntu.
[  301.807462] cloud-init[1378]: loading tables: 2 for warehouse: 10
[  302.097101] cloud-init[1378]: loading tables: 2 for warehouse: 9
         Starting Download data for…led at package install time...
[  OK  ] Finished Download data for…ailed at package install time.
         Starting Time & Date Service...

I repro'ed this and saw that after a reboot, the guest firmware logged several NVMe command timeouts of the following form:

 == NVME IDENTIFY CONTROLLER DATA ==                   
    PCI VID   : 0x1DE                                  
    PCI SSVID : 0x1DE                                        
    SN        : block0                                       
    MN        :                                              
    FR        : 0x0                                          
    RAB       : 0x0                                          
    IEEE      : 0x2540A8                                     
    AERL      : 0x0                                    
    SQES      : 0x66                                           
    CQES      : 0x44                                         
    NN        : 0x1                                                    
FatOpenDevice: read of part_lba failed Time out              
 BlockSize : 512                                             
 LastBlock : 3FFFFF                                          
NvmExpressPassThru: Timeout occurs for an NVMe command.      
Cc.En: 0                                               
Cc.Css: 0                                              
Cc.Mps: 0                                                    
Cc.Ams: 0                                                    
Cc.Shn: 0                                                    
Cc.Iosqes: 6                                                 
Cc.Iocqes: 4                                                 
NVMe controller is disabled with status [Success]. 

After a while it dumped me into a firmware shell. I can repro this at 6e48b6c but not 4591a60.

I'm still grokking all this code, but for synchronous backends (like the file-based one I was using) I think there's an issue in BlockData::block_for_req:

fn block_for_req(
    &self,
    att_state: &attachment::AttachState,
) -> Option<Request> {
    loop {
        match self.next_req(att_state) {
            Ok(req) => {
                return Some(req);
            }
            Err(err) => match err {
                ReqError::NonePending | ReqError::Paused => {
                    let guard = self.lock.lock().unwrap();
                    // Double-check for attachment-related "error"
                    // conditions under protection of the lock before
                    // finally blocking.
                    if check_state(att_state).is_err() {
                        return None;
                    }
                    let _guard = self.cv.wait(guard).unwrap();
                    continue;
                }
                _ => {
                    return None;
                }
            },
        }
    }
}

This handles a paused attachment reported by next_req, but then rechecks the attachment state under the lock (the check_state call) and bails if the attachment is still paused. This bubbles back up to block::file::WorkerState::processing_loop and causes it to exit. Since this is a pause/resume and not a start/stop, the processing workers don't get restarted when the NVMe device resumes, so the disk becomes inoperable. Changing that recheck to bail only if the backend is detached or stopped seems to fix the issue.
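
For reference, a minimal sketch of that change in the NonePending/Paused arm (the Detached and Halted variant names are assumptions for illustration, not necessarily the real ReqError variants; the point is that only terminal attachment states should make the worker exit, while a paused attachment should leave it parked on the condvar):

ReqError::NonePending | ReqError::Paused => {
    let guard = self.lock.lock().unwrap();
    // Recheck under the lock, but only bail out for terminal
    // conditions; a paused attachment keeps the worker parked on
    // the condvar so a later resume can wake it back up.
    match check_state(att_state) {
        // Detached/Halted are placeholder names for "backend is
        // detached or stopped".
        Err(ReqError::Detached) | Err(ReqError::Halted) => {
            return None;
        }
        _ => {}
    }
    let _guard = self.cv.wait(guard).unwrap();
    continue;
}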

This was also seen in the async (Crucible) backend case; still investigating that one.

The async backends use the WaitForReq future, which doesn't seem to have the same issue at first glance. But if Crucible is also running into it, now I wonder if there's some timing issue with notifying on resume and getting it to poll again...
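
For context, the way such a future usually avoids a lost wakeup is by re-checking for work and registering its waker under the same lock that the resume path takes, so a notification can't slip in between the check and the registration. A generic sketch of that pattern (this is not propolis's actual WaitForReq code; all of the types and names here are illustrative):

use std::{
    future::Future,
    pin::Pin,
    sync::{Arc, Mutex},
    task::{Context, Poll, Waker},
};

// Shared between the worker's future and whoever flips the paused flag.
struct Shared {
    paused: bool,
    queue: Vec<u32>, // stand-in for pending block requests
    waker: Option<Waker>,
}

struct WaitForWork(Arc<Mutex<Shared>>);

impl Future for WaitForWork {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // The state check and the waker registration happen under one lock.
        // If resume() runs after we drop the lock, it sees the stored waker
        // and wakes us; if it runs before, we observe !paused here.
        let mut state = self.0.lock().unwrap();
        if !state.paused {
            if let Some(req) = state.queue.pop() {
                return Poll::Ready(req);
            }
        }
        state.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}

// The resume path takes the same lock and wakes any parked worker.
fn resume(shared: &Mutex<Shared>) {
    let mut state = shared.lock().unwrap();
    state.paused = false;
    if let Some(waker) = state.waker.take() {
        waker.wake();
    }
}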

The WaitForReq future looks OK to me as well. I don't seem to be able to repro this with a local VM that has only a Crucible disk attached--I thought I could, but it turned out the test case I'd written was buggy (oops).

The VM I'm testing with is still a little different from what Omicron would generate, since the latter will have a cloud-init disk, which uses a synchronous in-memory backend. I'll try to conjure something like that up in PHD and see what happens.

Ah yeah, that would track. The fact that an API-initiated reboot actually got back to Ubuntu would imply that the Crucible-backed NVMe disk worked as expected and it got hung up on the cloud-init disk. Though that doesn't quite explain getting stuck on a guest-initiated reboot...

I was able to repro the original issue by writing a PHD test that creates a VM with a Crucible boot disk and a file-backed blank virtio data disk. If the data disk is instead an NVMe disk, the test passes, but there's a lot of complaining in the serial console about NVMe command timeouts. I suspect the firmware handles virtio disks slightly differently in a way that explains this behavior, but I haven't spelunked through the FW debug logs enough to know for sure.

> The fact that an API-initiated reboot actually got back to Ubuntu would imply that the Crucible-backed NVMe disk worked as expected and it got hung up on the cloud-init disk. Though that doesn't quite explain getting stuck on a guest-initiated reboot...

I think this tracks. Guest-initiated and API-initiated reboots end up going through the same state driver operations (pause/reset/resume devices)--the only difference is who pushes the button.
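
For context, a heavily simplified sketch of that shared path (the Lifecycle trait and the reboot function here are assumptions for illustration, not propolis's actual state driver code):

use std::sync::Arc;

// Hypothetical lifecycle hooks; the real state driver is more involved,
// but the reboot path has this basic shape.
trait Lifecycle {
    fn pause(&self);
    fn reset(&self);
    fn resume(&self);
}

// Whether the request came from the external API or from the guest asking
// its virtual chipset to reset, the same sequence runs over every device
// and backend.
fn reboot(devices: &[Arc<dyn Lifecycle>]) {
    // 1. Quiesce in-flight work; block backend workers park here.
    for dev in devices {
        dev.pause();
    }
    // 2. Reset emulated device state so the guest sees fresh hardware.
    for dev in devices {
        dev.reset();
    }
    // 3. Resume. If a worker already exited during the pause (the bug
    //    above), there is nothing left to wake and the disk stays dead.
    for dev in devices {
        dev.resume();
    }
}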

> I was able to repro the original issue by writing a PHD test that creates a VM with a Crucible boot disk and a file-backed blank virtio data disk. If the data disk is instead an NVMe disk, the test passes, but there's a lot of complaining in the serial console about NVMe command timeouts. I suspect the firmware handles virtio disks slightly differently in a way that explains this behavior, but I haven't spelunked through the FW debug logs enough to know for sure.

I wonder if boot order is why? Like, it'll try reading disks until it finds one, and NVMe is tried before virtio?

> I think this tracks. Guest-initiated and API-initiated reboots end up going through the same state driver operations (pause/reset/resume devices)--the only difference is who pushes the button.

Sure, but I would've expected the guest-initiated reboot to actually get stuck at the same place

> I wonder if boot order is why? Like, it'll try reading disks until it finds one, and NVMe is tried before virtio?

Could be--I'll take a closer look with some more local VMs. My recollection (from investigating omicron#5112, which was a while ago) is that our firmware image will enumerate all disks looking for boot options before it decides what the boot order is going to be, so my guess was/is that the firmware knows how to give up on NVMe commands but not on virtio IOs. But I'll need to do some more testing to be sure.

> Sure, but I would've expected the guest-initiated reboot to actually get stuck at the same place

You're right, this doesn't make much sense. I'll take another look at this part as well (though I might wait for the dogfood rack to come back up to make it a little easier to make sure I have an accurate repro).