vmware-samples/nvme

Slow boot due to improper controller activation sequence

Closed this issue · 1 comment

Invalid controller activation sequence causes slow boot

In nvme_ctrlr.c:NvmeCtrlr_HwStart(), the controller is disabled, enabled, disabled again and finally enabled for good.
The first enable appears to be a QEMU workaround, as the following comment precedes the code:

/*
* Note: on the Qemu emulator, simply write NVME_CC_ENABLE (0x1) to
* (regs + NVME_CC) is not enough to bring controller to RDY state.
* IOSQES and IOCQES has to be set to bring the controller to RDY
* state for the initial reset.
*/

This first enable sequence runs before the admin queue attributes are set up. According to the NVMe spec, enabling the controller before the admin queue attributes are set produces undefined results. On the HGST SN100 series SSDs, the result is that the controller enable is ignored. The driver detects this as a timeout, since CSTS.RDY is never set, and the SN100 series drives have a CAP.TO (Timeout) value indicating 128 seconds.

The driver will continue and the device will come up, but the 128-second timeout delays the host boot. In hosts with multiple SN100 SSDs, the boot is delayed by 128 seconds times the number of devices; a host with four drives, for example, waits roughly an extra eight and a half minutes.

I believe that copying the admin queue setup code from the next controller enable sequence to the first enable sequence will fix this.

That is, the following code should be copied and inserted right after the Qemu comment:

/* Set admin queue depth of completion and submission */
aqa = (sqInfo->qsize - 1) << NVME_AQA_SQS_LSB;
aqa |= (qinfo->qsize - 1) << NVME_AQA_CQS_LSB;

/* Set admin queue attributes */
Nvme_Writel(aqa, (regs + NVME_AQA));
Nvme_Writeq(qinfo->compqPhy, (regs + NVME_ACQ));
Nvme_Writeq(sqInfo->subqPhy, (regs + NVME_ASQ));

Thanks for reporting and sorry for the long-delayed response.
This issue has been fixed in ESXi 6.5 Update 1:
https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html