openbmc/phosphor-host-ipmid

Hard power off sends incorrect state transition request

jk-ozlabs opened this issue · 9 comments

In the handler for chassis control commands, a power down request will cause phosphor-host-ipmid to set the host state to Off:

rc = initiate_state_transition(State::Host::Transition::Off);

However, the (hard) power down request (i.e., Request Data byte 1 == 0) should trigger an immediate power down; this is implemented by requesting a chassis state transition to Off. This is specified in https://github.com/openbmc/docs/blob/master/designs/state-management-and-external-interfaces.md#proposed-design .
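
As a rough illustration of the mapping being asked for, here is a minimal sketch that dispatches on the request data byte. `Target`, `TransitionRequest`, and `mapChassisControl` are hypothetical stand-ins, not the phosphor-host-ipmid or phosphor-dbus-interfaces API, and the soft shutdown value (0x5) mapping to a host Off transition is an assumption. Other request values are deliberately left unmapped.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical stand-ins for the real D-Bus state objects and transitions.
enum class Target
{
    Host,
    Chassis
};

struct TransitionRequest
{
    Target object;
    std::string_view transition;
};

// Chassis Control request data values used in this sketch.
constexpr std::uint8_t powerDown = 0x00; // hard power down
constexpr std::uint8_t softOff = 0x05;   // soft shutdown (assumed mapping)

// Map a Chassis Control request byte to the transition to request.
// Only the two values handled here are mapped; everything else
// returns an empty optional.
std::optional<TransitionRequest> mapChassisControl(std::uint8_t data)
{
    switch (data)
    {
        case powerDown:
            // Hard power down: ask the chassis object to go Off.
            return TransitionRequest{Target::Chassis, "Off"};
        case softOff:
            // Soft off: ask the host object to go Off.
            return TransitionRequest{Target::Host, "Off"};
        default:
            return std::nullopt;
    }
}
```

The point of the sketch is that the hard power down case selects the chassis object rather than the host object; everything downstream of that choice would stay in the state manager.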

I see that there seems to be out-of-band signalling using an external file, plus interaction with a soft-off service. Can we reduce the complexity there, by having ipmid do the right thing with the state object, and have those objects implement the soft-off procedure internally?

I see that there seems to be out-of-band signalling using an external file, plus interaction with a soft-off service. Can we reduce the complexity there, by having ipmid do the right thing with the state object, and have those objects implement the soft-off procedure internally?

The main issue I've had every time I try to clean this path up is that we over-utilized the chassis off command. The host uses it in-band to notify the BMC of a power off (i.e. the host got a shutdown command, did all of its shutting down, and now just wants the BMC to power off). The command can also come in out-of-band, and in that case, the BMC must send a message up to the host, wait for the host to complete the power off, and then receive that same chassis off message in-band (this is the checking for the soft off file and such). The dual use of this command is what causes a lot of our complexity. Ideally the host would have had a different IPMI command to indicate "I'm done with my shutdown".

For PLDM, we have thankfully defined two separate mechanisms for this, which simplifies things.

The main issue I've had every time I try to clean this path up is that we over-utilized the chassis off command. The host uses it in-band to notify the BMC of a power off (i.e. the host got a shutdown command, did all of its shutting down, and now just wants the BMC to power off). The command can also come in out-of-band, and in that case, the BMC must send a message up to the host

That's not correct though: the out-of-band implementation should do exactly the same thing and cut power immediately. There is no message to the host involved in that path.

Only the soft-off (0x5) needs to send a message to the host.

Ideally the host would have had a different IPMI command to indicate "I'm done with my shutdown".

That's just "cut power to the host"; that's all the BMC needs to do. There is no need for a separate command.

The issue is we tied the host behavior, the 30 seconds to acknowledge, and the 2 hours to send the chassis power off to a service that is attached to this chassis off command. In hindsight, a service that sits off to the side and monitors D-Bus signals probably would have been better, but the two pieces (the "did the host send the special IPMI message acknowledging the soft power off" check and the 2 hour timeout) are intertwined with that chassis off.

I had a commit somewhere I thought where I took a stab at this, can't find it though. But the next issue I ran into was when we added support for warm reboots. In that case, a graceful warm reboot of the host uses this same path, but we cannot power off the host, i.e. there is no separate mechanism for the host to just say "I've shut down" via IPMI. We've overloaded that chassis off IPMI command to indicate both a request to power off and an "I've shut down". Because of our intertwining of that chassis command and the soft off application, I was able to make it all work with what we had.

Hi Andrew, thanks for the detail. A couple of responses:

to send the chassis power off to a service that is attached to this chassis off command.

Not sure I understand this sentence.

The Chassis Control with data = 0 shouldn't need any timeout, or have any interaction with the host. Just a synchronous removal of power from the chassis.

[There's a potential interaction to cancel a pending timeout which may have been started as part of a soft-off process, but in hindsight I think that timeout is not useful anyway; that's a separate discussion]

I had a commit somewhere I thought where I took a stab at this, can't find it though. But the next issue I ran into was when we added support for warm reboots. In that case, a graceful warm reboot of the host uses this same path, but we cannot power off the host.

The warm reboot should never hit this path either; the host will not issue a hard power-down (data = 0) in the warm reboot process.

The warm reboot implementation should be a single message from BMC to host (we do this via the SEL event), to indicate that the OS should reboot. It's entirely up to the host to implement the reboot itself. It's probably going to end up (after the OS has quiesced) with host firmware sending a hard reset command (data = 3), but that's entirely up to host firmware (e.g., it may implement the reboot via other mechanisms, like a kexec).

Because of our intertwining of that chassis command and the soft off application, I was able to make it all work with what we had.

Right, and that sounds overcomplicated. If we need a timeout on the soft-off, then the soft-off application only needs to monitor for a timeout of a host-off, and "upgrade" it to a chassis-off if necessary. But, I believe that the timeout doesn't serve any purpose anyway, in which case we would not need a soft-off application, and the whole thing could be stateless.
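
To show how small that monitor could be, here is a minimal sketch of the "upgrade" decision. `SoftOffAction` and `checkSoftOff` are hypothetical names, and the timeout value is left to the caller; this is not how the existing soft-off application is structured.

```cpp
#include <chrono>

// Hypothetical result of one check by a soft-off monitor.
enum class SoftOffAction
{
    Nothing,          // host is already off, no escalation needed
    KeepWaiting,      // host-off still in progress, within the timeout
    RequestChassisOff // timeout expired, "upgrade" to a chassis-off
};

// Decide what to do for a pending host-off request, given whether the
// host has reached Off and how long we have been waiting.
SoftOffAction checkSoftOff(bool hostIsOff, std::chrono::seconds elapsed,
                           std::chrono::seconds timeout)
{
    if (hostIsOff)
    {
        return SoftOffAction::Nothing;
    }
    if (elapsed >= timeout)
    {
        return SoftOffAction::RequestChassisOff;
    }
    return SoftOffAction::KeepWaiting;
}
```

The function keeps no state of its own; the only inputs are the observed host state and a clock, which is what would let the IPMI handler itself stay stateless.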

The warm reboot implementation should be a single message from BMC to host (we do this via the SEL event), to indicate that the OS should reboot. It's entirely up to the host to implement the reboot itself. It's probably going to end up (after the OS has quiesced) with host firmware sending a hard reset command (data = 3), but that's entirely up to host firmware (e.g., it may implement the reboot via other mechanisms, like a kexec).

A warm reboot via the /redfish/v1/Systems/system/Actions/ComputerSystem.Reset interface is not just a reboot of the OS, it's a reboot of the whole system. So the BMC has to coordinate a shutdown of the host, and then trigger a reboot of the BIOS (all while keeping chassis power on). Is there another IPMI command which could be used for this situation? Something to tell the host to shut down and notify the BMC when it is done (outside of the chassis off command)?

Hi Andrew,

A warm reboot via the /redfish/v1/Systems/system/Actions/ComputerSystem.Reset interface is not just a reboot of the OS, it's a reboot of the whole system.

There's no specific "warm reboot" defined in Redfish; are you referring to ComputerSystem.Reset with ResetType == GracefulRestart?

None of the above assumes that we're only rebooting the OS. My example above will re-enter firmware too.

So the BMC has to coordinate a shutdown of the host, and then trigger a reboot of the BIOS (all while keeping chassis power on).

The Redfish spec doesn't define whether the power is kept on. While we can try and add specific definitions here, there will need to be some scope for platform-dependent behaviour.

The BMC doesn't necessarily have to coordinate much here. For example, on OpenPOWER implementations, a graceful reboot will cause the OS to quiesce, and end up calling into OPAL firmware, which then ends up sending an IPMI Chassis Control (requestData = 0x3) message, which is what causes the reset. The BMC hasn't needed to track state there.

There is one potential case where we will need the BMC to catch the end of a host transition: a graceful reboot, with chassis power cycle. In this case, the BMC needs to interpret the reset at the end of the host quiesce as a trigger to toggle power, rather than just entering the reset.
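
A minimal sketch of that one piece of state, assuming the BMC records the pending power cycle when the graceful reboot is requested. `RebootTracker` and `ResetAction` are hypothetical names for illustration only.

```cpp
// What the BMC does when the host asks for a reset.
enum class ResetAction
{
    ResetHost,        // default: just enter the reset
    PowerCycleChassis // toggle chassis power instead
};

// Tracks whether the next host-initiated reset should be treated as the
// end of a graceful reboot that was requested with a chassis power cycle.
class RebootTracker
{
  public:
    // Called when a graceful reboot with chassis power cycle is requested.
    void requestGracefulPowerCycle()
    {
        pendingPowerCycle = true;
    }

    // Called when the host issues its reset at the end of the quiesce.
    // The pending flag is consumed so later resets behave normally.
    ResetAction onHostReset()
    {
        if (pendingPowerCycle)
        {
            pendingPowerCycle = false;
            return ResetAction::PowerCycleChassis;
        }
        return ResetAction::ResetHost;
    }

  private:
    bool pendingPowerCycle = false;
};
```

A single boolean is all that has to survive the host quiesce in this sketch; every other reset is handled without reference to earlier requests.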

Otherwise, I can't see any case for state to be kept over a reboot.

Regardless, none of this should be applicable to the IPMI stack - ideally it should just translate the incoming chassis control command to invoke a chassis or state transition, and that's all. Otherwise we have to duplicate that logic in other places (Redfish interface, obmcutil...). Of course, those transitions need to map to the correct Chassis Control commands, which is what this bug is all about :)

Is there another IPMI command which could be used for this situation?

No, but that's OK. We only have 5 potential power control commands available over IPMI (or 6 if you include the NMI pulse), but we have 8 state transitions in OpenBMC. That's always going to mean that some of the full set of transitions cannot be triggered via IPMI.
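
For reference, the counting above can be sketched as follows. The enumerator names are mine, and the values reflect my reading of the IPMI Chassis Control command; treat both as assumptions to check against the spec. The OpenBMC transitions are not enumerated here.

```cpp
#include <array>
#include <cstdint>

// Chassis Control request data values (names are illustrative).
enum class ChassisControl : std::uint8_t
{
    PowerDown = 0x00,
    PowerUp = 0x01,
    PowerCycle = 0x02,
    HardReset = 0x03,
    PulseDiagInterrupt = 0x04, // the "NMI pulse"
    SoftShutdown = 0x05
};

const std::array<ChassisControl, 6> allCommands{
    ChassisControl::PowerDown,          ChassisControl::PowerUp,
    ChassisControl::PowerCycle,         ChassisControl::HardReset,
    ChassisControl::PulseDiagInterrupt, ChassisControl::SoftShutdown};

// The diagnostic interrupt does not change the power state, so it is
// not counted as a power control command.
bool isPowerControl(ChassisControl c)
{
    return c != ChassisControl::PulseDiagInterrupt;
}

// Count the commands that actually control power.
int countPowerControl()
{
    int n = 0;
    for (auto c : allCommands)
    {
        if (isPowerControl(c))
        {
            ++n;
        }
    }
    return n;
}
```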

I think this would be worth documenting in a separate design document - I'll work on that now, and propose a change to the docs repo.

Hey Jeremy,

It's nice to have someone else looking at this, it's always been a bit of a thorn in my side :)

The Redfish spec doesn't define whether the power is kept on. While we can try and add specific definitions here, there will need to be some scope for platform-dependent behaviour.

I did spend some time talking with John Leung from Intel. He's on the DMTF and had a nice presentation on Redfish interfaces and mapping those to BMC behavior. Based on that, Jason Bills and I wrote the below doc:

https://github.com/openbmc/docs/blob/master/designs/state-management-and-external-interfaces.md

Otherwise we have to duplicate that logic in other places (Redfish interface, obmcutil...). Of course, those transitions need to map to the correct Chassis Control commands, which is what this bug is all about :)

systemd targets help alleviate the duplicate logic concern a bit. They all just map to the same openbmc systemd targets.

Hi Andrew,

I did spend some time talking with John Leung from Intel. He's on the DMTF and had a nice presentation on Redfish interfaces and mapping those to BMC behavior. Based on that, Jason Bills and I wrote the below doc:

https://github.com/openbmc/docs/blob/master/designs/state-management-and-external-interfaces.md

Yep, that's what I referenced in the first comment - and the issue is that we're not complying with that doc in phosphor-host-ipmid: there's no Chassis::Transition::Off invoked in the CMD_POWER_OFF case.

This manifests in x86-power-control never being able to hard-power-off a machine over IPMI.

systemd targets help alleviate the duplicate logic concern a bit. They all just map to the same openbmc systemd targets.

Sure, but we're not invoking the correct transitions out of the Chassis Control commands. Maybe there's some phosphor-state-manager-specific code to convert that Host::Transition::Off (when called as a hard-power-off over IPMI) into a Chassis::Transition::Off in an external service, but that seems like a strange way to implement something that could be direct.

Hey Jeremy, sorry for the lack of response here. I see your point: on our systems a host power off (with no soft off code running) is basically the same as a chassis off. We try to stop the OCC and processor instructions, but that's about it. On other systems, though, there may be more behind the host power off, which does not match the expected result of a hard chassis power off IPMI command.

The issue I don't know how to resolve is that on a warm reboot (keeping chassis power on), we need an indication from the host that it is done with its shutdown, so that the BMC can do whatever it needs to do to clean things up and restart the boot. Currently the only thing I know of is this CMD_POWER_OFF as an indication from the host that it is done shutting down.