openbmc/phosphor-host-ipmid

Very intermittently getting "Corrupted nv channel access file"

geissonator opened this issue · 3 comments

We at IBM have seen this intermittently over the years. We've seen it on our older witherspoon and mowgli systems (AST2500) and also on our new p10bmc machines (AST2600). It's very intermittent though.

The first symptom you see is this in the journal:

Sep 14 20:14:37 mowgli ipmid[430]: terminate called after throwing an instance of 'std::runtime_error'
Sep 14 20:14:37 mowgli ipmid[430]:   what():  Corrupted nv channel access file
Sep 14 20:14:38 mowgli systemd[1]: phosphor-ipmi-host.service: Main process exited, code=killed, status=6/ABRT
Sep 14 20:14:38 mowgli systemd[1]: phosphor-ipmi-host.service: Failed with result 'signal'.

When you look at the file in question, /var/lib/ipmi/channel_access_nv.json, it's 0 in size:

--w-------    1 root     root             0 Aug 25 14:51 /var/lib/ipmi/channel_access_nv.json

I'm not sure how this file could end up being 0 size, but a simple workaround would be to just remove the file in the error path, https://github.com/openbmc/phosphor-host-ipmid/blob/master/user_channel/channel_mgmt.cpp#L1146. That way, when ipmi restarts, it will just re-init the files. Thoughts? I can throw up a quick patch if it makes sense.

Unless someone can pinpoint the bug causing the intermittent file size 0 issue, I think our best bet at this point is to at least gracefully recover from the error.

So either we should add an "else if" at https://github.com/openbmc/phosphor-host-ipmid/blob/master/user_channel/channel_mgmt.cpp#L1111 that confirms the returned "data" is non-zero in size (and deletes the file and returns -EIO if it is invalid), or we should add code in the exception clauses to delete the invalid file. It may be best to do both.

In summary, if the file is 0 in size or throws an exception during parsing, delete the file and throw the exception.
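
To make the proposal concrete, here is a rough sketch of the "empty or unparsable means delete" check. It is not the actual channel_mgmt.cpp code: `loadOrDiscard` is a hypothetical helper, and `parse` is a stand-in callback for whatever JSON parsing the real code does (assumed to throw on bad input). For simplicity it returns -EIO instead of rethrowing, and the -ENOENT return for a missing file is likewise my own choice for the sketch.

```cpp
#include <cerrno>
#include <exception>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not an existing phosphor-host-ipmid function).
// `parse` stands in for the real JSON parser and is expected to throw on
// malformed input. Returns 0 if the file parsed, -ENOENT if it is missing,
// and -EIO if it was empty or unparsable (in which case it is removed).
int loadOrDiscard(const fs::path& file,
                  const std::function<void(const std::string&)>& parse)
{
    std::error_code ec;
    if (!fs::exists(file, ec))
    {
        return -ENOENT; // nothing to discard; the caller re-creates defaults
    }

    const auto size = fs::file_size(file, ec);
    bool valid = !ec && size > 0; // a 0-byte file is treated as corrupted
    if (valid)
    {
        std::ifstream in(file, std::ios::binary);
        const std::string text((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
        try
        {
            parse(text);
        }
        catch (const std::exception&)
        {
            valid = false; // parser rejected the contents
        }
    }

    if (!valid)
    {
        fs::remove(file, ec); // best effort; removal errors are ignored
        return -EIO;
    }
    return 0;
}
```

The point of removing the file before reporting the error is that the next ipmid start finds no file and re-inits it, instead of hitting the same corrupted file and aborting again.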

Testing is simple: load your code change, create an empty file, and restart ipmid to ensure it recovers.

rm /var/lib/ipmi/channel_access_nv.json
touch /var/lib/ipmi/channel_access_nv.json
systemctl restart phosphor-ipmi-host.service

@geissonator May I know what physical storage you are using for the filesystem? Flash part or eMMC? TIA


We've seen this on both AST2500 (NOR chip) and AST2600 (eMMC). It recently resurfaced in our latest release on an AST2600.