RAS counter information is inadequate
eero-t opened this issue · 1 comments
eero-t commented
While the current RAS counter information can be useful for driver developers, IMHO it does not really suffice for managing a cluster of devices.
This is because:
- There's no definition of what "UNCORRECTABLE" means: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-type-t
- Error information is too HW specific. It does not say whether FW / kernel were able to mitigate the issue, i.e. whether higher-level mitigations are needed or not
I think RAS counters should provide the following information...
A) What impact/fatality the issue has, i.e. what mitigations are required:
- No impact, HW/FW/kernel fixed it transparently => monitor whether HW replacement needs to be scheduled
- Current workload context lost, but functionality recovered (single engine reset)
- All workload contexts on given engine type lost, but functionality recovered (engine type reset)
- All workload contexts on whole device / subdevice lost, but functionality recovered (e.g. by bus reset)
- Device needs to be rebooted => disable workload scheduling to device, alert admin
- HW (e.g. memory) needs to be replaced before device is operable again => disable workload scheduling to device, alert admin
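
To illustrate, the impact levels in (A) could be exposed as an ordered enum that a cluster manager maps to actions. This is only a rough sketch: `Impact`, `mitigation()` and the action strings are hypothetical names, not part of Level Zero Sysman, and the "restart-affected-workloads" action for the recovered cases is my assumption (the list above only names actions for the first and the last two levels).

```python
from enum import IntEnum


class Impact(IntEnum):
    """Hypothetical impact levels, ordered from least to most severe."""
    NONE = 0                     # HW/FW/kernel fixed it transparently
    CONTEXT_LOST = 1             # single engine reset
    ENGINE_TYPE_LOST = 2         # engine type reset
    DEVICE_LOST = 3              # whole device / subdevice reset, e.g. bus reset
    REBOOT_REQUIRED = 4          # device needs to be rebooted
    HW_REPLACEMENT_REQUIRED = 5  # HW must be replaced before device is operable


def mitigation(impact: Impact) -> str:
    """Map an impact level to a cluster-level action (action names are assumptions)."""
    if impact == Impact.NONE:
        return "monitor"
    if impact <= Impact.DEVICE_LOST:
        # Functionality recovered; only the lost contexts need handling.
        return "restart-affected-workloads"
    # Device is not usable without admin intervention.
    return "disable-scheduling-and-alert-admin"
```

Because the levels are ordered, a manager could also just compare against a threshold, e.g. `impact >= Impact.REBOOT_REQUIRED`, to decide when to take a device out of scheduling.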
B) What caused the issue, i.e. what mitigations are required:
- Workload itself (e.g. it programs HW incorrectly) => if recurring, ban workload, and alert admin to update workload
- Issue in how FW/kernel uses HW => if recurring, disable all relevant devices in given node and alert admin to update FW/kernel
- Issue in HW => if recurring, disable given device and alert admin to replace HW
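
As a sketch of how the cause information in (B) could be consumed, assuming errors are reported with one of three cause categories: the names `Cause`, `CauseTracker` and `ACTIONS` are hypothetical, and the recurrence threshold is an arbitrary value picked for illustration ("recurring" is not defined more precisely above).

```python
from collections import Counter
from enum import Enum


class Cause(Enum):
    """Hypothetical cause categories for a reported error."""
    WORKLOAD = "workload"    # workload itself, e.g. programs HW incorrectly
    FW_KERNEL = "fw_kernel"  # issue in how FW/kernel uses HW
    HARDWARE = "hardware"    # issue in HW


# Action to take once a cause is considered recurring.
ACTIONS = {
    Cause.WORKLOAD: "ban workload; alert admin to update workload",
    Cause.FW_KERNEL: "disable relevant devices in node; alert admin to update FW/kernel",
    Cause.HARDWARE: "disable device; alert admin to replace HW",
}


class CauseTracker:
    """Counts errors per (cause, key) and returns an action once they recur."""

    def __init__(self, threshold: int = 3):  # threshold is an assumption
        self.threshold = threshold
        self.counts = Counter()

    def record(self, cause: Cause, key: str):
        """Record one error; key is the workload, node or device it is tracked against."""
        self.counts[(cause, key)] += 1
        if self.counts[(cause, key)] >= self.threshold:
            return ACTIONS[cause]
        return None
```

The key would be a workload identifier for `Cause.WORKLOAD`, a node for `Cause.FW_KERNEL` and a device for `Cause.HARDWARE`, matching the scope of each mitigation.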
C) Counters for kernel level device usage issues:
- Mainly an OOM counter, i.e. how many times workloads had to be killed due to the device running out of memory (which can also cause extreme slowdowns due to extra paging)
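
For monitoring, such an OOM counter would typically be polled and turned into a delta per interval. A minimal sketch, assuming the counter is monotonically increasing and restarts from zero when the driver is reloaded (both are assumptions, and `oom_kills_since` is a hypothetical helper, not an existing API):

```python
def oom_kills_since(previous: int, current: int) -> int:
    """Return the number of new OOM kills between two counter readings.

    If the counter went backwards, it is assumed to have been reset
    (e.g. driver reload), so the current value is the whole delta.
    """
    if current < previous:
        return current
    return current - previous
```

A non-zero delta on a device could then be used to stop scheduling further memory-heavy workloads there.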
eero-t commented
Moved this to spec project instead.