RAS counter information is inadequate

Question

eero-t opened this issue 2 years ago · 1 comments

While the current RAS counter information can be useful for driver developers, IMHO it does not really suffice for managing a cluster of devices.

This is because:

There's no definition of what "UNCORRECTABLE" means: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-type-t
Error information is too HW specific. It does not say whether FW / kernel were able to mitigate the issue, i.e. whether higher level mitigations are needed or not

I think RAS counters should provide the following information:

A) What impact/severity the issue has, i.e. what mitigations are required:

No impact, HW/FW/kernel fixed it transparently => monitor whether HW replacement needs to be scheduled
Current workload context lost, but functionality recovered (single engine reset)
All workload contexts on given engine type lost, but functionality recovered (engine type reset)
All workload contexts on whole device / subdevice lost, but functionality recovered (e.g. by bus reset)
Device needs to be rebooted => disable workload scheduling to device, alert admin
HW (e.g. memory) needs to be replaced before device is operable again => disable workload scheduling to device, alert admin
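
As an illustration only, the impact levels in list A could be modelled as an ordered enumeration that a cluster manager maps to a mitigation. All names below (`Impact`, `impact_action`, the action strings) are hypothetical and not part of the Level Zero API; the action for the three "functionality recovered" levels is an assumption, since the list above does not state one.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Hypothetical impact levels from list A, ordered by increasing severity."""
    NONE = 0                  # HW/FW/kernel fixed it transparently
    CONTEXT_LOST = 1          # single engine reset
    ENGINE_TYPE_LOST = 2      # engine type reset
    DEVICE_LOST = 3           # whole device / subdevice reset (e.g. bus reset)
    NEEDS_REBOOT = 4          # device must be rebooted
    NEEDS_HW_REPLACEMENT = 5  # HW (e.g. memory) must be replaced


def impact_action(impact: Impact) -> str:
    """Map an impact level to the mitigation a cluster manager would take."""
    if impact == Impact.NONE:
        return "monitor for HW replacement"
    if impact <= Impact.DEVICE_LOST:
        # Assumption: the list only says functionality recovered here;
        # restarting the lost workloads is an illustrative choice.
        return "restart lost workloads"
    # Reboot or HW replacement needed: device is unusable until an admin acts.
    return "disable scheduling to device, alert admin"
```

Using an ordered enumeration means a manager can compare levels directly, for example treating anything above `Impact.DEVICE_LOST` as requiring admin intervention.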

B) What caused the issue, i.e. what mitigations are required:

Workload itself (e.g. it programs the HW incorrectly) => if recurring, ban workload, and alert admin to update workload
Issue in how FW/kernel uses HW => if recurring, disable all relevant devices in the given node and alert admin to update FW/kernel
Issue in HW => if recurring, disable the given device and alert admin to replace HW
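
A minimal sketch of how the cause categories in list B might drive policy, assuming the counters exposed a cause per error. The names `Cause` and `cause_action` are hypothetical, and the recurrence threshold of 3 is an arbitrary assumption, since "recurring" is not quantified above.

```python
from enum import Enum


class Cause(Enum):
    """Hypothetical cause categories from list B."""
    WORKLOAD = "workload"    # workload itself, e.g. programs HW incorrectly
    FW_KERNEL = "fw_kernel"  # issue in how FW/kernel uses HW
    HARDWARE = "hardware"    # issue in HW


# Assumption: how many occurrences count as "recurring".
RECURRENCE_THRESHOLD = 3

_ACTIONS = {
    Cause.WORKLOAD: "ban workload, alert admin to update workload",
    Cause.FW_KERNEL: "disable relevant devices in node, alert admin to update FW/kernel",
    Cause.HARDWARE: "disable device, alert admin to replace HW",
}


def cause_action(cause: Cause, occurrences: int,
                 threshold: int = RECURRENCE_THRESHOLD) -> str:
    """Return the mitigation for a cause once it is considered recurring."""
    if occurrences < threshold:
        return "keep monitoring"
    return _ACTIONS[cause]
```

Keeping the threshold as a parameter lets each cluster choose how quickly a repeated error escalates.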

C) Counters for kernel level device usage issues:

Mainly an OOM counter, i.e. how many times workloads had to be killed because the device ran out of memory (running low on memory can also cause extreme slowdowns due to extra paging)
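
For illustration, a monitor consuming such an OOM counter would typically look at the change between two polls. The sketch below assumes a hypothetical counter that only increases and restarts from zero when the device is reset; neither property is defined by the current spec, and `new_oom_kills` is an invented helper name.

```python
def new_oom_kills(previous: int, current: int) -> int:
    """Return how many OOM kills happened between two counter readings.

    Assumption: the counter is monotonic and restarts from zero on device
    reset, so a lower reading means everything in `current` is new.
    """
    if current < previous:
        return current
    return current - previous
```

A scheduler could then stop placing memory-heavy workloads on a device whose delta stays above zero across several polls.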

Answer 1 · 2023-02-28T10:34:35.000Z

Moved this to the spec project instead.