oneapi-src/level-zero

RAS counter information is inadequate

eero-t opened this issue · 1 comment

While the current RAS counter information can be useful for driver developers, IMHO it does not really suffice for managing a cluster of devices: raw error counts alone do not tell a cluster scheduler what to do about an error.

I think RAS counters should provide the following information...

A) What impact/severity the issue has, i.e. what recovery actions are required:

  • No impact, HW/FW/kernel fixed it transparently => monitor whether HW replacement needs to be scheduled
  • Current workload context lost, but functionality recovered (single engine reset)
  • All workload contexts on given engine type lost, but functionality recovered (engine type reset)
  • All workload contexts on the whole device / sub-device lost, but functionality recovered (e.g. by bus reset)
  • Device needs to be rebooted => disable workload scheduling to device, alert admin
  • HW (e.g. memory) needs to be replaced before device is operable again => disable workload scheduling to device, alert admin

B) What caused the issue, i.e. what mitigations are required:

  • Workload itself (e.g. it programs the HW incorrectly) => if recurring, ban the workload and alert admin to update it
  • Issue in how FW/kernel uses HW => if recurring, disable all relevant devices in given node and alert admin to update FW/kernel
  • Issue in HW => if recurring, disable given device and alert admin to replace HW

C) Counters for kernel-level device usage issues:

  • Mainly an OOM counter, i.e. how many times workloads had to be killed because the device ran out of memory (which can also cause extreme slowdowns due to extra paging)

Moved this to the spec project instead.