riscv-non-isa/riscv-ras-eri

SPEC positioning

space-mit opened this issue · 3 comments

Hi, there

I'm trying to understand the positioning of this specification.
Please take a look at ARM RAS architecture:

arm-ras

I'm confused by the description of RERI as a RAS standard for RISC-V SoCs.

Is RERI supposed to be used as the CPU RAS extension like ARM DDI0587D, which is a hardware standard, but for the CPU only?
Or is RERI supposed to be used as the RISC-V RAS firmware extension like SDEI of ARM DEN0054C?
It looks to me like RERI could also be implemented by RAS firmware using specific communication hardware, such as the PCC channels mentioned in ACPI.

Shall we have this clarified here or in the spec documentation?

Thanks in advance

Is RERI supposed to be used as the CPU RAS extension

The RAS Error Record Register Interface (RERI) specification augments Reliability, Availability, and Serviceability (RAS) features in the SoC with a standard mechanism for reporting errors by means of a memory-mapped register interface. Components in a system that support error detection (e.g., a RISC-V hart or a memory controller) may implement one or more banks of error records defined by RERI.
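To make the "banks of error records behind a memory-mapped interface" idea concrete, here is a minimal sketch of how software might scan one bank for logged errors. It is only an illustration: the header size, record size, status offset, and valid bit are placeholder assumptions, not values taken from the RERI specification (chapter 2 has the real register layout), and a `bytearray` stands in for the mapped region.

```python
import struct

# All offsets and bit positions below are illustrative assumptions,
# not taken from the RERI specification; see chapter 2 for the real layout.
HEADER_SIZE = 64      # assumed size of the bank header, in bytes
RECORD_SIZE = 64      # assumed size of one error record, in bytes
STATUS_OFFSET = 8     # assumed offset of the status register within a record
VALID_BIT = 1 << 0    # assumed "record valid" flag in the status register


def read_u64(mem, offset):
    """Read a little-endian 64-bit register from the mapped region."""
    return struct.unpack_from("<Q", mem, offset)[0]


def valid_records(mem, n_records):
    """Return the indices of records whose (assumed) valid bit is set."""
    found = []
    for i in range(n_records):
        status = read_u64(mem, HEADER_SIZE + i * RECORD_SIZE + STATUS_OFFSET)
        if status & VALID_BIT:
            found.append(i)
    return found


# Simulate a bank with four records; a bytearray stands in for MMIO.
bank = bytearray(HEADER_SIZE + 4 * RECORD_SIZE)
# Mark record 2 as holding a valid error.
struct.pack_into("<Q", bank, HEADER_SIZE + 2 * RECORD_SIZE + STATUS_OFFSET, VALID_BIT)
print(valid_records(bank, 4))  # [2]
```

The same scan could run in firmware or in the kernel; only who owns the mapping changes.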

Or is RERI supposed to be used as the RISC-V RAS firmware extension

Traditionally, there have been two methods through which RAS errors are handled: firmware-first and kernel-first. Most applications may prefer firmware-first handling of RAS errors, but some may prefer kernel-first. RERI supports either method of error handling. The RERI TG is also defining the UEFI, ACPI, and SBI extensions for RERI, and discussions on this are happening in the PRS TG. This includes:

  • UEFI extensions to the common platform error record (CPER)
  • ACPI extensions to the GHES, HEST, BERT, ERST, and EINJ tables
  • SBI extensions for supervisor software events

The software stack for firmware-first RAS handling on RISC-V is shown in slide 9 of this [overview](https://docs.google.com/presentation/d/1QI7sLrOvdOliAF86CgefZK98WoZsLwiW8DoEkvWj2Vw/edit#slide=id.p).

The RERI specification defines the RAS error banks and error records, shown as the "Error Record Register" at the bottom of the stack.

Shall we have this clarified here or in the spec documentation?

The first paragraph of chapter 1 outlines what the RERI specification provides. Section 1.5 summarizes the RERI features. Chapter 2 provides details on the register interface itself.

I hope that addressed the question.

That makes sense to me.
Thanks for the clarifications.

-Cheers