riscv-non-isa/riscv-ras-eri

RERI Improvement Suggestions

Closed this issue · 2 comments

Summary:

  1. Number of error categories. Update from 3 error categories to 8 with accommodation for up to 16. Three categories is too limited to differentiate across status, overwriting, responses, and signaling.
  2. Error signaling. Each error category is mapped to a signal at the error record and has exactly 1 signal, with merging done at the signal level. Suggested change: a) each error category can trigger a signal at the appropriate local or global level; b) each signal can connect to interrupts or external pins; c) error categories can be merged before generating the signal. A map-per-error-record paradigm forces a specific topology and a pin management challenge that is difficult for FuSa and may limit scalability for vendors and adaptability among vendors.
  3. Precise exception. Add a precise exception indication. Allow simple indication that the interrupted context is restartable after upper layer actions are taken.
  4. Error Record Invalidation. sinv is embedded with other control bits. Add an interlock/semaphore mechanism to support atomicity. Prevent error record access hazards and provide assurance that writes do not invalidate unprocessed record updates.
  5. Affected structure detail. ec provides high level information. Expand the structure identification vocabulary to enable fine-grained identification. Detailed identification of the affected structure enables a robust response capability targeting the specific structure.
  6. Error Category Summation. The current summary has a single bit per error record (static or dynamic valid). Update it to represent error categories, and possibly the highest error category, in a way that enables programmatic access to error records. RAS Agents should be able to quickly identify the significance of the error event when developing an error response. This is needed to support applications with reaction time requirements (e.g. FuSa).
  7. Error Notification Extension. Replace c and scrub bit in status register with a per category identification of key actions the RAS agent should/could take and insight into the nature of the error event that may be key to formulating this or future response. Each error event category has different “critical” information for the RAS Agent. Enable the RAS Agent to quickly identify which responses are supported by this implementation. This provides regular and easily decoded information for the RAS Agent.
  8. Information Identification Field. Add a field for each supported additional information register that indicates how the RAS Agent should decode the information. The hardware with the error record knows what is recorded and how the RAS Agent should decode it. Including the field removes the need to consider the vendor, implementation, and record when decoding each error record.

Further details in PDF provided in email.

Discussion on these items.

Number of error categories.

RERI provides a pri field that supports 4 categories each of UUE, UDE, and CE errors. A further four informational indications and an indication for no error add up to a vocabulary of 17 categories, which meets the requirement.
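The count of 17 can be checked by enumerating the vocabulary. This is only a sketch of the arithmetic: the labels `INFO0` to `INFO3` and `NONE` are hypothetical placeholders, not names from the RERI specification.

```python
from itertools import product

ERROR_CLASSES = ("UUE", "UDE", "CE")
PRIORITIES = range(4)  # the pri field supports 4 categories per class

# Hypothetical labels for the four informational indications and "no error".
INFORMATIONAL = [f"INFO{n}" for n in range(4)]
NO_ERROR = ["NONE"]

# 3 classes x 4 priorities = 12 error categories.
vocabulary = [f"{cls}/pri={pri}" for cls, pri in product(ERROR_CLASSES, PRIORITIES)]
vocabulary.extend(INFORMATIONAL)  # + 4 informational
vocabulary.extend(NO_ERROR)       # + 1 no-error

print(len(vocabulary))  # 17
```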

Error signaling.

The RERI specification includes the flexibility of mapping the signal to a variety of signaling and summarization topologies as needed. The per signal configuration bits are optional to implement (WARL). Signaling configuration is primarily static after initialization. Mapping of signals to specific topology/pins/merging is expected to be implementation specific and meets the requirement.

Precise exception

Priv 1.13 specifies a hardware error exception [3], and an error record signal may cause this precise hardware error exception. The RAS handler may then read additional information from the RAS error record to determine the nature and cause of the exception. The hardware error exception indicates the occurrence of a UUE. While the error was caused by the interrupted context encountering corrupted data during its computation, the error record includes a c (containable) bit to indicate that the error a) has not propagated beyond the boundaries of the component that detected it and b) may be containable through recovery actions (e.g., terminating the computation, restarting after correction, etc.) carried out by the RAS handler. The term “containable” is used instead of “restartable” to make it applicable to components other than the RISC-V hart, such as the IOMMU, where a “restart” is not possible but we can still signal that the error has not propagated and that a more severe action, like resetting the IOMMU and the devices in its scope or the system itself, is not required. Updating the non-normative text around UUE with c=1 and precise exceptions in the specification to further clarify these items would be useful. It would also be useful to note that the meaning of c for other error categories may be defined in future.

Error Record Invalidation

The currently defined scheme involves three steps: a) read the record, b) invalidate the record, c) read status to determine whether an overwrite occurred between steps a and b. If a second error occurs between a and b and a third error occurs between b and c, then the overwrite that occurred between a and b will not be observed.

Proposed update: status_i does not accept a SW write when v=1. Define a “Read in progress” bit, rdip, in the status register. This bit is set by HW when the v bit goes from 0 to 1 and is cleared by HW when a new error overwrites a valid error record. control_i.sinv will not clear the status_i.v bit if the status_i.rdip bit is 0. Define a control_i.srdip control bit for SW to set rdip if it is not already 1.

The SW sequence is now:

  a) Read the record, including status_i, and let S be the value of status_i.
  b) If S.rdip == 0, write 1 to control_i.srdip and go to step a.
  c) Write 1 to control_i.sinv.
  d) Let S = status_i; if S.v == 1 && S.rdip == 0, go to step a.
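The proposed rdip handshake can be sketched as a small software model. This is an illustration under stated assumptions, not the specification's behaviour: `ErrorRecord`, `hw_log_error`, `drain`, and the `payload` field are hypothetical names, record reads are treated as atomic, every new error is assumed to overwrite a valid record (priority rules are ignored), and `drain` is assumed to be called on a valid record.

```python
from dataclasses import dataclass

@dataclass
class ErrorRecord:
    """Toy model of one error record (hypothetical, for illustration only)."""
    v: int = 0        # status_i.v
    rdip: int = 0     # status_i.rdip
    payload: int = 0  # stands in for the rest of the record contents

    def hw_log_error(self, payload):
        if self.v == 0:
            self.v, self.rdip = 1, 1  # HW sets rdip when v goes 0 -> 1
        else:
            self.rdip = 0             # overwrite of a valid record clears rdip
        self.payload = payload

    def write_srdip(self):
        self.rdip = 1                 # control_i.srdip: SW sets rdip

    def write_sinv(self):
        if self.rdip == 1:            # sinv is ignored while rdip == 0
            self.v = 0

def drain(record, hooks=()):
    """Run steps a) to d); hooks simulate errors arriving between a) and c)."""
    hooks = list(hooks)
    snapshots = []
    while True:
        s_rdip, data = record.rdip, record.payload  # a) read record and status
        if s_rdip == 0:                             # b) re-arm and re-read
            record.write_srdip()
            continue
        snapshots.append(data)
        if hooks:
            hooks.pop(0)(record)                    # error lands mid-sequence
        record.write_sinv()                         # c) request invalidation
        if record.v == 1 and record.rdip == 0:      # d) overwritten: start over
            continue
        return snapshots

rec = ErrorRecord()
rec.hw_log_error(1)
print(drain(rec, [lambda r: r.hw_log_error(2)]))  # [1, 2]
```

In this model the error that arrives between the read and the invalidation clears rdip, so the sinv write has no effect and the loop re-reads the record instead of discarding the unprocessed update.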

Affected structure detail, Error Category Summation, Error Notification Extension, Information Identification Field

The RERI specification defines the info_i, suppl_info_i, and addr_i registers to provide further detailed information about the error. The formats of the info_i, suppl_info_i, timestamp_i, and addr_i registers are not standardized, as there is a lot of variety in how these fields will need to be defined to meet the needs of various hardware modules such as RISC-V harts, memory controllers, GPU PEs, accelerators, etc. Standardization of the structure of these fields may be pursued in a future segment-specific standard extension.

To provide flexibility to use the addr_i register for extended non-address information, rename the addr_i register to addr_ext_info_i and AT to AEIT, such that additional AEIT encodings may be defined by future segment-specific standard extensions to provide non-address extended information.

RERI provides a summary_valid register allowing quick identification of the record in the bank that has an error. Further summary registers are not required for most general compute platforms. Summation may be required for segment-specific platforms and is to be pursued as a future standard extension. To allow for such extensions, which may include a change to the register layout, define a 3-bit header field to indicate the layout. layout=0 implies the current layout. A future layout may define a bigger header or other changes to the register layout to support summation. Reduce the space set aside in the header/record for vendor-specific extensions to provide space for future standard extensions.
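A RAS agent could use such a layout field to decide whether it knows how to parse a bank. The following is a minimal sketch only: the discussion above fixes the field width (3 bits) and the meaning of layout=0, but not the field's position, so `LAYOUT_SHIFT` and the helper names are assumptions made for illustration.

```python
LAYOUT_SHIFT = 0      # assumed bit position; not stated in the discussion
LAYOUT_MASK = 0b111   # 3-bit layout field

# Only layout 0 (the current register layout) is defined so far.
KNOWN_LAYOUTS = {0: "current layout"}

def layout_of(header: int) -> int:
    """Extract the layout field from a raw header value."""
    return (header >> LAYOUT_SHIFT) & LAYOUT_MASK

def can_parse(header: int) -> bool:
    """True if this agent understands the register layout the bank reports."""
    return layout_of(header) in KNOWN_LAYOUTS

print(can_parse(0b1010_000))  # True: layout field is 0
print(can_parse(0b101))       # False: layout 5 is not defined
```

Rejecting unknown layouts, rather than guessing, keeps an older agent from misreading a bank that uses a future standard layout.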

PR #37 created to address discussion in this issue. Summary of updates as follows:

  1. Update description of the hardware error exception introduced in Priv 1.13.
  2. Update space designated for custom extension vs. reserved for future standard extensions.
  3. Introduce a read-in-progress field and associated controls to set it. Update position of sinv.
  4. Introduce a layout field to allow alternate standard register layouts to be defined.
  5. Rename addr_i to addr_info_i and note it may be used to report component specific information.