riscv-non-isa/riscv-iommu

Special restriction of ATC implementation comparing to the RISC-V IOMMU

Closed this issue · 14 comments

IOMMU specification enforces PCIe ATS “Success” Response for privilege related permission faults:

2.6. PCIe ATS translation request handling

If the translation request has a PASID with "Privilege Mode Requested" field set to 0, or the request does not have a PASID then the request does not target privileged memory. If the U-bit that indicates if the memory is accessible to user mode is 0 then a Success response with R and W bits set to 0 is generated.
If the translation request has a PASID with "Privilege Mode Requested" field set to 1, then the request targets privileged memory. If the U-bit that indicates if the page is accessible to user mode is 1 and the SUM bit in the ta field of the process-context is 0 then a Success response with R and W bits set to 0 is generated.

This looks to me to allow an PCIe ATS implementation to not to cache PRIV fields in the ATC as the PCIe specification suggests:

10.2.3.7 Privileged Mode Access (Priv)

Note: Since the Priv bit is Set only when the requesting Function Sets the Privileged Mode Requested bit, Functions that never set that bit should always receive the Priv bit Clear and thus don’t need to cache it.

And the reference model handles this as:

    // No faults are logged in the fault queue for PCIe ATS Translation Requests
    // with success response.
    if ( req->tr.at != ADDR_TYPE_PCIE_ATS_TRANSLATION_REQUEST ) {
        report_fault(cause, iotval, iotval2, TTYP, DTF,
                     req->device_id, req->pid_valid, req->process_id, req->priv_req);
        // Translated and Untranslated requests get UR response
        goto return_unsupported_request;
    }
    ...
    // When translation could not be completed due to PDT entry being not present, MSI
    // PTE being not present, or first and/or second stage PTE being not present or
    // misconfigured then a Success Response with R and W bits set to 0 is generated.
    // No faults are logged in the fault queue on these errors.
    // The translated address returned with such completions is undefined. The
    // following cause codes belong to this category:
    // * Instruction page fault (cause = 12)
    // * Read page fault (cause = 13)
    // * Write/AMO page fault (cause = 15)
    // * Instruction guest page fault (cause = 20)
    // * Read guest-page fault (cause = 21)
    // * Write/AMO guest-page fault (cause = 23)
    // * PDT entry not valid (cause = 266)
    // * MSI PTE not valid (cause = 262)
    if ( (cause = 12) || (cause == 13) || (cause == 15) || (cause == 20) || (cause == 21) ||
         (cause == 23) || (cause == 266) || (cause == 262) ) {
        // Return response with R=W=0. The rest of the field are undefined. The reference
        // model sets them to 0.
        vs_pte.raw = g_pte.raw = 0;
        is_msi = 0;
        page_sz = PAGESIZE;
        goto step_20;
    }

However PCIe ATS specification also claims:

10.2.3.7 Privileged Mode Access (Priv)

Functions may optionally check that when the Privileged Mode Requested bit is Clear in a Translation Request, the Priv bits in the associated Translation Completion Data Entries are also Clear. If this optional check fails, the Function shall signal Unexpected Completion (UC).

Since an ATC implementation may also choose to implement the above mentioned optional check, returning an ATS_TRANS_FAULT response could also help the ATC side which may cache the fault type to accelerate its check implementation.

Should we remove the "Success" response type enforcement and leave it for the implementation to determine?

In ATS there are 4 types of responses:

  1. Success - with R!=0 and/or W!=0 - this encoding indicates that a translation was completed and a subset of permissions were granted.
  2. Success - with R=W=0 - this encoding indicates that the ATC must not use the address field to make subsequent requests. The TA returns this response when ATS requests are permitted but the translation resulted in a fault. The text quoted from section 2.6 is the faulting conditions associated with privileged requests. The ATC is not allowed to cache these responses.
  3. Unsupported Request (UR) - indicates use of ATS is not supported for this function.
  4. Completer Abort (CA) - indicates that the TA encountered errors during the processing.

For an implementation that uses DTI: the case 1 results in a ATS_TRANS_RESP, case 2 results in ATS_TRANS_FAULT with faulty type set to Invalid Translation, case 3 results in ATS_TRANS_FAULT with faulty type of UR and case 4 with fault type of CA.

Should we remove the "Success" response type enforcement and leave it for the implementation to determine?

For this case, returning UR is not appropriate as ATS is supported and returning CA is not appropriate since TA did not encounter any errors. Returning a Success with permission is not right since there are no permissions. Hence the Success with R=W=0 to indicate that the requested translation result in a page fault.

This looks to me to allow an PCIe ATS implementation to not to cache PRIV fields in the ATC as the PCIe specification suggests:

Consider a mapping with U=0 (user mapping) and where the SUM=1 (supervisor access to user mappings allowed). In this case a Priv=0 request will succeed since its a user mapping. A Priv=1 request will also succeed since SUM=1. An ATC that makes two such requests is required to cache two distinct translations - one with Priv=0 and one with Priv=1.

That makes it very clear. Thanks for the explanation.

case 2 results in ATS_TRANS_FAULT with faulty type set to Invalid Translation,

However, this clarification is still conflict with the reference model:

    // No faults are logged in the fault queue for PCIe ATS Translation Requests
    // with success response.
    if ( req->tr.at != ADDR_TYPE_PCIE_ATS_TRANSLATION_REQUEST ) {
        report_fault(cause, iotval, iotval2, TTYP, DTF,
                     req->device_id, req->pid_valid, req->process_id, req->priv_req);
        // Translated and Untranslated requests get UR response
        goto return_unsupported_request;
    }
    ...
    // When translation could not be completed due to PDT entry being not present, MSI
    // PTE being not present, or first and/or second stage PTE being not present or
    // misconfigured then a Success Response with R and W bits set to 0 is generated.
    // No faults are logged in the fault queue on these errors.
    // The translated address returned with such completions is undefined. The
    // following cause codes belong to this category:
    // * Instruction page fault (cause = 12)
    // * Read page fault (cause = 13)
    // * Write/AMO page fault (cause = 15)
    // * Instruction guest page fault (cause = 20)
    // * Read guest-page fault (cause = 21)
    // * Write/AMO guest-page fault (cause = 23)
    // * PDT entry not valid (cause = 266)
    // * MSI PTE not valid (cause = 262)
    if ( (cause = 12) || (cause == 13) || (cause == 15) || (cause == 20) || (cause == 21) ||
         (cause == 23) || (cause == 266) || (cause == 262) ) {
        // Return response with R=W=0. The rest of the field are undefined. The reference
        // model sets them to 0.
        vs_pte.raw = g_pte.raw = 0;
        is_msi = 0;
        page_sz = PAGESIZE;
        goto step_20;
    }

And RISC-V IOMMU specification:

When translation could not be completed due to the following causes a Success Response with R and W bits set to 0 is generated. No faults are logged in the fault queue on these errors. The translated address returned with such completions is UNSPECIFIED.
• Instruction page fault (cause = 12)
• Read page fault (cause = 13)
• Write/AMO page fault (cause = 15)
• Instruction guest page fault (cause = 20)
• Read guest-page fault (cause = 21)
• Write/AMO guest-page fault (cause = 23)
• PDT entry not valid (cause = 266)
• MSI PTE not valid (cause = 262)

In case of DTI ATS_TRANS_FAULT, a fault record should be generated into fault queue accordingly and I did see other IOMMU implementations generates such special cased permission error faults for PCIe ATS.

In case of DTI ATS_TRANS_FAULT, a fault record should be generated into fault queue accordingly

Generating a fault queue record when Success response is returned is incorrect as there is no action required by the OS for these. These are recoverable faults. In response to these page faults, the ATS is expected to produce a Page Request, if PRI is enabled, asking the OS to make the page resident or update the page permissions and if PRI is not enabled to invoke assistance of its driver.

However, this clarification is still conflict with the reference model:

You may be having a bit older code. Suggest synching again with main.

and I did see other IOMMU implementations generates such kind of permission error faults.

Perhaps referencing the the sections that deal with - responses to ATS translation requests - of those IOMMUs might be helpful.

Generating a fault queue record when Success response is returned is incorrect as there is no action required by the OS for these. These are recoverable faults. In response to these page faults, the ATS is expected to produce a Page Request, if PRI is enabled, asking the OS to make the page resident or update the page permissions and if PRI is not enabled to invoke assistance of its driver.

Yes, I see.

But for the above mentioned special case permission faults, IMO, fault should be recorded and accounted, so that system administrator can notice a wrong device configuration or a potential privacy/security attack.

But for the above mentioned special case permission faults,

There is no reason to cause faults to the OS for permission faults. Faults are reported if a UR or CA response was produced. The Success with R=W=0 response does not carry any other information to risk a privacy or security attack.

I'm a bit confused by your replies. Let me step back to confirm if we have synchronized to the same page.

Before I reponed the issue, my understanding is we both agreed:

  1. There is a special case among page faults, invalid pdtes, invalid msiptes matching "PCIe ATS Success translation completion w/ R/W clear".
  2. The special case (let me call it as PRIV faults hereafter) is related to the U-mode DTI_ATS_TRANS_REQ access of P-mode pages, or P-mode DTI_ATS_TRANS_REQ access of U-mode pages w/o SUM=1.
  3. Per-DTI specification, such "PCIe ATS Success Response w/ R/W clear" is caused by "DTI_ATS_TRANS_FAULT whose FAULT_TYPE is InvalidTranslation".
  4. The DTI master which is supposed to reside in PCIe ATC (in most cases, PCIe RC) should be responsible to reply "ATS Success Completion w/ R/W clear" upon receiving InvalidTranslation DTI_ATS_TRANS_FAULT.

Correct me if my above understanding is wrong.

Please see section 2.6.

  1. These faults result in completer abort (CA) response to the requester. These are also reported in the fault queue to the OS.
  • Instruction access fault (cause = 1)
  • Read access fault (cause = 5)
  • Write/AMO access fault (cause = 7)
  • MSI PTE load access fault (cause = 261)
  • MSI PTE misconfigured (cause = 263)
  • PDT entry load access fault (cause = 265)
  • PDT entry misconfigured (cause = 267)
  1. These faults result in Unsupported Request (UR) response to the requester. These are also reported in the fault queue to the OS.
  • All inbound transactions disallowed (cause = 256)
  • DDT entry load access fault (cause = 257)
  • DDT entry not valid (cause = 258)
  • DDT entry misconfigured (cause = 259)
  • Transaction type disallowed (cause = 260)
  1. These faults result in Success with R=W=0 response to the requester. The are permission faults - including checking for User/Supervisor access permissions. These faults are not reported in fault queue to the OS.
  • Instruction page fault (cause = 12)
  • Read page fault (cause = 13)
  • Write/AMO page fault (cause = 15)
  • Instruction guest page fault (cause = 20)
  • Read guest-page fault (cause = 21)
  • Write/AMO guest-page fault (cause = 23)
  • PDT entry not valid (cause = 266)
  • MSI PTE not valid (cause = 262)
  1. If none of these faults occur then a Success response with available permissions is returned to the requester.

For an implementation that uses DTI: the case 4 results in a ATS_TRANS_RESP, case 3 results in ATS_TRANS_FAULT with faulty type set to InvalidTranslation, case 2 results in ATS_TRANS_FAULT with faulty type of UR and case 1 with fault type of CA.

hi,ved
i want to confirm the hardware implementation.
based on what you two discussed above, when case 3 happen, IOMMU need to reply a response DTI messasge to PCIe, and this response is a DTI_ATS_TRANS_FAULT type message?Further, we should set the Fault Type field to InvalidTranslation?
If so, how do you embody the implementation mentioned in the spec section 2.6 :"then a success response withR=W=0 is generated",because DTI_ATS_TRANS_FAULT message has not any field to allocate R/W bit info.

This is exactly what our concern is:

  1. These faults result in Success with R=W=0 response to the requester. The are permission faults - including checking for User/Supervisor access permissions. These faults are not reported in fault queue to the OS.
  • Instruction page fault (cause = 12)
  • Read page fault (cause = 13)
  • Write/AMO page fault (cause = 15)
  • Instruction guest page fault (cause = 20)
  • Read guest-page fault (cause = 21)
  • Write/AMO guest-page fault (cause = 23)
  • PDT entry not valid (cause = 266)
  • MSI PTE not valid (cause = 262)

According to our compliance tests, other implementation do generate faults into fault queue for all case 3 page faults.

My opinions are:

  1. At least, we should split PRIV faults from all those page faults as: ATS specification allows a function to optionally support PRIV checks; and since PRIV faults are permission faults that could reflect a false device configuration needing administrative care, so far we haven't seen a reason to make these faults silent;
  2. If we do want to decrease fault records of page faults as ATS may send consequent PRIs, we should only make V=0 page faults unlogged while other page faults logged into FQ;
  3. And considering there are cases some PCIe functions have ATS cap w/o PRI cap, and we have single DC.tc.EN_PRI control bit to obtain similar usage model differentiation, we may log all page faults including V=0 faults;
  4. We shouldn't do such special cased fault silence to make hardware design too complicated with no real usage model benefits, which may not follow RISC-V simple and fast design philosophy;
  5. As a consequence, it looks to us the spec contains too many PCIe RC design behaviors rather than IOMMU's, and if not clarified, these statements finally resulted in many wrong ATS special constraints in the reference model.

I can help to answer part of the questions:

i want to confirm the hardware implementation. based on what you two discussed above, when case 3 happen, IOMMU need to reply a response DTI messasge to PCIe, and this response is a DTI_ATS_TRANS_FAULT type message?Further, we should set the Fault Type field to InvalidTranslation?

I think we've already reached the agreement according to the DTI specification:

5.2.3 DTI_ATS_TRANS_FAULT

When the value of this field is InvalidTranslation, this field indicates that ATS requests are permitted but that the translation resulted in a fault. The PCIe RP returns a Translation Completion message with the status value as Success and with the Read and Write bits clear.

image

I can help to answer part of the questions:

i want to confirm the hardware implementation. based on what you two discussed above, when case 3 happen, IOMMU need to reply a response DTI messasge to PCIe, and this response is a DTI_ATS_TRANS_FAULT type message?Further, we should set the Fault Type field to InvalidTranslation?

I think we've already reached the agreement according to the DTI specification:

5.2.3 DTI_ATS_TRANS_FAULT

When the value of this field is InvalidTranslation, this field indicates that ATS requests are permitted but that the translation resulted in a fault. The DTI master returns a Translation Completion message with the status value as Success and with the Read and Write bits clear.

thanks a lot, i think there should be no doubt about this part

According to our compliance tests, other implementation do generate faults into fault queue for all case 3 page faults.

Did you mean other architectures or implementations of the RISC-V IOMMU architecture? While I cannot comment about other architectures I would suggest a more careful reading of their specifications around topics like responses to ATS translation requests. Causing a fault into fault queue is not useful here because there is no action needed from the system software.

At least, we should split PRIV faults from all those page faults as: ATS specification allows a function to optionally support PRIV checks; and since PRIV faults are permission faults that could reflect a false device configuration needing administrative care, so far we haven't seen a reason to make these faults silent;

Privilege checks are also permission checks and when a permission is denied the response is a page fault - Success with R=W=0. The function is required to treat permissions for requests with privilege mode separately from requests without privilege mode. In such cases the device in response to a page fault response may generate a page request for invoking OS help.

If we do want to decrease fault records of page faults as ATS may send consequent PRIs, we should only make V=0 page faults unlogged while other page faults logged into FQ;

Permission faults do not need OS intervention. If OS or driver intervention is required then the function will issue required PRIs.

And considering there are cases some PCIe functions have ATS cap w/o PRI cap, and we have single DC.tc.EN_PRI control bit to obtain similar usage model differentiation, we may log all page faults including V=0 faults;

When a function receives a page fault response to an ATS request then what action it performs is device implementation specific. It may invoke a page request using its PRI capability or it may invoke a assistance from its driver in a device specific manner.

We shouldn't do such special cased fault silence to make hardware design too complicated with no real usage model benefits, which may not follow RISC-V simple and fast design philosophy;

Causing the fault is useless and just introduces un-actionable noise to the OS.

Causing a fault into fault queue is not useful here because there is no action needed from the system software.
Causing the fault is useless and just introduces un-actionable noise to the OS.

I will check with this in details and close the issue now.
Thanks for the conversation and clarification.