nxp-mcuxpresso/mcux-sdk

FLEXCAN workaround for chip errata 6032 may cause intermittent RX FIFO overflows in an RTOS multithreaded environment.

Closed this issue · 6 comments

Is your enhancement proposal related to a problem? Please describe.
We were updating our project from SDK version 2.4.2 to 2.10.0 when we observed intermittent RX FIFO overflows on FLEXCAN. After much debugging, we were able to narrow it down to the chip errata 6032 workaround function causing this issue. We run a multithreaded RTOS, and whenever a thread wanted to send a frame on CAN, it could disable all FLEXCAN interrupts, then context switch to another task leaving FLEXCAN interrupts disabled, eventually causing the 6 frame RX FIFO to overflow.

Describe the solution you'd like
Either an option added to disable the 6032 errata workaround, or to have it implemented in such a way that interrupts cannot be disabled long enough to overflow the RX FIFO. I am willing to contribute a pull request for one of these options if you can provide some guidance on which approach is preferred.

Describe alternatives you've considered
We have locally modified the SDK to disable the workaround and confirmed we no longer get intermittent RX FIFO overflows.

Additional context
Similar data loss issues also appear to be present in the implementation of LPUART. Whenever transmit/receive functions are called from an RTOS task, LPUART interrupts are disabled/enabled and if a task switch occurs or some higher priority task is running, received data on the LPUART can be lost.

Hi @abascom, thanks for reaching us and reporting the issue here.

After discussed with developer, I got the feedback as below:

  • For the workaround of errata 6032, we do not suggest you to remove that because possibly you will met below error as errata states: "A frame with wrong ID or payload is transmitted into the CAN bus when the Message Buffer under transmission is either aborted or deactivated while the CAN bus is in the Bus Idle state".
  • The developer suggest to update the chip errata 6032 workaround function. Currently the function disables all FlexCAN interrupts before MB configuration, and it could results issue when context is switched to other tasks leaving the FlexCAN interrupt disabled, cause RX overflow. The developer suggest to update the disable all FlexCAN interrupts to disable all interrupts. In this way could avoid the context switch happens before MB configuration is written.

The developer appreciate if you could help test the suggestion and contribute via PR back to us.

We can certainly give that a try.

With our CAN bus utilization, we estimate we could fill the RX FIFO in ~1ms. Does the developer have any data around how long the busy waits in the 6032 errata workaround could take in the worst case?

@abascom, yes, please have a try. For your question related to the busy waits, I will provide the feedback from developer later, as he is currently on Holiday leave. Appreciate for your patience.

I have tested the suggested change it is has been working well (I've left it running for a couple of days with regular CAN traffic).

I do have another question around the errata workaround.
"

  1. Disable all interrupt
  2. Read CAN_DBG1.CFSM and CAN_DBG1.CBN fields.
  3. Check if CFSM value is either BUSIDLE, RXINTERMISSION or TXINTERMISSION.
    For the later two values, also check if CBN value is 3, to determine the paired conditions
    RXINTERMISSION bit 3 or TXINTERMISSION bit 3, and processed as described below.
    3.1. If CAN_DBG1 fields indicate BUSIDLE, wait N CPU clocks.
    3.2. Else if CAN_DBG1 fields indicate either RXINTERMISSION bit 3 or TXINTERMISSION
    bit 3 wait until CFSM is different from either RXINTERMISSION or TXINTERMISSION.
    3.3. Check again CAN_DBG1 fields, if they indicate BUSIDLE, wait for DELAY time.
  4. Write 0x0 into Code field of CS word.
  5. Enable all interrupts.
  6. Write the ID word.
  7. Write the DATA words.
  8. Write 0xC into Code field of CS word.
    "

I have tested changing step 1 and 5 to disable/enable all interrupts instead of just flexcan interrupts to get rid of context switching. However even doing that, after enabling interrupts in step 5, the OS could context switch and steps 6, 7, and 8 could happen some time much later. Is this fine? Or should the enable all interrupts part actually be moved to the last step?

I am also still curious on the busy wait worst case timing question as soon as the developer is back and able to help answer.

Thanks for your help!

hi @abascom, I'm the SDK flexcan driver developer.
For the first question, base on my understanding of the problem, the worst case wait time before sending each frame is 2 frame bits of time (if CAN bitrate is 500kbps, the wait time is 4us).

For the second question, it is no need to move enable all interrupts to the last step. For this errata, there is no issue if the CPU guarantees that any MB configured for transmission will not be aborted or deactivated (write 0x0 into CS Code field is an abort operation) in a short interval. If we do not turn off interrupts before doing step 4, we cannot guarantee that this MB has not just been configured, however, once we have performed step 4, neither reconfiguring the MB nor aborting it again will cause an error to be generated, so there is no need to disable the interrupt.

I've entered pull request (#72) for this issue.