Possible race condition between Uart::write(data) and Uart::IrqHandler() ??
Opened this issue · 4 comments
I am working with a MKRNB1500 with MKRNB. Sometimes I see what appears to be an incomplete or no response from the SARA modem. The interface to the SARA modem is a UART serial interface.
I started looking at the UART interface code and I think I have identified a possible race condition between the Uart::write(data) method and the Uart::IrqHandler(). Here is the sequence:
- Upon completion of a string write operation to the UART interface that uses the transmit RingBuffer the UDRE bit is set after the write of the last char removed from the transmit RingBuffer an written to the UART interface. Also, the UDRE interrupts are disabled in the Uart::IrqHandler() when there are no more characters in the transmit RingBuffer (see line 117 in Uart.cpp). This is normal operation.
- The next time Uart::write(data) is called the if logic at line 160 of Uart.cpp is executed because URDE is set and the transmit RingBuffer is empty. The write of the char to the UDR register clears the UDRE bit.
- If Uart::write(data) is called again before the UDRE bit has been set (i.e. before the char has been moved to the Transmit Shift Register), the else logic is executed.
- Assume that BEFORE the else logic can store the character in the RingBuffer (line 185 in Uart.cpp) the UDRE bit is set AND the Uart: IrqHandler() is driven by the TXC interrupt. Note that UDRE interrupts are still disabled so we don’t get a UDRE interrupt.
- The Uart::IrqHandler() suspends the Uart::write() processing and even though the UDRE bit is set it doesn’t find anything on the transmit RingBuffer because of step 4.
- The Uart::write() else logic resumes after the Uart::IrqHandler has exited and adds the char to the transmit RingBuffer and enables UDRE interrupts. Unfortunately, this char and all subsequent write chars are stuck on the transmit RingBuffer until another interrupt is generated for the Uart::IrqHandler(), e.g. an RXC.
I am not sure how the SARA modem would react to an incomplete AT command but it probably isn’t good.
In order to avoid a problem like this it seems like it would be better to first add the write data to the transmit RingBuffer and then use the Uart::flush() method to initiate the data transfer over the UART interface. The proposed code would require modifications to the Uart::write(data) and Uart::flush() methods:
///RTN Implementation Uart of flush() and write(data) methods that address race condition between
/// write(data) and IrqHandler which can leave part of a Uart write operation stranded in the
/// RingBuffer.
void Uart::flush()
{
bool firstWriteChar = true;
// Only initiate write to UART interface for first char in transmit RingBuffer. Let the Uart::IrqHandler()
// take over the write operation and empty the transmit RingBuffer.
while(txBuffer.available())
{
if (firstWriteChar && sercom->isDataRegisterEmptyUART())
{
uint8_t data = txBuffer.read_char();
sercom->writeDataUART(data);
sercom->enableDataRegisterEmptyInterruptUART();
firstWriteChar = false;
}
}
sercom->flushUART();
}
///RTN Queue all write data to the RingBuffer until either RingBuffer is full or flush() is called to
/// complete the Uart write operation.
size_t Uart::write(const uint8_t data)
{
if (txBuffer.isFull())
{
flush();
}
txBuffer.store_char(data);
return 1;
}
Note that the Uart::flush() method must always be invoked to push the data out the UART interface. This doesn't seem to be much of a problem for the MKRNB1500 using the MKRNB library because the Modem.send() method in the Modem.cpp source file of the MKRNB library always does this.
I am now running a modified Uart.cpp with the above code changes to see if I still get my infrequent hangs. Note that the Uart::write(data) method above does not have the logic to address Issue #262. That issue only appears to happen when the Uart::write(data) is invoked with interrupts disabled or inside a higher priority ISR. Specifically, it appears that the Issue #262 reproduction test used println() commands inside an ISR to reproduce the problem and I am not doing anything like that. The Uart::write(data) issued on the SARA modem UART interface is only invoked in the background (non ISR) task priority.
I am not sure if I am missing something, and I would appreciate it if someone would comment on this hang scenario and possible solution.
Based on the title, I think #472 could also be related. I haven't looked at your issue in detail, but I think it's not exactly the same issue, but they might be good to compare.
Also, I'd suggest you use ```
markers to mark your code blocks (see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code)
Thanks @matthijskooijman for looking at this. However, the scenario I have identified by code inspection only occurs with an empty transmit Ringbuffer. It is a timing issue where interrupts are enabled and the samd Uart::IrqHandler is driven with a TXC interrupt from a previous write of a character to the UDR register AND this interrupts the Uart::write(data) method which is running in the normal (non IRQ) processing. I am assuming that any Uart IRQ interrupt will suspend non IRQ processing. I also assume there is only one processing thread on the MKRNB1500 Cortex processor. These seem like reasonable assumptions for a simple IOT device.
I am using the write(data) code from the \samd\1.8.13\cores\arduino\Uart.cpp:
size_t Uart::write(const uint8_t data)
{
if (sercom->isDataRegisterEmptyUART() && txBuffer.available() == 0) {
sercom->writeDataUART(data);
} else {
// spin lock until a spot opens up in the buffer
while(txBuffer.isFull()) {
uint8_t interruptsEnabled = ((__get_PRIMASK() & 0x1) == 0);
if (interruptsEnabled) {
uint32_t exceptionNumber = (SCB->ICSR & SCB_ICSR_VECTACTIVE_Msk);
if (exceptionNumber == 0 ||
NVIC_GetPriority((IRQn_Type)(exceptionNumber - 16)) > SERCOM_NVIC_PRIORITY) {
// no exception or called from an ISR with lower priority,
// wait for free buffer spot via IRQ
continue;
}
}
// interrupts are disabled or called from ISR with higher or equal priority than the SERCOM IRQ
// manually call the UART IRQ handler when the data register is empty
if (sercom->isDataRegisterEmptyUART()) {
IrqHandler();
}
}
txBuffer.store_char(data);
sercom->enableDataRegisterEmptyInterruptUART();
}
The window I am talking about is very small. It occurs when the Uart IrqHandler interrupts this code right after the else statement logic begins execution. The else logic is executed because the UDRE bit has not yet been set from the previous write on the Uart UDR register. If the UDRE bit is then set and TXC interrupt occurs right at the beginning of the else logic, the Uart IrqHandler would suspend this thread and begin execution. When the thread execution returns after the TXC interrupt is processed by the Uart IrqHandler, the while loop code is not executed because the transmit RingBuffer is empty. The data is then stored on the RingBuffer and UDRE interrupts are enabled. Unfortunately, the Uart IrqHandler will not get driven since the TXC interrupt for the previous write to the UDR has already completed processing.
I am assuming that any Uart IRQ interrupt will suspend non IRQ processing. I also assume there is only one processing thread on the MKRNB1500 Cortex processor. These seem like reasonable assumptions for a simple IOT device.
Yeah, I think those are indeed true.
From a quick reading, your analysis sounds good (and it does seem plausible that the code has race conditions), though I have not looked at the code in detail. I also won't find time for that, so I will leave this for others to comment on for now (I just wanted to point out the other issue, which is also related but, as you say, indeed a different issue).
To summarize I am running the modified Uart::flush() and Uart::write() methods documented at the top of this thread to make sure I don't hit the suspected issue I described above. I also believe these changes eliminate the need to define txBuffer as a SafeRingBuffer. There have been previous discussions about the lack of RingBuffer safety. See [https://github.com/arduino/ArduinoCore-API/issues/195] created by @KurtE.
I also made a change to protect against a double read of the UDR. It looks like that could happen if there is a frame error.
///RTN Modified Uart IrqHandler which guarantees the Handler never does a double Read of Uart UDR buffer.
void Uart::IrqHandler()
{
if (sercom->isFrameErrorUART()) {
// frame error, next byte is invalid so read and discard it
sercom->readDataUART();
sercom->clearFrameErrorUART();
} else if (sercom->availableDataUART()) {
rxBuffer.store_char(sercom->readDataUART());
if (uc_pinRTS != NO_RTS_PIN) {
// RX buffer space is below the threshold, de-assert RTS
if (rxBuffer.availableForStore() < RTS_RX_THRESHOLD) {
*pul_outsetRTS = ul_pinMaskRTS;
}
}
}
if (sercom->isDataRegisterEmptyUART()) {
if (txBuffer.available()) {
uint8_t data = txBuffer.read_char();
sercom->writeDataUART(data);
}
else {
sercom->disableDataRegisterEmptyInterruptUART();
}
}
if (sercom->isUARTError()) {
sercom->acknowledgeUARTError();
// TODO: if (sercom->isBufferOverflowErrorUART()) ....
// TODO: if (sercom->isParityErrorUART()) ....
sercom->clearStatusUART();
}
}
I believe these modified Uart::flush() and Uart::write() methods would allow txBuffer to be defined as a simple RingBuffer for the following reasons:
- txBuffer.store_char() is always invoked in background (non ISR) processing before a write operation is initiated on the UART interface.
- After Uart::flush() writes the first character to the UDR, txBuffer.read_char() is only performed in the ISR (Uart::IrqHandler()) and Uart::flush() simply waits in a while loop until the txBuffer is emptied by the ISR.
- The only requirement is that the Uart::flush() method MUST always be invoked in order to initiate write operations on the UART interface.
I have not looked at the changes needed to allow the rxBuffer to be defined as a simple RingBuffer, however, @KurtE has discussed the need to eliminate the _numElems variable and only use the _iHead and_iTail indices in the RingBuffer class. This would require changes in almost all of the RingBuffer methods. I have made no changes in the RingBuffer class methods for txBuffer safety.