eclipse-threadx/guix

_gx_system_thread_entry gets stuck in tx_queue_receive after a few hours

nflandin opened this issue · 10 comments

After a few hours (exact timing seems to vary) of running smoothly, GUIX stops responding.

Target Device: STM32H743ZI
Azure RTOS 6.1.7 with GUIX
IAR EWARM 9.30.1
Attempted Diagnoses & Workarounds: We've looked through the TraceX results, examined all the static variables we could, and stepped through all related code that we could find both before and after the failure occurs.

We have stripped out most of our application, and now have a project which is running only GUIX (once per 20ms) and a 'housekeeping thread', the latter of which runs once per second and updates one icon in the GUI (along with sending a few other events).

After running for some time, and responding to all the latest set of events, GUIX calls:

status = tx_queue_receive(&_gx_system_event_queue, &event_memory[0], TX_WAIT_FOREVER); in 'VOID _gx_system_thread_entry(ULONG id)'

It never emerges from this call, no matter how many events pile up in its queue.

This is a major problem for us, because we do not understand the cause of the error, or what may cause it to worsen further. We also think that the failure may be causing some sort of issue with tx_thread_sleep(), but we have not been able to verify this. The TraceX results look corrupted, and appear to be inconsistent with the observed behavior of the device.

STM32H743ZIGUIXstuck.zip
The failure occurs after event 650 in the log.

The GUIX system timer is continuing to expire in the fault state, and sending the appropriate events to the GUIX system queue. We have verified this both by stepping through the ThreadX timer expiration function, and by setting breakpoints in the appropriate places.

Interestingly, this issue appears to affect the STDIO (via SWD), causing pseudo-random corruption (likely missing bits) of approximately one-tenth of messages.

Our issue seems to present similarly to eclipse-threadx/threadx#137, but I have not found any similar cause.

Here are screen captures of the variables associated with the queue and the thread respectively:

[Screenshot: GX QUEUE]
[Screenshot: GX THREAD]

I haven't yet gone through all of your input and screen shots, but we did have a very similar issue reported not long ago. In that case, the problem turned out to be the BASEPRI setting, which is a ThreadX configuration value. If this setting is wrong it can allow pre-emption during critical code sections and corruption of things like event queues. Can you check this value please? Let me know what you find and if that doesn't fix the issue I will dig into it first thing tomorrow.

@nflandin this is the resolution posted in the previous thread: "Finally I found the root cause. It's my mistake because define different TX_PORT_BASEPRI value in C preprocess symbol and condition asm control symbol.
I expect TX_PORT_BASEPRI to be 0x10(16), but TX_PORT_BASEPRI is 10 in asm, which will cause interrupt disable fail. Thank you for your help and patience" I'm hoping you are seeing a similar issue.
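To illustrate why a decimal 10 in place of 0x10 could make interrupt disabling fail outright, here is a minimal sketch. It assumes a Cortex-M part that implements only the top 4 bits of the 8-bit priority field (the quoted user's device is not stated, so this is an assumption), and `basepri_effective` is a hypothetical helper, not a ThreadX function. Unimplemented low-order bits are ignored on write, so the register ends up holding only the implemented bits.

```c
#include <stdint.h>

/* Hypothetical helper: the value BASEPRI holds after a write, assuming only
 * the top `prio_bits` bits (1..8) of the priority field are implemented. */
static uint8_t basepri_effective(uint8_t written, unsigned prio_bits)
{
    /* Mask that keeps only the implemented (most significant) bits. */
    uint8_t implemented_mask = (uint8_t)(0xFFu << (8u - prio_bits));
    return (uint8_t)(written & implemented_mask);
}
```

Under that 4-bit assumption, `basepri_effective(0x10, 4)` keeps 0x10, while `basepri_effective(10, 4)` collapses to 0, and a BASEPRI of 0 means no masking at all.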

@jdeere5220 Thanks for the prompt response! We actually saw the thread you mentioned and checked for that possibility a little over a week ago. Unfortunately, it didn't appear to be the same root cause in our case since we're defining TX_INCLUDE_USER_DEFINE_FILE in our preprocessor symbols and then only ever defining BASEPRI in the "tx_user.h" file. I also wanted to take this opportunity to tie this issue here to the corresponding question thread here for convenience and to make sure all our data is consolidated.

The BASEPRI value we are using is 5, by the way, to answer your earlier question. Any guidance on investigative directions or answers, if you have them, would be amazing. We've been grappling with this for a while now.
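For reference, a priority level such as 5 is usually not the raw value written to the register. The sketch below shows the common encoding, assuming 4 implemented priority bits (typical for STM32 parts, but an assumption here); `PRIO_BITS` and `priority_to_register` are hypothetical names, not part of ThreadX.

```c
#include <stdint.h>

#define PRIO_BITS 4u  /* assumption: number of implemented priority bits */

/* Hypothetical helper: shift a logical priority level into the implemented
 * (most significant) bits of the 8-bit priority field. */
static uint8_t priority_to_register(uint8_t level)
{
    return (uint8_t)(((unsigned)level << (8u - PRIO_BITS)) & 0xFFu);
}
```

With these assumptions, level 5 encodes to 0x50 and level 4 encodes to 0x40.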

@Momo-12 BASEPRI also needs to be defined for assembly files.

@goldscott Thanks for pointing that out. Just found the usage of TX_PORT_BASEPRI in tx_thread_interrupt_disable.s. This may be our issue since there isn't a definition for that symbol in the assembly space. Testing this now and will get back as soon as I have results.

Okay, I have found a couple of (potential) issues.

First, at least in our port, _tx_initialize_low_level sets the SysTick priority to 0x40, with no reference to BASEPRI (0x50). Shouldn't those two be linked somehow? BASEPRI could automatically be set to the SysTick priority, unless overridden by a macro.

Second, _tx_initialize_low_level references __vector_table, which could cause problems for us, as we relocate the vector table to RAM. Wouldn't it be better for ThreadX to pick up the vector table address from SCB->VTOR?

I'm still working these issues, but the BASEPRI idea seems to be a promising lead, though not in the same way the previous user/developer described.

@goldscott Okay, so in our port all the interrupt control assembly functions are actually overridden in tx_port.h inline assembly. There, it uses our TX_PORT_BASEPRI, although it's been a bit tricky stepping through the functionality here with an attached debugger what with everything being inline. Nevertheless, we've verified that the BASEPRI register changes at runtime to confirm absolutely that the control symbols are recognized.

@nflandin - you can configure the SysTick priority to be whatever you want. 0x40 is just our example. I suggest setting BASEPRI to a priority at least as high as the SysTick priority (that is, a numerically equal or lower value) so that SysTick interrupts are blocked. You can edit the vector table, startup code, and tx_initialize_low_level to fit your needs.
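The masking rule behind this advice can be sketched as a small check. On Cortex-M, a nonzero BASEPRI blocks every interrupt whose priority value is numerically greater than or equal to BASEPRI, and a BASEPRI of 0 disables masking. `masked_by_basepri` is a hypothetical helper for illustration, not a ThreadX or CMSIS function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if an interrupt with the given priority register
 * value is blocked while BASEPRI holds `basepri`. */
static bool masked_by_basepri(uint8_t basepri, uint8_t irq_priority)
{
    /* BASEPRI == 0 means masking is disabled; otherwise, priority values
     * that are numerically >= BASEPRI are blocked. */
    return basepri != 0u && irq_priority >= basepri;
}
```

With the values mentioned in this thread, `masked_by_basepri(0x50, 0x40)` is false, so a SysTick at 0x40 would still fire inside sections protected by a BASEPRI of 0x50. Lowering BASEPRI to 0x40, or raising the SysTick priority value to 0x50 or above, makes the check true.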

@Momo-12 great to hear!

Closing the issue. If further issues arise, feel free to reopen it.