mcci-catena/arduino-lmic

Remove lengthy interrupt disables

terrillmoore opened this issue · 7 comments

On some BSPs "interrupts disabled" stop the clock from advancing properly. hal_waitUntil() hangs in os_init() because time is not advancing. See #523 and #521.

Various BSPs handle time differently when interrupts are disabled, and start up with different interrupt enabled/disabled state. The MCCI BSPs all have the property that time advances even when interrupts are disabled. But many BSPs do not.

The LMIC needs to be reworked to remove all the interrupt enable/disable logic that's presently there, as I am certain that there's no need for extended interrupt disables. Thanks @aparcar and @JackGruber for finding the root cause.

@altishchenko Just curious if are taking over this repository? This particular issue breaks quite a few boards, and I think @terrillmoore was going to work on a fix. I've got a handful of issues I've reported that are related to this issue as well that prevent using OTAA and ADR. I'm happy to work with you for a fix.

@czuvich Clay, I didn't really think about that, I am new to this community and, I am afraid, I won't be able to give a repository of such popularity required amount of attention in the long run. I don't know why @terrillmoore is not with us for some time now, probably busy on another more important project.
It just happens so that we are integrating this library as a replacement for the extremely heavyweight STM/Semtech protocol implementation we use at the moment and code economy of 15K+ is not something you can neglect in an MCU.
I am keen on fixing as many issues as I stumble upon, including bugs and architectural deficiencies, and I am more than happy to share my solutions. If you like, we can switch over to e-mail in order not to bloat the problem reporting cases with discussions (drop me a note at atishch@gmail.com, if you would).

As for the list of most important problems, I will summarize them as questions, because for some of them I don't have good convincing (me) reasons:

  1. Why enabling interrupts (and I can't disable them for other reasons) prevents OTAA joins at SF7/DR5, but works pretty good with SF12? - SF7 works ok with interrupts disabled as confirmed by @jpmeijers in #547.
  2. Why hal_waitUntil() function uses absolute time, as 90% of calls to it follow the same pattern of os_getTime() + delta? And what hal_waitUntil() does is - it calculates delta back and delays on it - two unnecessary calls to os_getTime() and delta arithmetic.
    2* Is there a situation (really) that hal_waitUntil() gets time in the past to wait for?
    2** What are the particular cases, where hal_waitUntil() does its intended function - waits until a determined moment in time and can they be converted to deltas? (4 places, all of them TX/RX window positioning).
  3. Is there a better way to provide generic, platform neutral delay based on CPU frequency, like many critical functions do, instead of interrupt-counter dependable RTE functions? (I have one for our project, but it is not at all generic and pretty much Cortex based).
  4. Where in the code disable/enable interrupts are really warranted?
  5. Can the race condition in os_runloop_once() be really solved by a simple flag?
  6. Does the function executed in os_runloop_once() require interrupts to be enabled?
  7. Does the code really really need microsecond precision for the whole of the library or may be changing the scale to 1ms and keeping microseconds only where they are definitely required will improve the complexity? (see 2)
  8. What are the low power possibilities out there in Xduino world, so the library code can be tailored to suite them? - I have low power with sleep and stop/standby, but my implementation is very "project"-specific. Providing generic solution to HAL API requires generic knowledge of the devices used mostly, which I lack.

These questions are biased towards our project and do not reflect 90% of standard use cases. Like, we don't need packaged functionality to schedule arbitrary timed jobs, we have it in another place, so the jobs that run in os_runloop_once() are only the ones scheduled by the library itself.

For the whole, I have it running pretty stable now - nearly a week without reset or rejoin, low power works fine and does not affect the data exchange, ADR works, OTAA works, race lockups or resets didn't happen for a good while, it does not now affect performance of the other project components. But again, all of this is very project specific, I just don't have these Arduino boards to try out the examples :)
Let's talk by e-mail!

Cheers,
Alex

@altishchenko Wow this is incredible feedback! You've put in way more effort than I have... I ended up punting to ABP without ADR since my use case is fairly simple and being prototyped. In terms of response to those questions, I really only have a little... and yes I'd love to connect with you via email. I'll be sending it from czuvich@gmail.com.

What are the low power possibilities out there in Xduino world,

LMIC from what I could tell does a good job keeping the radio asleep when not in use (I read 1.6uA on my custom board using nRF52840).

To add onto that, I would like to see an example that saves off all of the necessary information which allows you to power cycle the mCU without doing a rejoin (perhaps this falls under low power possibilities you mentioned earlier). I've scoured the forums for a good solution, but it just gets bounced around. My mCU power cycles after a deep sleep wake-up which is another reason why I use ABP. Anyways... I'll shoot an email to you and maybe we can at a minimum collaborate our findings. Again.. thanks for the very detailed explanations!

Hi @czuvich
Send my ESP32 to DeepSleep and wake up without rejoin.
https://jackgruber.github.io/2020-04-13-ESP32-DeepSleep-and-LoraWAN-OTAA-join/
its not perfect, but works for me.

@JackGruber Thank you! This is perfect.

@JackGruber Nice article. As for the note at the end of your article concerning micros() - this is one my concerns too, which I outlined in question 7 above. The MCU I work with has real time calendar module, (and ESP32 has one too), but its precision is milliseconds. The same story applies to Zero line of Arduinos, which are based on SAMD21. Fortunately for my case (STM32) and for the Zero line (SAMD21), SLEEPDEEP does not perform Reset on exit and RAM contents is preserved in power down, comparing to ESP32 which loses CPU context unless you are running from high speed RTC RAM. Duty cycles are definitely millisecond/seconds precision, so they can be managed by RTC calendar clock base. The problem stems from the fact that the same clock base in the library is used to calculate receive window timing and keep track of regular jobs lengthy jobs. This is my main concern at the moment as I am trying to decouple TOA and RX window calculation from usual house keeping which can be RTC-millisecond based.

@JackGruber Forgot to mention. One of the quick solutions I used was to have a flag that gets set in os_setCallback/os_setTimedCallback pair and gets reset in os_runloop_once() in place of hal_sleep(), it is taken into account when going low power decision is made. TX duty cycles are managed outside LMIC in my case, as soon as I succeed with LMIC+RTC, I will let LMIC manage them.