manuelbl/ttn-esp32

Joining infinite loop

BryanMM opened this issue · 26 comments

Greetings, I've been trying to use your library for quite some time, but I keep getting stuck in an infinite join loop.
Sometimes it connects once and starts sending messages, but they are usually either rejected by TTN (v2) or never reach the platform at all.
I've checked all the keys and reinstalled the component countless times, but it doesn't help.
The board I'm currently working with is Heltec's Wireless Stick Lite.
Any help or advice on this issue would be appreciated.
Thanks.
image

This output doesn't look familiar. I've never seen such a loop. The most suspicious part is the transmission right after the join. It's still within the window of the first join so it could easily confuse the network and reset the successful join.

Are you using some sample code for this test? If not, can you post the code?

And what region are you in?

Hi, thanks for the answer. I'm in the North American region, and yes, I'm testing with the hello world sample code. I tried a different gateway (TTIG) with v2 and got better results; the only difference is that the first 3 messages after the join are sent with no payload.
I saw in a pull request on this repository that there's an issue with 8-channel gateways, and before that I was using Heltec's HT-M01. Maybe that's the issue?

The 8 channel limitation could indeed be an issue. The good news is that the underlying LMIC library has just released a new version, which supposedly improves the channel handling for regions like US. I will soon integrate the new version. Unfortunately, I can't test it as I'm in Europe and don't have a lab to simulate the US region.

I applied the changes from the pull request I saw here; that's how I got the TTIG (which is also an 8-channel gateway) working, so as you said, I think the issue must be related to that. The HT-M01 isn't working yet, though.
If needed, I can help with testing once the implementation is done.

Hi, I am having the same issue raised here and using the Hello_World example. I am using AU915.

The code hangs at this line: xQueueReceive(lmicEventQueue, &event, portMAX_DELAY);
in the TheThingsNetwork.cpp file. I have managed to get two packets to send but I don't believe it was a result of any changes I made, going by this discussion.

Is it possible to lock in a DR (SF7BW250) and lock in a channel (in case I use a single channel gateway)?

Cheers

@DylanGWork It's currently not possible to set DR. That's planned though. There are no plans however for single channel operation.

An upcoming version will support a few changes relevant for you (@BryanMM and @DylanGWork):

  • Sub-band 2 is automatically selected for regions with sub-bands (incl. US915). If that's insufficient, it can be selected explicitly.
  • The data rate can be locked by disabling ADR (ttn.setAdrEnabled(false)) and setting the data rate (e.g. ttn.setDataRate(kTTNDataRate_US915_SF7);).

The changes are in the master branch. I would appreciate it if you gave it a try.

@manuelbl Got it! I'll be testing it soon.

@manuelbl Tested and working great. Is the C version new too? I recall there only being a C++ version.

Great work!

@DylanGWork Thanks for testing. Yes, the C version is new too.

@manuelbl I've been testing it too and it gave great results. I no longer need workarounds for the initial message (usually the first and second uplinks bounced until a third one was sent, and it had some probability of failure from there onwards).
I tested the band selection and spreading factor functions and they also worked great.
Good job, man.

@BryanMM Cool. Thanks for testing.

@manuelbl
I have the same problem with an infinite loop when joining. I am using the ttn_join_provisioned() method to connect. If the gateway is enabled, the method successfully returns true, but if the gateway is disabled or out of reach of the device, I never get false and the method stays blocked. Help me understand: under what conditions should ttn_join_provisioned() return false?

I (1258) ttn_prov: DevEUI, AppEUI/JoinEUI and AppKey saved in NVS storage
I (8472) ttn: event EV_JOINING
I (8534) ttn: event EV_TXSTART
I (13569) ttn: event EV_RXSTART
I (14565) ttn: event EV_RXSTART
I (14839) ttn: event EV_JOIN_TXCOMPLETE
I (78616) ttn: event EV_TXSTART
I (83650) ttn: event EV_RXSTART
I (84646) ttn: event EV_RXSTART
I (84920) ttn: event EV_JOIN_TXCOMPLETE
I (149551) ttn: event EV_TXSTART
I (154585) ttn: event EV_RXSTART
I (155581) ttn: event EV_RXSTART
I (155855) ttn: event EV_JOIN_TXCOMPLETE
I (226413) ttn: event EV_TXSTART
I (231498) ttn: event EV_RXSTART
I (232494) ttn: event EV_RXSTART
I (232768) ttn: event EV_JOIN_TXCOMPLETE
I (356974) ttn: event EV_TXSTART
I (362060) ttn: event EV_RXSTART
I (363056) ttn: event EV_RXSTART
I (363330) ttn: event EV_JOIN_TXCOMPLETE
I (483475) ttn: event EV_TXSTART
I (488560) ttn: event EV_RXSTART
I (489556) ttn: event EV_RXSTART
I (489830) ttn: event EV_JOIN_TXCOMPLETE

@maizezoidberg That's a good question indeed. The ttn_join() and similar functions mainly return false if no provisioning keys have been provided or if they are invalid. If the device cannot join immediately, it will keep trying. In particular, the spreading factor will be increased to improve the chances of reaching a gateway. As the spreading factor is increased, the time between retries is also increased. I'm not sure if it ever gives up and returns false. Probably not.
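The growing retry interval can be read off the log posted above. The following is only a small sketch that takes the EV_TXSTART timestamps from that log and computes the gap between consecutive join attempts; the function and array names are made up for illustration and are not part of the library or LMIC.

```c
#include <stddef.h>
#include <stdint.h>

/* Timestamps (ms since boot) of the EV_TXSTART lines in the log above. */
const uint32_t join_tx_start_ms[] = {
    8534, 78616, 149551, 226413, 356974, 483475
};

/* Hypothetical helper: writes the gap between consecutive attempts,
 * in whole seconds (rounded down), into out[].
 * Returns the number of gaps written (count - 1, or 0 if count < 2). */
size_t join_retry_gaps_s(const uint32_t *ts_ms, size_t count, uint32_t *out)
{
    if (count < 2) {
        return 0;
    }
    for (size_t i = 1; i < count; i++) {
        out[i - 1] = (ts_ms[i] - ts_ms[i - 1]) / 1000;
    }
    return count - 1;
}
```

For the six attempts in that log, the gaps come out as 70, 70, 76, 130 and 126 seconds, which is consistent with the retry interval growing as the join is retried.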

How could we improve the library? Should we add a timeout parameter to the ttn_join() functions? If so, a realistic timeout is 10 minutes or more. Or should the function be changed to be asynchronous? It would make it easier to handle the error case but more difficult to handle the regular case.

@manuelbl,
Thanks for your quick response. In fact, the LMIC follows the https://www.thethingsnetwork.org/docs/devices/bestpractices/ specification for best practices. The device should use JOIN very rarely. Considering that the ESP32 does not have very low power consumption during operation, we could add a "use_continuous_join" flag to the ttn_join() method. If this flag is NOT set, we watch for EV_JOIN_TXCOMPLETE (which means that no response to the JOIN was received from the gateway) and return an error in the event_callback(...) method. But after that, we must stop the JOIN process of the LMIC itself. Otherwise, we will exit the ttn_join() method while the LMIC still tries to connect. This is one possible solution. I'm ready to test it.

ttn_event_t ttn_event = TTN_EVENT_NONE;

if (waiting_reason == TTN_WAITING_FOR_JOIN)
{
    if (event == EV_JOINED)
    {
        ttn_event = TTN_EVNT_JOIN_COMPLETED;
    }
    else if (event == EV_REJOIN_FAILED || event == EV_RESET || event == EV_JOIN_TXCOMPLETE)
    {
        ttn_event = TTN_EVENT_JOIN_FAILED;
    }
}

In fact, the LMIC follows the https://www.thethingsnetwork.org/docs/devices/bestpractices/ specification for best practices. The device should use JOIN very rarely.

That sounds like a misunderstanding. Best practices recommend avoiding rejoins by retaining the assigned DevAddr. But this case is about the initial join and in particular about the case where the join doesn't succeed. Failed joins don't count. This case is not covered in the best practices.

And best practices basically boil down to either not powering off your device or retaining the session settings including DevAddr. The former is out of LMIC's control, and the latter is not implemented. I had to go to some length to make it work anyway.

Your proposal of changing ttn_join() is basically to add an option to abort the join if the first try fails. There are many reasons why a join can fail: too high data rate, RF TX collision, radio disturbance etc. It's not reliable to detect if there is a gateway nearby. Thus I think aborting after just a single try will not be useful to many people.

The options I'm considering are:

  • Abort after the lowest data rate has failed (I think that's what the current implementation does but it takes very long)
  • Abort after a specified time
  • Abort after a specified number of tries

I will think about it.
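To make the last two options concrete, here is a rough sketch of how such a limit could be checked. Nothing in it exists in ttn-esp32: the struct, the function and the field names are hypothetical. It also assumes that each EV_JOIN_TXCOMPLETE is counted as one failed attempt and that the caller tracks the elapsed time itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical abort policy; none of these names exist in ttn-esp32. */
typedef struct {
    uint32_t max_attempts; /* 0 = no limit on the number of join attempts */
    uint32_t timeout_ms;   /* 0 = no time limit */
} join_abort_policy_t;

/* Returns true once either configured limit has been reached.
 * failed_attempts: number of join attempts that got no response so far.
 * elapsed_ms: time since the join was started. */
bool join_should_abort(const join_abort_policy_t *policy,
                       uint32_t failed_attempts, uint32_t elapsed_ms)
{
    if (policy->max_attempts != 0 && failed_attempts >= policy->max_attempts) {
        return true;
    }
    if (policy->timeout_ms != 0 && elapsed_ms >= policy->timeout_ms) {
        return true;
    }
    return false;
}
```

With a policy of { 5, 600000 }, for example, the join would be given up after 5 failed attempts or after 10 minutes, whichever comes first. A policy of { 0, 0 } never aborts, which matches the current behavior of retrying indefinitely.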

Hi guys, great conversation.

I have implemented an abort for the join process after 5 failed join attempts (I even change an LED to red to indicate this). It's a messy implementation, though.

Would be great to see this as a feature.

This may be a silly question that I can just look up, but while I'm here: Can we have the default join DR be the lowest DR, or an easy way to set it as that?

cdrx commented

How could we improve the library? Should we add a timeout parameter to the ttn_join() functions? If so, a realistic timeout is 10 minutes or more. Or should the function be changed to be asynchronous? It would make it easier to handle the error case but more difficult to handle the regular case.

An async version of ttn_join() would be really useful. Something like this:

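// ttn_join_async(), ttn_is_joined() and ttn_join_abort() are the proposed API
ttn_join_async();
uint8_t timer = 0;

// Poll once per second and give up after about 120 seconds
while (!ttn_is_joined()) {
    if (++timer > 120) {
        ttn_join_abort();
        break;
    }

    vTaskDelay(pdMS_TO_TICKS(1000));
}

if (ttn_is_joined()) {
    ESP_LOGI(TAG, "joined!");
}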

Would be ideal.

For my use case, the TTN provisioning is done by writing keys to the ESP over Bluetooth from a mobile app. If the user writes incorrect keys, ttn_join() is ultimately called but never returns (because the join will never succeed). If the user then updates the provisioning keys over the Bluetooth connection, I can't find a practical way to cancel the active ttn_join() and try again with the new keys.

Hi All,

I have implemented the Hello World test code and also get an infinite join loop. Occasionally I will see an Accept Join request on TTN but never any payload data. Serial monitor shows:
I (33376) ttn: event EV_TXSTART
I (38716) ttn: event EV_RXSTART
I (39716) ttn: event EV_RXSTART
I (39826) ttn: event EV_JOIN_TXCOMPLETE
I (40736) ttn: event EV_TXSTART

I am using AS923 on my gateway and node (TTN setup).
I am also using AS923 in the code, set via the settings menu.

Has anyone been able to get around this?

So I also managed to fix the issue by inserting the code below into TheThingsNetwork.cpp:

bool TheThingsNetwork::joinCore()
{
    if (!provisioning.haveKeys())
    {
        ESP_LOGW(TAG, "Device EUI, App EUI and/or App key have not been provided");
        return false;
    }

So the problem has been solved?

BTW: If the file TheThingsNetwork.cpp contains the method joinCore(), you are using an old version of the library. This method was removed more than a year ago.

Yes, it is solved, but only if I add the above code to TheThingsNetwork.cpp.

I downloaded the source code from here, so is it possible that I somehow have the old library? In my ignorance (I'm new to this) I thought the library was supplied within.

You have probably downloaded the code from the Releases. I have indeed not updated this for some time. Now it's up-to-date again.

You can either download it from the release page or with the green "Code" button on the home page.

Excellent, I will try this later today.

I had blindly followed the download link in the Getting Started guide (the PlatformIO one is the same; that's what I use):

https://github.com/manuelbl/ttn-esp32/archive/master.zip

Hi, I encountered the same issue of an infinite loop when testing the 'Hello World' example on a Heltec Wireless Bridge with an ESP32 and SX1276 transceiver. I've tried all the suggestions written in this forum, but without success. I receive random join requests, but they are not successful. I noticed that the RSSI is -110, but when I compile the code in Arduino with the Heltec library, the RSSI is -40. Thanks for your assistance.