Lora-net/LoRaMac-node

Very long duty cycle wait time in version 4.70

omogenot opened this issue · 21 comments

While implementing the latest LoRaMAC release (4.7.0), I noticed a strange behaviour in the duty-cycle management using EU868.

To reproduce it, make a node try to join unsuccessfully using OTAA (for instance, a node unknown to the network server). After 24 attempts, the LoRaMAC returns a status of LORAMAC_STATUS_DUTYCYCLE_RESTRICTED (11), which is expected.
As the comment for this status value states in the LoRaMac.h file:

/*!
     * An MCPS or MLME request can return this status. In this case,
     * the MAC cannot send the frame, as the duty cycle limits all
     * available bands. When a request returns this value, the
     * variable "DutyCycleWaitTime" in "ReqReturn" of the input
     * parameters contains the remaining time to wait. If the
     * value is constant and does not change, the expected time
     * on air for this frame is exceeding the maximum permitted
     * time according to the duty cycle time period, defined
     * in Region.h, DUTY_CYCLE_TIME_PERIOD. By default this time
     * is 1 hour, and a band with 1% duty cycle is then allowed
     * to use an air time of 36 seconds.
     */

I checked the DutyCycleWaitTime value in the ReqReturn element of the input parameters to find out how long I must sleep until the next attempt. I got a huge value of 38135106 ms, which is more than ten and a half hours! We are far from the expected 1 hour, and it apparently does not come from DUTY_CYCLE_TIME_PERIOD in Region.h but from MacCtx.DutyCycleWaitTime, which is either set to zero by RegionNextChannel (RegionEU868NextChannel in my case), which in turn calls RegionCommonIdentifyChannels, or left as it is. So where do these 38135106 ms come from?
Has anything changed there? I don't remember having this issue in 4.6.
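
For reference, this is roughly how I read that value on the application side (a minimal sketch; the join parameters themselves are omitted, and the field names are the ones documented in LoRaMac.h):

    MlmeReq_t mlmeReq;
    mlmeReq.Type = MLME_JOIN;
    // Join parameters (DevEui/JoinEui, datarate, ...) are set elsewhere.

    LoRaMacStatus_t status = LoRaMacMlmeRequest( &mlmeReq );
    if( status == LORAMAC_STATUS_DUTYCYCLE_RESTRICTED )
    {
        // Remaining time to wait before the next attempt, in milliseconds.
        TimerTime_t waitTimeMs = mlmeReq.ReqReturn.DutyCycleWaitTime;
        // Here I would expect at most ~1 hour, but I get 38135106 ms.
        ( void )waitTimeMs;
    }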

I thank you in advance for your help.

Olivier.

Thanks for the report; we will investigate this as soon as possible.

We have recently refactored the way the duty cycle is managed; please refer to commit b6f383c, which was intended to solve issue #1315.

We may have missed something during the refactoring. However, in our internal tests at the time we did not observe the issue that you are reporting.

I have just run a test session using the periodic-uplink-lpp example released with version 4.7.0 and I do not observe the issue you describe.

The node transmits the 24 JoinRequests and then restricts the duty cycle for the next 3414796 ms (~57 minutes).

Have you made modifications to the stack code?

###### =========== MLME-Request ============ ######
######               MLME_JOIN               ######
###### ===================================== ######
STATUS      : OK

###### =========== MLME-Confirm ============ ######
STATUS      : Rx 2 timeout
22
###### =========== MLME-Request ============ ######
######               MLME_JOIN               ######
###### ===================================== ######
STATUS      : OK

###### =========== MLME-Confirm ============ ######
STATUS      : Rx 2 timeout
23
###### =========== MLME-Request ============ ######
######               MLME_JOIN               ######
###### ===================================== ######
STATUS      : OK

###### =========== MLME-Confirm ============ ######
STATUS      : Rx 2 timeout
24
###### =========== MLME-Request ============ ######
######               MLME_JOIN               ######
###### ===================================== ######
STATUS      : Duty-cycle restricted
Next Tx in  : 3414796 [ms]

###### ============ CTXS STORED ============ ######
Size        : 668


###### =========== MLME-Request ============ ######
######               MLME_JOIN               ######
###### ===================================== ######
STATUS      : Duty-cycle restricted
Next Tx in  : 3408624 [ms]

@mluis1
Thanks for taking some time to test... I did not modify anything in the stack itself; I just use external board files to adapt it to my micro (SPI, timer, RTC, etc.).
How is this 3408624 computed? Is anything linked to the hardware you use (an RTC using the calendar, for instance)? If something has changed in the timer.h/timer.c area, that might explain the difference.

The last modification to the timer.h file was in October 2019, and to timer.c in December 2018.

The minTimeToWait variable, which is used to display "Next Tx in" in the log, is computed by the RegionCommonUpdateBandTimeOff function.

TimerTime_t RegionCommonUpdateBandTimeOff( bool joined, Band_t* bands,
                                           uint8_t nbBands, bool dutyCycleEnabled,
                                           bool lastTxIsJoinRequest, SysTime_t elapsedTimeSinceStartup,
                                           TimerTime_t expectedTimeOnAir )
{
    TimerTime_t minTimeToWait = TIMERTIME_T_MAX;
    TimerTime_t currentTime = TimerGetCurrentTime( );
    TimerTime_t creditCosts = 0;
    uint16_t dutyCycle = 1;
    uint8_t validBands = 0;

    for( uint8_t i = 0; i < nbBands; i++ )
    {
        TimerTime_t elapsedTime = TimerGetElapsedTime( bands[i].LastBandUpdateTime );
        // Synchronization of bands and credits
        dutyCycle = UpdateTimeCredits( &bands[i], joined, dutyCycleEnabled,
                                       lastTxIsJoinRequest, elapsedTimeSinceStartup,
                                       currentTime, elapsedTime );
        // Calculate the credit costs for the next transmission
        // with the duty cycle and the expected time on air
        creditCosts = expectedTimeOnAir * dutyCycle;
        // Check if the band is ready for transmission. Its ready,
        // when the duty cycle is off, or the TimeCredits of the band
        // is higher than the credit costs for the transmission.
        if( ( bands[i].TimeCredits > creditCosts ) ||
            ( ( dutyCycleEnabled == false ) && ( joined == true ) ) )
        {
            bands[i].ReadyForTransmission = true;
            // This band is a potential candidate for an
            // upcoming transmission, so increase the counter.
            validBands++;
        }
        else
        {
            // In this case, the band has not enough credits
            // for the next transmission.
            bands[i].ReadyForTransmission = false;

            if( bands[i].MaxTimeCredits > creditCosts )
            {
                // The band can only be taken into account, if the maximum credits
                // of the band are higher than the credit costs.
                // We calculate the minTimeToWait among the bands which are not
                // ready for transmission and which are potentially available
                // for a transmission in the future.
                TimerTime_t observationTimeDiff = 0;
                if( bands[i].LastMaxCreditAssignTime >= elapsedTime )
                {
                    observationTimeDiff = bands[i].LastMaxCreditAssignTime - elapsedTime;
                }
                minTimeToWait = MIN( minTimeToWait, observationTimeDiff );
                // This band is a potential candidate for an
                // upcoming transmission (even if its time credits are not enough
                // at the moment), so increase the counter.
                validBands++;
            }
        }
    }
    if( validBands == 0 )
    {
        // There is no valid band available to handle a transmission
        // in the given DUTY_CYCLE_TIME_PERIOD.
        return TIMERTIME_T_MAX;
    }
    return minTimeToWait;
}

The platform I used to run the test is a NucleoL476 + SX1261MBXBAS shield. The rtc-board.c file has not changed since March 2021.

@mluis1
Thanks again for these precisions.

However, I dove into the code and I suspect (although I have no proof) that the problem arises because this computation depends on elapsedTimeSinceStartup, which the LoRaMac derives from the group2 parameters read from non-volatile RAM. This value is initialised in LoRaMacInitialization, but the main application then calls NvmDataMgmtRestore (at least my code does), so the value is replaced by the one stored in NVRAM.
Therefore, the duty-cycle calculation may work if the unit is hardware reset, whereas there can be a value shift if the unit is only soft reset. In my case, the RTC uses a free-running timer that is NOT reset by a soft reset.
I propose to add the following code after the call to NvmDataMgmtRestore in the main application (in LmHandlerInit in your case):

    // Store the current initialization time
    MibRequestConfirm_t mibReq;
    mibReq.Type = MIB_NVM_CTXS;
    LoRaMacMibGetRequestConfirm( &mibReq );
    LoRaMacNvmData_t* nvm = mibReq.Param.Contexts;
    nvm->MacGroup2.InitializationTime = SysTimeGetMcuTime( );

This is a bit convoluted; alternatively, the same can be done at the end of the NvmDataMgmtRestore function itself, if that function is only supposed to be called once after the LoRaMac initialisation:

    if( NvmmRead( ( uint8_t* ) nvm, sizeof( LoRaMacNvmData_t ), 0 ) ==
                  sizeof( LoRaMacNvmData_t ) )
    {
        nvm->MacGroup2.InitializationTime = SysTimeGetMcuTime( );
        return sizeof( LoRaMacNvmData_t );
    }

I'll do some more testing...

Olivier.

@mluis1

Well, I have just finished a set of tests and I can confirm that the elapsedTimeSinceStartup value is the culprit. It worked in your case by chance: since this value is reset on each device reboot, the time spent in the startup process is more or less always the same as the value stored in the NVRAM.

My comment about this value, initialised by LoRaMacInitialization and then overwritten by NvmDataMgmtRestore, is still valid and should probably be fixed... I'll let you decide.

I did some extensive tests, and I got, I think, the expected results:

1/ A first unsuccessful join attempt leads to a duty-cycle wait of roughly 1 hour (minus the time spent performing the attempts).
2/ A second unsuccessful join attempt after this 1-hour wait leads to the famous 10+ hour wait (in fact 11 hours minus the time spent performing the attempts). This is expected.
After this 10-hour wait, I suspect the next wait would be 35 hours, as per the UpdateTimeCredits function located in the RegionCommon.c file. I did not test it because it is way too long to wait.

Olivier.

@mluis

I'm back with my understanding of this "black-out" feature.
As far as I can understand, the latest LoRaMac stack has introduced this 1 hour / 10 hour / 30 hour delay. Is this sequence part of any regional regulation, or is it an empirical choice of the team?
I understand where you are going with this feature: preventing RF pollution when a device tries to join the network upon first commissioning, fails to join for some time, and has its retry attempts delayed.
However, this defeats the device's ability to re-join after losing the connection with the server, as this 1/10/30 delay timer is never reset. In other words, should the device try to re-join after several days, it will be forced to wait 30+ hours after an unsuccessful join request, because the TimeCredits are based on elapsedTimeSinceStartup.
I think the whole time credit should be calculated from an elapsed time that can be reset by a function call when the unit starts trying to re-join the network. Doing so would allow a device to detect a network loss and initiate a re-join cycle (using the 1/10/30 feature) to recover from it.
Any comments are welcome.

Olivier.

mluis commented

@mluis1 and @mluis are indeed different people :)

@mluis @mluis1

Hey! Nice to meet you both 😄

Does either of you have an idea to help with the 1/10/30-hour delay issue?

I thank you in advance,

Olivier.

mluis1 commented

@omogenot
The described behavior has been present in the LoRaWAN specification since version 1.0.1. The chapter title is "Retransmissions back-off".

This project has implemented this behavior since LoRaWAN 1.0.1, as can be seen in the CHANGELOG.md.

In the LoRa Alliance technical committee we have held discussions on this subject, and for the next LoRaWAN specification version this chapter has been revised.

For instance the text:

Aggregated during the first hour following power-up or reset

becomes:

Aggregated during the first hour following the initiating event

Another difference will be that these rules will apply not only to the Join but to normal uplinks as well.

Getting back to our discussion: I agree that we should not store the InitializationTime in the NVM.
Could you please apply the patch below and verify that it solves the observed issue?

diff --git a/src/mac/LoRaMac.c b/src/mac/LoRaMac.c
index 03bf4e400..9668bcd9d 100644
--- a/src/mac/LoRaMac.c
+++ b/src/mac/LoRaMac.c
@@ -266,6 +266,12 @@ typedef struct sLoRaMacCtx
      * Buffer containing the MAC layer commands
      */
     uint8_t MacCommandsBuffer[LORA_MAC_COMMAND_MAX_LENGTH];
+    /*
+    * Stores the time at LoRaMac initialization.
+    *
+    * \remark Used for the BACKOFF_DC computation.
+    */
+    SysTime_t InitializationTime;
 }LoRaMacCtx_t;
 
 /*
@@ -837,7 +843,7 @@ static void ProcessRadioTxDone( void )
     // Update last tx done time for the current channel
     txDone.Channel = MacCtx.Channel;
     txDone.LastTxDoneTime = TxDoneParams.CurTime;
-    txDone.ElapsedTimeSinceStartUp = SysTimeSub( SysTimeGetMcuTime( ), Nvm.MacGroup2.InitializationTime );
+    txDone.ElapsedTimeSinceStartUp = SysTimeSub( SysTimeGetMcuTime( ), MacCtx.InitializationTime );
     txDone.LastTxAirTime = MacCtx.TxTimeOnAir;
     txDone.Joined  = true;
     if( Nvm.MacGroup2.NetworkActivation == ACTIVATION_TYPE_NONE )
@@ -2980,7 +2986,7 @@ static LoRaMacStatus_t ScheduleTx( bool allowDelayedTx )
     nextChan.AggrTimeOff = Nvm.MacGroup1.AggregatedTimeOff;
     nextChan.Datarate = Nvm.MacGroup1.ChannelsDatarate;
     nextChan.DutyCycleEnabled = Nvm.MacGroup2.DutyCycleOn;
-    nextChan.ElapsedTimeSinceStartUp = SysTimeSub( SysTimeGetMcuTime( ), Nvm.MacGroup2.InitializationTime );
+    nextChan.ElapsedTimeSinceStartUp = SysTimeSub( SysTimeGetMcuTime( ), MacCtx.InitializationTime );
     nextChan.LastAggrTx = Nvm.MacGroup1.LastTxDoneTime;
     nextChan.LastTxIsJoinRequest = false;
     nextChan.Joined = true;
@@ -3892,7 +3898,7 @@ LoRaMacStatus_t LoRaMacInitialization( LoRaMacPrimitives_t* primitives, LoRaMacC
     TimerInit( &MacCtx.ForceRejoinReqCycleTimer, OnForceRejoinReqCycleTimerEvent );
 
     // Store the current initialization time
-    Nvm.MacGroup2.InitializationTime = SysTimeGetMcuTime( );
+    MacCtx.InitializationTime = SysTimeGetMcuTime( );
 
     // Initialize MAC radio events
     LoRaMacRadioEvents.Value = 0;
diff --git a/src/mac/LoRaMac.h b/src/mac/LoRaMac.h
index e768d0926..992dbc000 100644
--- a/src/mac/LoRaMac.h
+++ b/src/mac/LoRaMac.h
@@ -696,12 +696,6 @@ typedef struct sLoRaMacNvmDataGroup2
      * Aggregated duty cycle management
      */
     uint16_t AggregatedDCycle;
-    /*
-    * Stores the time at LoRaMac initialization.
-    *
-    * \remark Used for the BACKOFF_DC computation.
-    */
-    SysTime_t InitializationTime;
     /*
      * Current LoRaWAN Version
      */

@mluis1

Thanks for the explanation. I understand the code modification that will move the InitializationTime value from the NVM to the RAM as we discussed before.
Now let me try to understand this "Retransmission Back-Off" period (let's call it BOP for Back-Off Period as a shorthand).
If I understand correctly, this feature is meant to manage abnormal "gossip" devices on the network, that is, a device that defeats the legal transmission duty cycle due to a hardware failure (a broken receive chain, for instance), a configuration error (too short a transmission period, since this BOP will apply to 'regular' uplinks as well), a missing or broken gateway, a network server configuration error, or even the device being banned by the network server. In this case, the "gossip" device would transmit periodically until it reaches the duty-cycle limit, wait some time until it is within the duty-cycle allowance again, and then transmit again up to the duty-cycle limit. This behaviour would pollute the network.
To avoid this situation, such a "gossip" device is penalised by adding a BOP on top of the wait needed for the duty cycle to be legal again. This penalty grows from 1 hour, to 10 (+ 1) hours, and finally to 24 (+ 10 + 1) hours ad vitam aeternam, to "isolate" this abnormal case. Is that right? Did I understand correctly?
If so, then the proposed fix would not work as expected. Since the InitializationTime variable is initialised only once, in the LoRaMacInitialization function, a device that worked correctly for several days or months and then became a "gossip" for whatever reason would immediately get the 24 (+ 10 + 1) hour penalty, because it started long ago. It would not "benefit" from the evolving penalty sequence of 1, 10, 24 hours as expected.
I would suggest initialising the InitializationTime variable with a sentinel value (0, for instance) and having the BOP algorithm return 0 as the penalty while InitializationTime is not set.
As soon as an abnormal situation is detected (an expected network answer not received, or the duty-cycle limit reached), the InitializationTime value is set to the current SysTimeGetMcuTime value. The BOP then starts applying the progressive penalty of 1, 10, 24 hours.
On the other hand, as soon as a "normal" situation is detected (a valid ACK or command received from the network for this device), the InitializationTime value is reset to 0, so that the next abnormal situation restarts the 1, 10, 24 hour penalty progression. In that case the InitializationTime variable should probably be renamed LastNetworkMsgTime or something similar.
Does it make sense? Would that serve the purpose of the "Retransmission back-off" chapter of the LoRaWAN spec?
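
To make the proposal concrete, here is a rough sketch of the intended behaviour (illustrative pseudo-C only; these are not existing stack functions):

    // Hypothetical reference time; 0 means "not armed", i.e. no back-off penalty.
    static SysTime_t BackoffRefTime = { 0 };

    // Called when an abnormal situation is detected (expected network answer
    // not received, or duty-cycle limit reached).
    void OnAbnormalSituation( void )
    {
        if( BackoffRefTime.Seconds == 0 )
        {
            BackoffRefTime = SysTimeGetMcuTime( );
        }
    }

    // Called when a valid ACK or command is received from the network.
    void OnNetworkActivity( void )
    {
        BackoffRefTime.Seconds = 0;
        BackoffRefTime.SubSeconds = 0;
    }

    // Elapsed time fed to the back-off computation: zero while not armed, so the
    // 1/10/24 hour penalty sequence only starts from the abnormal event.
    SysTime_t GetElapsedTimeForBackoff( void )
    {
        SysTime_t zero = { 0 };
        if( BackoffRefTime.Seconds == 0 )
        {
            return zero;
        }
        return SysTimeSub( SysTimeGetMcuTime( ), BackoffRefTime );
    }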

I'm working on a project where the network has temporary breakdowns (gateway hardware changes), and as soon as a device detects such a breakdown it has to perform a join automatically to recover (the devices are associated gateway by gateway in their configuration). As of today, with the latest LoRaMAC version (4.7.0), the device trying to re-join is always pushed back for 35 hours, which is way too long for the customer as a recovery plan (it was not the case with the previous LoRaMAC version), and is not logical if the expected behaviour is the one I described above.

Edit:
How does all this coexist with ADR when the device detects that it cannot reach the network and needs to fall back to DR_0 at full power?

Sorry for the long explanation; I wanted to be sure that we are talking about the same thing in terms of the BOP algorithm.

Do not hesitate to contact me should I be of any help.

Best Regards,

Olivier.

@mluis1

Following my previous message, I propose the following change (after having applied your suggested changes), in order to "reset" the initialisation time after having received a valid response from the network.
At the very end of the function ProcessRadioRxDone(), after the UpdateRxSlotIdleState( ); statement, I propose to add the following statements:

    if( ( MacCtx.MacState & LORAMAC_RX_ABORT ) == 0 )
    {
        MacCtx.InitializationTime = SysTimeGetMcuTime( );
    }

This would allow the BACKOFF_DC to be computed from the last successful communication received from the network. Would that work for you?

I see the same behaviour here (re-join process, spec 1.0.4). After some days of the device working normally, a gateway can break down and the device needs to check its connectivity. After the ADR back-off limit (reset to default channels), I started a new join request process and got a large duty-cycle-restricted penalty. I think a new join process should always restart the InitializationTime with the current SysTimeGetMcuTime. The re-join process is very difficult to use if we have such a hard penalty :/

I think I might be experiencing problems related to this since adopting 4.7.0 in our project.

We have some test devices which are reporting at a high uplink rate, and we have DCR enabled in the stack. This worked ok in prior versions of the stack (I think we skipped over 4.6).

Our devices (Class A/OTAA/EU868) are self-powered and never normally reset, and we do not store the stack state in Nvm (although we do store the join nonce so that should a reset occur, we can pick up with a new join).

Devices will always join after a reset/first power-on, and they track MCPS indications and perform periodic LinkCheckReq to validate the session/network reachability. We determined early on that there is no way to distinguish between a gateway/backhaul outage and a session invalidation on the network server, so when we detect a lack of RX activity beyond a threshold (~24 hours), we double-check with multiple zero-payload LinkCheckReq over time and ultimately fall back to a new join.
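
For reference, the link validation uses the stack's MLME link-check request, roughly like this (sketch only, error handling omitted):

    MlmeReq_t mlmeReq;
    mlmeReq.Type = MLME_LINK_CHECK;
    if( LoRaMacMlmeRequest( &mlmeReq ) == LORAMAC_STATUS_OK )
    {
        // The LinkCheckAns (demodulation margin, gateway count) is reported in a
        // later MLME confirm; no answer beyond our threshold eventually triggers
        // the fallback to a new join.
    }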

These new join attempts follow an exponential back-off that degrades the datarate from DR5 to DR2 over the first few attempts and increases the interval between them. Each attempt is randomised over an hour, so in aggregate they occur at roughly 30-minute intervals.
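
Roughly, in application-side pseudo-code (purely illustrative; these helpers do not exist under these names in the stack or in our firmware):

    // Datarate degrades DR5 -> DR2 over the first attempts, then stays at DR2.
    static uint8_t NextJoinDatarate( uint32_t attempt )
    {
        return ( attempt < 3 ) ? ( uint8_t )( 5 - attempt ) : 2;
    }

    // Base interval grows with each attempt (capped), then the attempt is
    // randomised over one hour, giving ~30 minute average spacing.
    static uint32_t NextJoinDelayMs( uint32_t attempt, uint32_t randomMs )
    {
        uint32_t exponent = ( attempt < 6 ) ? attempt : 6;
        uint32_t baseMs = ( 1u << exponent ) * 60u * 1000u;
        return baseMs + ( randomMs % ( 60u * 60u * 1000u ) );
    }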

There are two problems we have seen:

  1. When devices fall back to rejoin, the stack reports busy, seemingly indefinitely. We've left devices in this state for 5+ days and the stack always reports busy on each attempt. We have since added a LoRaMacReset() call when we fall back to rejoin, but have yet to verify whether this resolves the busy state.

  2. In these test devices, as mentioned, we have a high TX rate. The device sends 61-byte payloads every 10 seconds, plus a 10-byte payload every minute. The device reports DCR after around 40 minutes of activity, usually provoked by one of the larger payloads. It will then accept one of the smaller payloads (LoRaMacMcpsRequest returns OK), but the stack holds onto the message for the amount of time previously reported in DutyCycleWaitTime, which equates to ~20 minutes. Eventually the stack reports the MCPS confirm for that held packet. Meanwhile, the stack reports busy. I think this can lead to the first problem, because if this coincides with the device reverting to rejoin, the stack may be in a busy state, and even with the LoRaMacReset, the attempt to issue a subsequent join will conflict with the message being held for DCR reasons.

I think the stack accepts the smaller payload because the aggregate airtime in the DC period falls below the threshold, whereas the prior, larger packet does not.

It's not clear to me whether the stack accepting and holding the message for the DutyCycleWaitTime is intended functionality, but it is possibly interfering with how we have implemented our LinkCheckReq and rejoin logic to cope with external problems.
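
For what it's worth, our send path is essentially the following guard (a sketch from memory of the 4.7.0 API, so the exact LoRaMacMcpsRequest signature and the allowDelayedTx flag may differ):

    McpsReq_t mcpsReq;
    mcpsReq.Type = MCPS_UNCONFIRMED;
    // fPort / fBuffer / fBufferSize / Datarate are filled by the application.

    if( LoRaMacIsBusy( ) == true )
    {
        // Defer: the stack is still handling a previous request, possibly a
        // frame it is holding back for duty-cycle reasons.
    }
    else if( LoRaMacMcpsRequest( &mcpsReq, true ) == LORAMAC_STATUS_DUTYCYCLE_RESTRICTED )
    {
        // Re-schedule after the reported wait instead of retrying blindly.
        TimerTime_t waitMs = mcpsReq.ReqReturn.DutyCycleWaitTime;
        ( void )waitMs;
    }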

It's also not clear whether repeated attempts by the application to send either uplinks or joins would cause any busy/DCR state to be extended indefinitely, but it does seem that the stack can get into a state where repeated joins are never accepted. If there is now a 24 join attempt limit before the stack starts to enforce even longer DCR waits than we expect (especially if the wait is as excessively large as the original issue here suggests), and this conflicts with our own exponential back-off strategy, then we will obviously have to reconsider our approach.

I wonder if anyone can comment on what the intended behaviour is, and point to the documentation/change log if possible.

Extracts from our logs with timings:

2023-07-15 08:54:29.013: Sent: <<large payload>> DR5
2023-07-15 08:54:30.196: MCPS Confirm: OK DR5 Ch0 AirTime=139
2023-07-15 08:54:30.197: MCPS Indication: OK DR5 Port=0 Pending=No RxData=No Size=0 Rssi=-42 Snr=7 Slot=0 Ack=No FCnt=1586, DevTimeAns=No
2023-07-15 08:54:44.003: Send attempt: Duty Cycle Restricted for 1231001 msecs
2023-07-15 08:55:02.011: Sent: <<small payload>> DR5
<<the stack now seems to report busy although the logs don't capture the actual code, but the flow is consistent with busy handling>>
2023-07-15 09:15:17.305: MCPS Confirm: OK DR5 Ch5 AirTime=62
<<normal activity resumes until the next DCR reported>>

@mluis1 Would appreciate some feedback on this, as we are now considering reverting to a release prior to 4.7.0 for stability. There seems to have been little activity on this project in the last 6 months; is there something that we should be aware of?

Hello guys,
Agreed with everyone here, although I consider it fundamental to comply with the specification.
Now, it is clear that the spec in its current version mentions "following power-up or reset", and the reset case is not really discussed.
Is a reset a complete device reset, or maybe a stack reset?
The same goes for the envisaged modification replacing the power-up and reset cases with the even more confusing "initiating event".

I would consider a refresh of the InitializationTime, as proposed by @omogenot, safe and conservative enough to comply with both definitions of the spec (current and future).

Anyhow, moving the InitializationTime out of the NVM seems mandatory if we want to refresh it regularly; it also takes up space in the NVM for nothing.

@mluis1 can you please give us an update on the topic?

  • why not push the InitializationTime patch?
  • what about the discussion regarding a refresh of InitializationTime?

I might be wrong, but I ran some rejoin tests after calculating and implementing a compliant rejoin process, and it does not work as expected!
As a reminder, here is what is in the spec (except the last column, which is my interpretation):

| # | Aggregation window | Interval | Transmit time | Duty cycle | My interpretation |
|---|---|---|---|---|---|
| 1 | During the first hour following power-up or reset | T0 < t < T0+1 | < 36 s | 1% | 24 tries @ SF12 |
| 2 | During the next 10 hours | T0+1 < t < T0+11 | < 36 s | 0.1% | 24 tries @ SF12 |
| 3 | Over each 24-hour period after the first 11 hours | T0+11+N*24 < t < T0+35+N*24, N >= 0 | < 8.7 s | 0.01% | 5 tries @ SF12 |
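
Expressed as a helper (purely illustrative, not stack code), the aggregated transmit-time budget per window would be:

    // Aggregated time-on-air budget [ms] allowed in the current back-off window,
    // as a function of the time elapsed since T0 (power-up/reset), per the table above.
    static uint32_t BackoffTxBudgetMs( uint32_t hoursSinceT0 )
    {
        if( hoursSinceT0 < 1 )
        {
            return 36000; // first hour: 36 s
        }
        if( hoursSinceT0 < 11 )
        {
            return 36000; // next 10 hours: 36 s
        }
        return 8700;      // each following 24-hour period: 8.7 s
    }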

Now, from my calculation I am allowed 5 join request tries per 24-hour retransmission time period (I simply considered that if 24 tries fit within 36 seconds, then 8.7 seconds allows more than 5 but not 6 tries).
In my implementation I have been conservative, with a period of 6 hours, which makes 4 tries instead of 5 per 24-hour period.

The last two columns are my test session results.
Although I designed it to be compliant with a margin, I still get Duty Cycle Restricted errors in window indices 4 and 5, with NextTxIn calculated to almost 5 hours.

The test is still running.

| Join request index | Scheduled (min) after last request | Retransmission window | Index within window | Window boundary since T0 | Theoretical join requests in window | Test uptime (s) | Back-off required (h) |
|---|---|---|---|---|---|---|---|
| | | | | T0 | | | |
| 1 | 0 | 1 | 1 | | 24 | 1 | |
| 2 | 5 | 1 | 2 | | | | |
| 3 | 10 | 1 | 3 | | | | |
| 4 | 20 | 1 | 4 | | | 2124 | |
| | | | | T0+1 | | 3600 | |
| 5 | 30 | 2 | 1 | | 24 | 3932 | |
| 6 | 60 | 2 | 2 | | | 7540 | |
| 7 | 60 | 2 | 3 | | | 11147 | |
| 8 | 60 | 2 | 4 | | | 14755 | |
| 9 | 60 | 2 | 5 | | | 18363 | |
| 10 | 60 | 2 | 6 | | | 21971 | |
| 11 | 60 | 2 | 7 | | | 25579 | |
| 12 | 60 | 2 | 8 | | | 29186 | |
| 13 | 60 | 2 | 9 | | | 32794 | |
| 14 | 60 | 2 | 10 | | | 36402 | |
| | | | | T0+11 | | 39600 | |
| 15 | 60 | 3 | 1 | | 5 | 40010 | |
| 16 | 360 | 3 | 2 | | | 61618 | |
| 17 | 360 | 3 | 3 | | | 83225 | |
| 18 | 360 | 3 | 4 | | | 104833 | |
| | | | | T0+35 | | 126000 | |
| 19 | 360 | 4 | 1 | | 5 | 126441 | |
| 20 | 360 | 4 | 2 | | | 148049 | 4.989175278 |
| 21 | 360 | 4 | 3 | | | 169649 | |
| 22 | 360 | 4 | 4 | | | 191257 | |
| | | | | T0+35+24 | | 212400 | |
| 23 | 360 | 5 | 1 | | 5 | 212864 | |
| 24 | 360 | 5 | 2 | | | 234472 | |
| 25 | 360 | 5 | 3 | | | 256080 | |
| 26 | 360 | 5 | 4 | | | 277688 | 4.989175556 |
| | | | | T0+35+24*2 | | | |
| 27 | 360 | 6 | 1 | | 5 | 299288 | |
| 28 | 360 | 6 | 2 | | | 320895 | |
mluis1 commented

I am sorry for the slow pace in answering questions.

In general I agree with most of the provided comments.

As I have already stated, the current implementation is in accordance with the LoRaWAN 1.0.4 specification and passes the LoRa Alliance certification process (there is a specific test case for this behavior).

I will try to come up as soon as possible with a solution that suits most of the requests done here.

Keeping in mind that the final solution must:

  • Be LoRaWAN specification compliant
  • Pass the LoRa Alliance certification process. (Running this test is quite time consuming as it takes more than 1 day to run.)
  • Remove from the NVM the storage of the retransmission back off algorithm initial time (provided patch).
  • From the comments, the reference time of the retransmission back-off algorithm should be taken at the first JoinReq frame transmission, and the back-off algorithm should be re-initialized once a JoinAccept frame is received. A subsequent JoinReq restarts the process.

@Regimbal From our internal tests and the certification process the current implementation is believed to be correct.

The LoRaWAN 1.0.4 specification states the following

[image: "Retransmissions back-off" excerpt from the LoRaWAN 1.0.4 specification]

According to the specification we should end up with:

A JoinReq time on air at SF12 is 1.482 s; thus, for an allowed maximum transmit time of 36 s, we have the right to transmit 36/1.482 ≈ 24.3 times before restricting the duty cycle.

  • Power up the device and let it run for 1 hour and verify that only 24 JoinReq have been sent.
  • Let the device run for at least 10 hours and verify that only 48 JoinReq have been sent.
  • Let the device run for at least 24 hours and verify that only 53 (8.7s / 1.482s = ~5.8 -> 5) JoinReq have been sent.
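
As a quick cross-check of these figures (illustrative arithmetic only, using the 1.482 s time on air quoted above):

    #include <stdio.h>

    int main( void )
    {
        const double joinReqToa = 1.482;                  // JoinReq time on air at SF12 [s]
        int firstHour   = ( int )( 36.0 / joinReqToa );   // ~24 JoinReq
        int next10Hours = ( int )( 36.0 / joinReqToa );   // ~24 more -> 48
        int per24Hours  = ( int )( 8.7 / joinReqToa );    // ~5 more  -> 53
        printf( "%d / %d / %d\n", firstHour,
                firstHour + next10Hours,
                firstHour + next10Hours + per24Hours );   // prints "24 / 48 / 53"
        return 0;
    }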

@mluis1 at least we agree on the maths, and this is what I have been working on lately.
Except that I have my device do fewer requests than allowed, so I should definitely not fall into the Duty Cycle Restricted status.

Let me try to sum up my previous table, or at least comment on it:

  • Window 1: from power-up (T0) to hour 1 (T0+1), it does 4 join requests.
  • Window 2: from hour 1 (T0+1) to hour 11 (T0+11), it does 10 more requests, for a subtotal of 14.
  • Window 3: from hour 11 to 35: 4 requests out of the 5 allowed, which makes 18.
    The issue arises after this one, in window 4 and later.
  • Window 4: from hour 35 to 35+24: I get a Duty Cycle Restricted status on the 2nd request, which is the 20th cumulative request.
  • Window 5: from hour 35+24 to 35+24*2: I fall into Duty Cycle Restricted status on the 4th request of this window (out of 5 allowed).

What is weird is that:

  • the first Duty Cycle Restricted status I get comes 6 requests after the beginning of the first 24-hour window (the window starts with join request index 15 in my table and the DC-restricted status happens at index 20),
  • the second one arises 'again' 6 tries after the next successful request (index 21 vs index 26 in my table).

This makes me wonder whether the max credit is refreshed after each 24-hour period or not.

Have you run your tests over a period longer than 35 hours?

NB: Unfortunately, I accidentally unpowered the testing device, so I cannot confirm over the long run...

mluis1 commented

Please find attached a patch implementing what I described in my last post.

backoff_algorithm_update.patch.txt

Please make sure to reset the NVM memory once the updated firmware is flashed on the end-device, as the NVM memory layout has changed.

It would be nice if everyone could test these changes and verify that the fix suits all mentioned needs.

@Regimbal
I have not run the tests for periods longer than 24 hours.
Concerning the issue after 24 hours it is possible that we have missed something. Help on debugging this is welcome.
Unfortunately, I do not have the time to further analyze/check the issue after 24 hours or to run tests.

@mluis1
I'll try to re-run the test I started initially and see whether the issue is really deterministic before going further.
After that it's another story; I'll do what I can to help :)