[SOLVED] CubeCell AB01 and LoRaWAN: frame counter is not incrementing after few days of running time

superslot · December 4, 2020, 10:56am

@ernst.schulz : this bug is not related to TTN.

we are discussing the use of a CubeCell_AB01 node together with a well-known LoRa gateway: ChirpStack.io

you can use ChirpStack standalone … locally … or as a relay to the TTN network …

in my case it is just a local gateway collecting payloads from multiple nodes (including CubeCell devices)

Supporter · December 7, 2020, 3:18am

But we also don’t think the frame counter code is at fault. We tested LoRaWAN on the AB01 (15 s cycle, about 3 weeks of running time) and the frame counter went past 100,000 with no problem.
We also use the same LoRaMac code on the ESP32, where it has been working for about a year on a project.
When the frame counter error occurred, what was the value of the frame counter on the server?

Supporter · December 7, 2020, 3:27am

How did you install the CubeCell environment? From the Arduino IDE or from git?
And can you try the same code with no sensor?

superslot · December 7, 2020, 10:36am

I did install from Arduino.

One question on this: if I want to run out of the git repo is it sufficient to swap the folder under Arduino/packages/CubeCell with the one from git repo?

Also another update:
after some more debugging I think that your suggestion of a missed ACK might be the correct one.

I added, on purpose, a 200ms delay after the LoRaWAN.send() call in the loop from the example and I can replicate the same issue all the time: ChirpStack is refusing all the uploads AFTER THE VERY FIRST because the FCN is not incrementing.

I’ve added a MIN_DR=4 to the ChirpStack profile of my application (so I do not need to change the code in CubeCell ) … I’ll wait 2-3 days and report back.

Supporter · December 8, 2020, 9:17am

A delay can only be added before LoRaWAN.send(). If it is added after LoRaWAN.send(), the chip will be too late to process events, because pending events are processed in LoRaWAN.sleep().

Once ChirpStack has received a confirmed uplink for the first time, the server sends an ACK to the node. If the node misses the ACK, it resends the data several times (identical to the first transmission). Because the server has already received the data once, the repeated uplinks are treated as erroneous uplinks whose fcnt did not increase.
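
To make the rejection concrete, here is a minimal sketch of a server-side counter check. It is a simplified model and not ChirpStack's actual code; `acceptUplink` and its parameters are hypothetical names.

```cpp
#include <cstdint>

// Simplified model (assumption): the server accepts an uplink only if its
// frame counter is strictly greater than the last one it accepted.
// lastFCnt: counter of the last accepted uplink; fcnt: counter of the new one.
constexpr bool acceptUplink(uint32_t lastFCnt, uint32_t fcnt) {
    return fcnt > lastFCnt;
}
```

A repeated frame carries the same counter as the one the server has already accepted, so under this model `acceptUplink(42, 42)` is false, while the next fresh frame, `acceptUplink(42, 43)`, is true.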

superslot · December 9, 2020, 7:00am

yup, very well understood.

so given that the issue is in my code … here is the problem.

on my sensor I do not use LoRaWAN all the time. The sensor runs a few checks from time to time, and only when an event happens does it trigger a LoRaWAN cycle.

So I wanted to use the same example we have in the CubeCell repo … but I cannot use an endless while loop. Once I’m done sending data I need to quit the loop and continue doing the rest of the code.

Since we have a fairly cumbersome state machine between the LoRaWan code and the LoRaMAC code I could not find a “clean” way to signal that the send process was done and stop that loop.

so what I did is to add a timeout in the sleep state, see attached code (4s).

now this might be the problem in my code: from time to time I might miss the ack because after 4s I’m quitting the while loop … even if 4s seems sufficient to me for all the LoRaWan states that I’m facing (join, send, recv ack)

do you see a better way to make this while loop as a “self contained” call? exiting once the send/recv_ack is done?

or should I just wait a little bit more? 6s ?

#define LORAWAN_TIMEOUT (4*ONESEC_MSECS) // 4 s
void lorawanLoop()
{
  volatile bool loopDone = false;
  unsigned long loraWanTimeout = 0;
  unsigned long tmpTime = 0;

  bool send_f = false;

  /* make sure we don't alter the very first run of the LoRaWAN lib */
  if(!firstRun_f)
  {
  	/* check if we need to request time */
  	if(efslog_sys_flags & CMD_NTP_TIME_REQUEST_MSK)
  	{
  		timeReq_f = true;
  	}

      if((alarm_flag) || (0==(efslog_idx % LoraSendInterval)))
      {
        	LOG_MSGLN("[LoRa] Sending Data Now..."); 	
        	send_f = true;
  		LoRaWAN.stopNextPacketTimer();
  	}
  }
  else
  {
  	if(efslog_idx > LoraSendInterval)
  	{
  		firstRun_f = 0;
  	}
  }
  
  /*
   * inner LoRaWAN library loop
   */
  while(loopDone == false)
  {
  	switch( deviceState )
  	{
  		case DEVICE_STATE_INIT:
  		{			
  			LoRaWAN.init(loraWanClass,loraWanRegion);
  			deviceState = DEVICE_STATE_JOIN;
  			break;
  		}
  		case DEVICE_STATE_JOIN:
  		{	
  			LoRaWAN.join();
  			break;
  		}
  		case DEVICE_STATE_SEND:
  		{
  			LOG_MSGLN("[LoRaWAN: SEND] ");

  			if(timeReq_f)
  			{
  				appPort = DEVPORT;
        			TimerSysTime_t sysTimeCurrent = TimerGetSysTime( );
                  timeReq_f = false;
          		MlmeReq_t mlmeReq;  
          		mlmeReq.Type = MLME_DEVICE_TIME;
          		LoRaMacMlmeRequest( &mlmeReq );
        		}

        		appPort = DEVPORT;
  			prepareTxFrame( appPort );
  			LoRaWAN.send();
  			deviceState = DEVICE_STATE_CYCLE;

  			break;
  		}
  		case DEVICE_STATE_CYCLE:
  		{
  			// Schedule next packet transmission
  			txDutyCycleTime = appTxDutyCycle + randr( 0, APP_TX_DUTYCYCLE_RND );
  			LoRaWAN.cycle(txDutyCycleTime);
  			deviceState = DEVICE_STATE_SLEEP;
  			break;
  		}
  		case DEVICE_STATE_SLEEP:
  		{
  			/******** TIMEOUT CHECK & EXIT *******/
  			if(rtcTimeIsSync())
  			{
  				tmpTime = millis();
  				if(loraWanTimeout == 0)
  				{
  					loraWanTimeout = tmpTime;
  				}
  				else if( (tmpTime - loraWanTimeout) >= LORAWAN_TIMEOUT )
  				{
  					loopDone = true;
  				}
  			}

  			/******* SEND EVENT ************/
  			if (send_f) {
  				if (IsLoRaMacNetworkJoined) {
  					appPort = APPPORT;
  					if(prepareMyFrame(appPort)) {
  						LoRaWAN.send();
  					}
  				}
  				send_f = false;
  			}
  			/*****/

  			LoRaWAN.sleep();
  			break;
  		}
  		default:
  		{
  			deviceState = DEVICE_STATE_INIT;
  			break;
  		}
  	}
  }

}
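
A side note on the timeout check itself: the subtraction `tmpTime - loraWanTimeout` on unsigned values stays correct when the millisecond counter wraps around. A minimal sketch, assuming a 32-bit millisecond counter; `hasTimedOut` is a hypothetical helper and not part of the CubeCell library.

```cpp
#include <cstdint>

// True once at least timeoutMs milliseconds have passed since startMs.
// Unsigned subtraction is modular, so the result also holds when the
// counter has wrapped between startMs and nowMs.
constexpr bool hasTimedOut(uint32_t nowMs, uint32_t startMs, uint32_t timeoutMs) {
    return static_cast<uint32_t>(nowMs - startMs) >= timeoutMs;
}
```

On the device this would be called along the lines of `hasTimedOut(millis(), loraWanTimeout, LORAWAN_TIMEOUT)`.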

Supporter · December 9, 2020, 4:24am

It seems LoRaWAN.sleep() has been removed from case DEVICE_STATE_SLEEP? Without LoRaWAN.sleep(), LoRaMac events can’t be processed. If you don’t want to go into deep sleep, you can use Radio.IrqProcess( ) instead of LoRaWAN.sleep().

An easy way, I think, is to set confirmedNbTrials to 1, so that if the node misses an ACK it will not automatically resend; then the 4 s LORAWAN_TIMEOUT is enough.

superslot · December 9, 2020, 7:04am

mmm … no sorry just a copy&paste error. the sleep call is there (I basically used the code from the Lorawan interrupt example)

I would like to use confirmed messages and also use the repeat to be more robust. Is there a way to indicate to the state machine that we missed the ACK and so we have to repeat? If yes, I can simply reset the timeout to zero and that should work…

Supporter · December 9, 2020, 8:47am

Sorry, I forgot your payload length is 120 bytes; LORAWAN_TIMEOUT should be calculated (in seconds) as: Radio.TimeOnAir(MODEM_LORA,your_payload_length)/1000+2+4;
Radio.TimeOnAir(MODEM_LORA,your_payload_length) is the packet time in milliseconds, the 2 is the 2nd RxWindow delay, and the 4 is the delay before the repeat starts when an ACK is missed.
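
As a sketch of that calculation, kept in milliseconds so it can be compared directly with millis(): the helper name and the constants below are hypothetical, and the time on air is passed in as a parameter (on the device it would come from Radio.TimeOnAir(MODEM_LORA,your_payload_length)).

```cpp
#include <cstdint>

constexpr uint32_t kRx2DelayMs   = 2000; // 2nd RxWindow delay (the "2" above)
constexpr uint32_t kRetryDelayMs = 4000; // delay before a repeat starts (the "4" above)

// Time to wait for one confirmed uplink attempt, in milliseconds.
constexpr uint32_t confirmedTimeoutMs(uint32_t timeOnAirMs) {
    return timeOnAirMs + kRx2DelayMs + kRetryDelayMs;
}
```

For example, a hypothetical packet with 400 ms on air would give a timeout of 6400 ms under these assumptions.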

Can you post your whole .ino file? With only part of the code I can't tell.

superslot · December 9, 2020, 9:25am

excellent ! but I have a different (simpler?) idea …

in the code you can see that when I enter the DEVICE_STATE_SLEEP I initialize a loraWanTimeout counter based on millis() call.
in general, if there is no missed ack and no repeat, a 3s or 4s (as it is in my define) timeout count is sufficient (and indeed it works fine for a long time…)
but if I miss one ack … this is where I probably have the bug, since I will terminate the while loop after 4s (defined in LORAWAN_TIMEOUT) no matter what.

is there a flag, or a callback in the lora library that I can use to detect that there was a missed ack?

if I have this signal I can extend my counter adding 4s to my wait each time that I miss one ack…

thanks a lot for your help

Supporter · December 10, 2020, 1:53am

The function “static void McpsIndication( McpsIndication_t *mcpsIndication )” is called whenever a downlink is received, and we declare the weak function “void __attribute__((weak)) downLinkDataHandle(McpsIndication_t *mcpsIndication)”, which can be redefined; there is an example named LoRaWan_downlinkdatahandle.

I think you can try this: declare a flag, set it to false before LoRaWAN.send(), then redefine downLinkDataHandle() and set the flag to true inside it. If an ACK is received, the flag becomes true.

If the ACK is missed, a longer time is needed to finish the repetitions, depending on the value of confirmedNbTrials.

superslot · December 10, 2020, 7:24am

FANTASTIC!

this is what I was looking for…
I will implement, test and report back in few days.

superslot · December 10, 2020, 11:12am

Hi,
I tried your suggestion but downLinkDataHandle() does NOT trigger when you receive an ACK. I can get it to trigger by sending real data from the gateway to the device (a real downlink payload), but when there is only an ACK over the air I don't get any callback.

we are getting closer … this is definitely the kind of callback that we need to identify whether we got the ack or not … so maybe something similar could be added to the lower-level library, just to signal “ACK/NO_ACK” ??

thanks

superslot · December 10, 2020, 12:34pm

and also I think that we do need a callback to monitor the ACK failure even after the retry count is over.

if we don’t … how can we guarantee 24/7 functionality when the CONFIRMED_UPLOAD option is used??

In our LoRaWAN library, let’s say the following happens:

we use confirmed upload
we send data and we miss the ack
the lib does multiple retry but still loses the ack

so now the FCN on the device is wrong (not aligned anymore)… what do we do at the next time around?

if we have a callback to monitor that we missed all the acks we can trigger a re-join and start from scratch.

and this would happen also if the gateway goes down (for whatever reason).
when the gw comes back up … we missed the ack … and we start a re-join

===
please let me know if this makes sense to you or if I’m missing some functionality of the LoRaWAN.h library.

geppoleppo · December 11, 2020, 10:25am

I am joining the discussion because I have the same problem on 3 HTCC-AB02 nodes that read sensors on the ADC pins.
After a few days the nodes stop transmitting (even though the battery continues to discharge).
I still don’t understand what is going on, but the next time the bug occurs I will check on the LoRaWAN server whether packets are arriving with the wrong frame counter and being discarded.
I always send unconfirmed uplinks.
As the LoRaWAN server I don’t use TTN but a commercial product: https://www.resiot.io

superslot · December 12, 2020, 12:08pm

I think I have a possible workaround using the McpsConfirm() function in LoRaWan_APP.cpp.

this is triggered each time I get an ACK and so by monitoring (in my app) the FCN and ACK I can decide if I need to wait more in my loop for the next retry or if we missed all the acks and we need to trigger a JOIN (and this would take care also of the case of missing/rebooting gateway…)

I’m testing and I will update this thread in few days…

Supporter · December 13, 2020, 10:00am

Sorry, I made a mistake. The flag should be set in “McpsIndication( McpsIndication_t *mcpsIndication )”. downLinkDataHandle() only runs if the received downlink contains data.

superslot · December 18, 2020, 5:10pm

Ok, after a week of testing 24/7 we can probably claim there is a workaround for this.

I’m putting my notes here in case someone needs a hint; there is probably a cleaner way of integrating these features into the mainline code.

in LoRaWan_APP.cpp I use an external function to report back to my app when an ACK is received (this could be a weak symbol if we had a formal API for a call like this…)

extern void myLoRaWanFCNCheck(uint32_t currFCN, bool ackReceived, uint8_t NbRetries);

in LoRaWan_APP.cpp I added the call into the McpsConfirm() function like this:

static void McpsConfirm( McpsConfirm_t *mcpsConfirm )
{
	if( mcpsConfirm->Status == LORAMAC_EVENT_INFO_STATUS_OK )
	{
		switch( mcpsConfirm->McpsRequest )
		{
			case MCPS_UNCONFIRMED:
			{
				// Check Datarate
				// Check TxPower
				break;
			}
			case MCPS_CONFIRMED:
			{
				/* report the FCN, ACK status and retry count back to the app */
				myLoRaWanFCNCheck( mcpsConfirm->UpLinkCounter, mcpsConfirm->AckReceived, mcpsConfirm->NbRetries);
				// Check Datarate
				// Check TxPower
				// Check AckReceived
				// Check NbTrials
				break;
			}
			case MCPS_PROPRIETARY:
			{
				break;
			}
			default:
				break;
		}
	}
	nextTx = true;
}

As I mentioned before for my specific application I do not need to send messages at every wakeup of the device. I send messages only in specific situations and so I needed a way to terminate the code loop that we have in the default lorawan example (see example of sending messages with interrupt).

I’m using confirmed messages and so we need a way to either wait or terminate the loop based on the ACK/NO_ACK feedback.

If we missed one ack … we can wait until all the retries are done. If there are no more retries … and we did not receive any ack … we trigger a re-join, which will take care of setting the FCN to zero.

This also should take care of the use case when the gateway router goes down and eventually comes back and we need to re-authenticate to it.
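
The wait / extend / re-join decision described above can be isolated as a small pure function, which makes it easy to check on a PC. This is only a sketch: `LoopAction` and `nextAction` are hypothetical names and not part of the CubeCell library.

```cpp
#include <cstdint>

// Possible outcomes of one pass through the sleep state (hypothetical names).
enum class LoopAction : uint8_t { Exit, KeepWaiting, ExtendTimeout, Rejoin };

// ackReceived: an ACK arrived for the last confirmed uplink
// timedOut:    the current wait window has elapsed
// retriesLeft: how many more wait windows we are willing to grant
constexpr LoopAction nextAction(bool ackReceived, bool timedOut, int8_t retriesLeft) {
    return ackReceived       ? LoopAction::Exit          // done, leave the loop
         : !timedOut         ? LoopAction::KeepWaiting   // still inside the window
         : (retriesLeft > 0) ? LoopAction::ExtendTimeout // a retry may still be pending
                             : LoopAction::Rejoin;       // all retries missed
}
```

In the loop this maps to: Exit sets loopDone, ExtendTimeout decrements the retry budget and restarts the timer, and Rejoin sends the state machine back to DEVICE_STATE_INIT.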

here is my version of the code in my file (called myLoRaWAN.ino)

void lorawanLoop()
{
volatile bool loopDone = false;
unsigned long loraWanTimeout = 0;
unsigned long loraWanTimeoutMax = (4*ONESEC_MSECS);
unsigned long tmpTime = 0;

bool send_f = false;


/*
 * inner LoRaWAN library loop
 */
ackReceived = true;              /* cleared (set to false) before each send call           */
ackWait     = confirmedNbTrials; /* by default we wait until ack is received or quit  */
while(loopDone == false)
{
	switch( deviceState )
	{
		case DEVICE_STATE_INIT:
		{			
			loraWanTimeout = 0;

			//printDevParam();
			LoRaWAN.init(loraWanClass,loraWanRegion);
			deviceState = DEVICE_STATE_JOIN;
			break;
		}
		case DEVICE_STATE_JOIN:
		{
			loraWanTimeout = 0; 

			LOG_MSGLN("[LoRaWAN: Join] ");			
			LoRaWAN.join();
			break;
		}
		case DEVICE_STATE_SEND:
		{
			loraWanTimeout = 0;

			LOG_MSGLN("[LoRaWAN: SEND] ");

			if(timeReq_f)
			{
				appPort = DEVPORT;
      			TimerSysTime_t sysTimeCurrent = TimerGetSysTime( );
      			LOG_PRINTF("[TIME] Current Unix time:%u.%d\r\n",(unsigned int)sysTimeCurrent.Seconds, sysTimeCurrent.SubSeconds);
                timeReq_f = false;
        		MlmeReq_t mlmeReq;  
        		mlmeReq.Type = MLME_DEVICE_TIME;
        		LoRaMacMlmeRequest( &mlmeReq );
      		}

      		appPort = DEVPORT;
			prepareTxFrame( appPort );

     		ackReceived = false;
			LoRaWAN.send();
			loraWanTimeout = 0;
			deviceState = DEVICE_STATE_CYCLE;

			break;
		}
		case DEVICE_STATE_CYCLE:
		{
			loraWanTimeout = 0;

			// Schedule next packet transmission
			txDutyCycleTime = appTxDutyCycle + randr( 0, APP_TX_DUTYCYCLE_RND );
			LoRaWAN.cycle(txDutyCycleTime);
			deviceState = DEVICE_STATE_SLEEP;
			break;
		}
		case DEVICE_STATE_SLEEP:
		{
			/*******/
		      if (send_f) {
		        if (IsLoRaMacNetworkJoined) {
				  	appPort = APPPORT;
		          	if(prepareMyFrame(appPort)) {
		          		ackReceived = false;
		            	LoRaWAN.send();
		            	loraWanTimeout = 0;
		          	}
		        }
		        send_f = false;
		      }
		      /*****/

			if(rtcTimeIsSync())
			{
				tmpTime = millis();
				if(loraWanTimeout == 0)
				{
					loraWanTimeout = tmpTime;
				}

				if(ackReceived)
				{
					//Serial.println("ACK RECEIVED or NO_ACK_REQ, quit the loop.");
					loopDone = true;
				}					
				else if( (tmpTime - loraWanTimeout) >= LORAWAN_TIMEOUT )
				{
					if(ackWait > 0)
					{
						//Serial.println("ACK MISSED, timeout extend.");

						ackWait--;
						loopDone = false;
						loraWanTimeout = 0;
					}
					else
					{
						//Serial.println("ACK TOTALLY MISSED, re-join ");

						/* this will trigger a re-join when we miss ack & send again */
						deviceState = DEVICE_STATE_INIT;
						send_f = true;
						loopDone = false;	
					}
				}
			}

			LoRaWAN.sleep();
			break;
		}
		default:
		{
			deviceState = DEVICE_STATE_INIT;
			break;
		}
	}
}

}

and of course at the beginning of the file the flags are initialized as follows:

/* flag to request time */
static bool timeReq_f = 1;
static bool firstRun_f = 1;

/* handles missing ack */
static int8_t ackWait = 0;
static bool ackReceived = false;
static uint32_t ackCurrFCN = 0;

superslot · December 18, 2020, 5:13pm

note: not sure how to better format the code on the post …

the rest of the file is identical to the LoRaWAN example.

I tested this workaround in parallel with the original code: while the original one did miss the ACK (causing the issue in the topic above), this one never had any frame counter problem while using CONFIRMED messages throughout.