Unpredictable System Hangs

I could sure use feedback from a more experienced developer…

Use-Case…
I needed to get a remote building (Well Shed) connected to my Cabin based internet server over a long distance with no line of site so I built a radio WiFi relay tower and installed some Ubiquiti equipment to link the two buildings together. This works fine. The relay tower is solar based and has a lead acid battery powering it. But, it is in a remote location and difficult to access. I want to monitor the battery state over time to make sure the small solar panel can keep it charged properly with solar blocking weather. And I don’t want to add an ethernet switch to provide internet connectivity at the tower as that would just put another (unwanted) load on the solar system there. So, LoRa to the rescue…

I’ve built two Heltec ESP32-LoRa devices (v2). One, at the relay tower, has voltage and amperage sensors attached that monitor the battery and send out LoRa messages on a periodic basis. The other is at the cabin and receives the LoRa messages, parses them, creates MQTT messages, attaches to my local WiFi, and posts the MQTT messages to a local MQTT broker running in the cabin. Then, a RPi reads the MQTT messages, formats a SQL insert statement, and process it into a local MariaDB table.

I have Grafana server running in the cabin that collects data from the MariaDB table and formats a nice “system status” web page that shows the status of the relay tower battery as well as 5 other solar systems running on my property. It all works just fine, except…

The Heltec device located at the relay tower randomly hangs up, and I have no idea why. I’ve done a fair amount of troubleshooting and can’t find the cause of the hangs. So I’ve added a watch dog process to the code running on the Heltec device. That at least reboots the device (so far) and keeps things running.

But, I am not happy with that process, just like I used to hate the ctl-alt-del “solution” to WinDoze issues, which I solved by going to Unix based systems for most of my work. Anyway…

I use the Arduino IDE (v2) to develop my code, so it’s possible that could be causing issues with the ESP32-LoRa device, but I doubt it. It’s more likely some issue with a memory leak in my code, or some funky pin issue. Or, I could just have some bug in the code I can’t see.

So, if you’ve read this far…HELP! Below is the code I’m running, and a sample of the serial monitor output I get when debugging. If you see something in either that makes you suspicious PLEASE let me know. I could really use the help.

It seems to me that something is going off the rails after I execute the Amp read code block, so it could be in there, or in the radio send code block, I just can’t tell…

Also, one more general question…at the end of the code there is a statement (Radio.IrqProcess()) that triggers the interrupt process that is intended to trap the result of the transmission. This then invokes the handler routines to process things and put the ESP32-LoRa into a light sleep state to save battery power. This line seems to be in the wrong place to me (I got it from an example). I think it should actually go right before or after the Radio.Send command. But if I move the line to either place the OnTxDone block never gets called. Does the Radio.IrqProcess() function need to be called all the time, or at least for every send process? Or does it simply turn on the interrupt process initially?

Code Block
/* This code reads a Volt sensor and an Amp sensor attached to the Relay Tower battery,
* formats the results, and sends out a LoRa packet that should be consumed and processed
* by a sister LoRa device (using WiFi & LoRa) which will receive the packet, generate
* an MQTT message, and send it to the MFR MQTT server. Then, a script running on a RPi
* should fire and process the MQTT message and insert the data into an MFR MariaDB
* datebase. The data is then processed by a MFR Grafana server which displays the status
* of the Relay Tower battery powering the internet relay going to the Well Shed.
*/

#include "LoRaWan_APP.h"
#include "Arduino.h"
#include <WiFi.h>
#include <esp_sleep.h>
#include <esp_task_wdt.h>

#define LORA_FIX_LENGTH_PAYLOAD_ON false        // Payload length varies
#define LORA_IQ_INVERSION_ON false              //
#define RF_FREQUENCY 915000000                  // Hz
#define TX_OUTPUT_POWER 14                      // dBm
#define LORA_BANDWIDTH 0                        // [0: 125 kHz, 1: 250 kHz, 2: 500 kHz, 3: Reserved]
#define LORA_SPREADING_FACTOR 7                 // [SF7..SF12]
#define LORA_CODINGRATE 1                       // [1: 4/5, 2: 4/6, 3: 4/7, 4: 4/8]
#define LORA_PREAMBLE_LENGTH 8                  // Same for Tx and Rx
#define LORA_TX_TIMEOUT 3000                    // Tx timeout milliseconds
#define BUFFER_SIZE 30                          // Define the payload size here
#define WDT_TIMEOUT 600                         // Watch Dog Timeout in seconds, make sure > sleepTime!
#define CONFIG_ESP_SYSTEM_PANIC_PRINT_REBOOT 1  // Should provide a dump if there's a system panic
int esp_bluedroid_disable();                    // BLE OFF, Save Power I Hope
int esp_bluedroid_deinit();                     // BLE OFF, Save Power I Hope
int esp_bt_controller_disable();                // BLE OFF, Save Power I Hope
int esp_bt_controller_deinit();                 // BLE OFF, Save Power I Hope
int esp_wifi_stop();                            // WiFi OFF, Save Power I Hope
char txpacket[BUFFER_SIZE];                     // transmission send packet size
const int sleepTime = 5;                        // Time to sleep (in minutes)
int volt = 0;                                   // read the sensor input
double voltage = 0;                             // map 0-4095 (12bit) to 0-1730 (10bit) adjusted
int voltAnalogInputPin = 36;                    // Pin to measure Voltage Value
int currentAnalogInputPin = 37;                 // Pin to measure Current Value
int calibrationPin = 38;                        // Pin to calibrate offset middle value
float manualOffset = 0.00;                      // Key in value to manually offset the initial value
float mVperAmpValue = 62.5;                     // 10A 2.5V +/- 0.625V Hall Sensor, mVperAmp = 625mV / 10A = 62.5mV/A
float supplyVoltage = 3300;                     // Pin maximum voltage, Uno/Mega 5000mV, Nano/Node MCU 3300mV
float offsetSampleRead = 0;                     // read the value of a sample for offset purpose later
float currentSampleRead = 0;                    // read the value of a sample including currentOffset1 value
float currentLastSample = 0;                    // count time for sample. Every 1 milli second 1 sample is taken
float currentSampleSum = 0;                     // accumulation of sample readings
float currentSampleCount = 0;                   // count number of samples
float currentMean;                              // calculate the average value from all samples, in analog values
float RMSCurrentMean;                           // square roof of currentMean, in analog values
float FinalRMSCurrent;                          // the final RMS current reading
bool ampLoop = true;                            // Amp reading loop
bool lora_idle = true;                          // LoRa Idle flag
static RadioEvents_t RadioEvents;

/*
  In this config I've set the WDT to 10 minutes and the ESP32 sleep timer to 5 minutes. This
  is done so I can try to detect how often the ESP32 is hanging and requiring a WDT reboot.
  The other end of the LoRa communication gets the message sent from this unit and loads it
  into a DB. I can read the DB and find gaps between messages being send that are greater than
  the normal 5 minutes, and those are the system hang-reboot counts...
*/

void OnTxDone(void) {
  Serial.print("Packet sent ... ");
  delay(100);
  lora_idle = true;
  sleepLightly();
}

void OnTxTimeout(void) {
  Serial.println("Packet Timed Out......");
  delay(100);
  lora_idle = true;
  sleepLightly();
}

void sleepLightly() {
  // Go into Light Sleep mode to save battery
  esp_sleep_enable_timer_wakeup(sleepTime * (60 * 1000000));  // 1000000 microseconds = 1 second
  delay(100);
  Serial.println("Nap time!\r\n");
  delay(100);
  esp_light_sleep_start();
  esp_sleep_wakeup_cause_t wakeup_reason;
  wakeup_reason = esp_sleep_get_wakeup_cause();
  switch (wakeup_reason) {
    case ESP_SLEEP_WAKEUP_EXT0: Serial.print("Wakeup external RTC_IO"); break;
    case ESP_SLEEP_WAKEUP_EXT1: Serial.print("Wakeup external RTC_CNTL"); break;
    case ESP_SLEEP_WAKEUP_TIMER: Serial.print("Wakeup by sleep timer"); break;
    case ESP_SLEEP_WAKEUP_TOUCHPAD: Serial.print("Wakeup by touchpad"); break;
    case ESP_SLEEP_WAKEUP_ULP: Serial.print("Wakeup by ULP program"); break;
    default: Serial.printf("Unexpected wakeup was : %d", wakeup_reason); break;
  }
  esp_task_wdt_reset();
  Serial.println(" ... WDT Reset.");
}

void setup() {
  Serial.begin(115200);
  delay(2500);  // If using the Arduino serial monitor gives me time to clear it.
  Serial.println("-----------------\r\n| System Booted |\r\n-----------------");
  pinMode(voltAnalogInputPin, INPUT);
  pinMode(currentAnalogInputPin, INPUT);
  pinMode(calibrationPin, INPUT);
  esp_task_wdt_init(WDT_TIMEOUT, true);  //enable panic so ESP32 restarts
  esp_task_wdt_add(NULL);                //add current thread to WDT watch
  Serial.println("WDT Started...");
  Mcu.begin();                          // Initiate the MCU
  RadioEvents.TxDone = OnTxDone;        // Link Tx result stub routine
  RadioEvents.TxTimeout = OnTxTimeout;  // Link Tx timeout stub routine
  Radio.Init(&RadioEvents);
  Radio.SetChannel(RF_FREQUENCY);
  Radio.SetTxConfig(MODEM_LORA, TX_OUTPUT_POWER, 0, LORA_BANDWIDTH, LORA_SPREADING_FACTOR,
                    LORA_CODINGRATE, LORA_PREAMBLE_LENGTH, LORA_FIX_LENGTH_PAYLOAD_ON,
                    true, 0, 0, LORA_IQ_INVERSION_ON, LORA_TX_TIMEOUT);
}

void loop() {
  if (lora_idle == true) {
    // Read voltage
    Serial.println("Reading Voltage...");
    volt = analogRead(voltAnalogInputPin);  // read the voltage sensor input
    voltage = map(volt, 0, 4095, 0, 1725);  // this needs tweaking for proper calibration
    voltage /= 100;                         // divide by 100 to get decimal values
    //Read the Amperage
    ampLoop = true;
    Serial.println("Reading Amps...");
    while (ampLoop) {
      if (micros() >= currentLastSample + 200)  // every 0.2 milli second taking 1 reading
      {
        currentSampleRead = analogRead(currentAnalogInputPin) - analogRead(calibrationPin);  // read sample value including offset value
        currentSampleSum = currentSampleSum + sq(currentSampleRead);                         // accumulate total values for each readings
        currentSampleCount = currentSampleCount + 1;                                         // count and move on
        currentLastSample = micros();                                                        // reset the time
      }
      if (currentSampleCount >= 2500)  // 2500 samples (200*2500=500k milli seconds aka 0.5 seconds)
      {
        currentMean = currentSampleSum / currentSampleCount;                                           // average accumulated analog values
        RMSCurrentMean = sqrt(currentMean);                                                            // square root of the average value
        FinalRMSCurrent = (((RMSCurrentMean / 4095) * supplyVoltage) / mVperAmpValue) - manualOffset;  // calculate the final RMS current
        currentSampleSum = 0;                                                                          // reset accumulate sample values for the next cycle
        currentSampleCount = 0;                                                                        // reset number of sample for the next cycle
        ampLoop = false;
      }
    }
    sprintf(txpacket, "Volts:%0.2f Amps:%0.2f", voltage, FinalRMSCurrent);  // create a packet
    Serial.printf("Sending packet \"%s\" ... ", txpacket);
    Radio.Send((uint8_t *)txpacket, strlen(txpacket));  // send the LoRa package out
    lora_idle = false;
  }
  Radio.IrqProcess();  // Capture Tx sent status
}

Serial Monitor Output

10:19:57.231 -> -----------------
10:19:57.231 -> | System Booted |
10:19:57.231 -> -----------------
10:19:57.231 -> WDT Started...
10:19:57.396 -> Reading Voltage...
10:19:57.396 -> Reading Amps...
10:19:58.352 -> Sending packet "Volts:13.13 Amps:0.10" ... Packet sent ... Nap time!
10:19:58.618 -> 
10:24:57.018 -> Wakeup by sleep timer ... WDT Reset.
10:24:57.062 -> Reading Voltage...
10:24:57.062 -> Reading Amps...
10:24:57.909 -> Sending packet "Volts:13.29 Amps:0.09" ... Packet sent ... Nap time!
10:24:58.172 -> 
10:29:57.281 -> Wakeup by sleep timer ... WDT Reset.
10:29:57.281 -> Reading Voltage...
10:29:57.281 -> Reading Amps...
10:29:58.070 -> Sending packet "Volts:13.13 Amps:0.10" ... Packet sent ... Nap time!
10:29:58.332 -> 

Bla, Bla, Bla…

11:34:59.570 -> Wakeup by sleep timer ... WDT Reset.
11:34:59.570 -> Reading Voltage...
11:34:59.570 -> Reading Amps...
11:44:59.583 -> E (631250) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
11:44:59.583 -> E (631250) task_wdt:  - loopTask (CPU 1)
11:44:59.583 -> E (631250) task_wdt: Tasks currently running:
11:44:59.583 -> E (631250) task_wdt: CPU 0: IDLE
11:44:59.649 -> E (631250) task_wdt: CPU 1: loopTask
11:44:59.649 -> E (631250) task_wdt: Aborting.
11:44:59.649 -> 
11:44:59.649 -> abort() was called at PC 0x400de375 on core 0
11:44:59.649 -> 
11:44:59.649 -> 
11:44:59.649 -> Backtrace:0x400839f9:0x3ffbe8ec |<-CORRUPTED
11:44:59.649 -> 
11:44:59.649 -> 
11:44:59.649 -> 
11:44:59.649 -> 
11:44:59.649 -> ELF file SHA256: 0000000000000000
11:44:59.649 -> 
11:44:59.649 -> Rebooting...
11:44:59.649 -> ets Jun  8 2016 00:22:57
11:44:59.649 -> 
11:44:59.649 -> rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
11:44:59.649 -> configsip: 0, SPIWP:0xee
11:44:59.649 -> clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
11:44:59.652 -> mode:DIO, clock div:1
11:44:59.652 -> load:0x3fff0030,len:1344
11:44:59.652 -> load:0x40078000,len:13864
11:44:59.652 -> load:0x40080400,len:3608
11:44:59.652 -> entry 0x400805f0
11:45:02.293 -> -----------------
11:45:02.293 -> | System Booted |
11:45:02.293 -> -----------------
11:45:02.294 -> WDT Started...
11:45:02.460 -> Reading Voltage...
11:45:02.460 -> Reading Amps...
11:45:03.387 -> Sending packet "Volts:13.22 Amps:0.11" ... Packet sent ... Nap time!
11:45:03.652 ->

Well, well…I may have found the issue. In the ampLoop process I use micros() to deal with the timing of my amp read accumulation. Turns out that there may be an issue with its reliability on ESP32 vs Arduino. In the Arduino world micros() overflows after about 70 minutes and resets to zero! And guess what, that’s about the time my Heltec hangs up and reboots. I will explore this and see if there is a better timer on the ESP32, or I’ll time things in a different way…I’ll post my results…

So there may be other issues, but changing the loop micros() check to the following resolved the issue…

if ((unsigned long)micros() - time_now >= 200)

The micros() counter was hitting its max and reseting which was causing the routine to hang, which triggered the WDT. So, I’ve learned something about cloning ESP32 code into an Arduino IDE environment…where these kinds of things are a little different. Solved…

2 Likes