FreeRTOS/iot-reference-esp32

[BUG] coreMQTT keep alive handling fails and never reconnects

Closed this issue · 51 comments

System information

  • Hardware board: [ esp32s3 (Seeed Xiao) ]
  • IDE used: [ VSCode ]
  • Operating System: [ macOS ]
  • Code version: [ v202212.00-23-gd25036b ]
  • Project/Demo: [ Temperature LED demo with AWS pub/sub ]

Expected behavior
Expected behavior would be for the MQTT subsystem to continue retries until reconnected.
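
As a rough illustration of that expectation, the sketch below retries a connection indefinitely with a capped exponential backoff. It is a minimal sketch under assumptions: the callback types, the function names, and the two backoff constants are hypothetical and chosen only for illustration. Nothing here is taken from coreMQTT or from this repository's agent manager.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback types for illustration; not part of coreMQTT. */
typedef bool (*connect_fn_t)(void *ctx);   /* returns true once connected */
typedef void (*delay_fn_t)(uint32_t ms);   /* blocks for the given time   */

/* Assumed values, not taken from the demo's configuration. */
#define BACKOFF_BASE_MS 500U
#define BACKOFF_MAX_MS  32000U

/* Double the previous delay, saturating at BACKOFF_MAX_MS.
 * A previous delay of 0 means "first failure" and yields the base delay. */
uint32_t next_backoff_ms(uint32_t current_ms)
{
    if (current_ms == 0U) {
        return BACKOFF_BASE_MS;
    }
    if (current_ms >= BACKOFF_MAX_MS / 2U) {
        return BACKOFF_MAX_MS;
    }
    return current_ms * 2U;
}

/* Keep trying until try_connect() succeeds; never gives up.
 * Returns the number of attempts that were needed. */
uint32_t reconnect_until_success(connect_fn_t try_connect, void *ctx,
                                 delay_fn_t delay)
{
    uint32_t attempts = 0U;
    uint32_t wait_ms = 0U;

    for (;;) {
        attempts++;
        if (try_connect(ctx)) {
            return attempts;
        }
        wait_ms = next_backoff_ms(wait_ms);
        delay(wait_ms);
    }
}
```

Capping the delay rather than the attempt count is the design choice that matters here: the device slows down its retries during a long outage but never stops trying.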

Screenshots or console output
image

Code to reproduce bug

idf.py -p /dev/cu.usbmodem14101 flash monitor

Hey @lhammond, thanks for submitting this issue!
It appears many people are running into this issue, since #41, #45, #46, and #47 all appear to be similar issues with the coreMQTT connection to the AWS IoT broker.

I've reached out to the team that works directly with coreMQTT and with the ESP32 boards to see if we can get to the bottom of what is causing these issues.

@Skptak Is there anything we can do to help push towards resolution? Should I be monitoring this situation in another place? Thank you!

@lhammond,
Can you please give a bit more insight into the problem? What version of esp-idf are you using? Have you flashed your credentials to the device, and are your Thing name and endpoint set for the correct account? Did you provision your device?

I was not able to reproduce your problem. Here is a partial screenshot of my logs:
Screenshot 2023-09-11 at 4 44 40 PM

For your reference, I followed this README: https://github.com/FreeRTOS/iot-reference-esp32c3/blob/2dccbcad1a0e54ec2e32cc242d4bf4f4ab6c1274/GettingStartedGuide.md

Can you also please paste your sdkconfig file for the S3?

@rawalexe Have you tested it over a long period of time?
When I tested it, this happened within four hours! When I reconfigured the network, it was able to reconnect to the broker, but the above problem still occurred after a while. There is another problem: after I unplug the AP's network cable, wait a while, and plug it in again, the device is no longer able to actively report information.
Below is a screenshot of my log and sdkconfig.
[screenshots: log output and sdkconfig]

Hi @rawalexe you can see the version at the top of this issue thread .. v202212.00-23-gd25036b
I am using the LED pub sub demo, not OTA.

You will see the issue I'm experiencing in the screenshot in the original post. If left alone, the device eventually disconnects and continues to output the "no command structure" forever.

Hey @lhammond, sorry for the delay in getting back to you. The team has been looking into this issue to try and provide support. I've ordered an ESP32-S3 so I can try and replicate your exact environment as we can't seem to replicate this issue on the ESP32-C3.

While I wait for the board to get here I'm wondering if you tried this potential fix that @ActoryOu mentioned in #46?

It seems like 1 second timeout is not enough for the device to finish TLS flow.
Could you help to set Featured FreeRTOS IoT Integration -> TLS Transport Send / Receive timeout in milliseconds to 10000 by idf.py menuconfig and retest?

I'm wondering if the timeout on the TLS transport send/receives might be what is causing the MQTT agent to go down.

Thanks again for your patience with this!
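
For context on the keep-alive handling named in the issue title: an MQTT client generally sends a PINGREQ once the connection has been idle for the keep-alive interval, and treats the connection as lost if no PINGRESP arrives within some timeout. The sketch below shows only that decision. The type and function names are hypothetical, and this is not coreMQTT's implementation or the demo's configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not coreMQTT types. */
typedef enum {
    KA_IDLE,           /* nothing to do yet                        */
    KA_SEND_PING,      /* idle for a full keep-alive interval      */
    KA_WAIT_PINGRESP,  /* PINGREQ sent, still within the timeout   */
    KA_TIMED_OUT       /* no PINGRESP in time: treat link as dead  */
} ka_action_t;

typedef struct {
    uint32_t last_tx_ms;        /* when the last packet was sent      */
    uint32_t ping_sent_ms;      /* when the outstanding PINGREQ went  */
    bool     waiting_pingresp;  /* true while a PINGREQ is unanswered */
} ka_state_t;

/* Decide what the keep-alive logic should do at time now_ms.
 * Unsigned subtraction keeps the comparison correct across
 * wraparound of a 32-bit millisecond counter. */
ka_action_t ka_check(const ka_state_t *s, uint32_t now_ms,
                     uint32_t keep_alive_ms, uint32_t pingresp_timeout_ms)
{
    if (s->waiting_pingresp) {
        uint32_t waited = now_ms - s->ping_sent_ms;
        return (waited >= pingresp_timeout_ms) ? KA_TIMED_OUT
                                               : KA_WAIT_PINGRESP;
    }

    uint32_t idle = now_ms - s->last_tx_ms;
    return (idle >= keep_alive_ms) ? KA_SEND_PING : KA_IDLE;
}
```

When the result is KA_TIMED_OUT, the caller would be expected to tear down the connection and start reconnecting; a client that logs the timeout but never re-enters its connect path would match the symptom reported in this issue.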

@gavin-hy what MCU are you running?

Hey @Skptak .. I'm away from my lab for a day and will try the potential fix you mention upon return. Thanks!

@gavin-hy what MCU are you running?

ESP32-C3

Hello @lhammond and @gavin-hy
I am running all the demos, as you can see in my attached screenshot. I'll try to replicate your issue by running just the temperature pub/sub demo over a long period of time.

@lhammond can you please provide your whole sdkconfig file for the S3, with your endpoint removed, so that I have a 1-to-1 setup for replicating your issue on the S3?

Thank you,
AR

Hello @lhammond @gavin-hy,
I've tried killing the connection and then bringing it back up, as well as keeping it alive for long hours, and I am still not able to reproduce the bug.
Screenshot 2023-09-15 at 2 38 07 PM

Have you changed the code to any degree? Can you please send me a zip of your repo?

Best Regards,
AR

@Skptak there was no change in behavior by changing the TLS Transport Send / Receive timeout to 10000

@rawalexe yes, I have made some modifications. How can I send you the zip file?

I am not using OTA demo nor S3 .. do you still need the sdkconfig?

@rawalexe @anubhavrawal It's too big to email, so I just shared a Google Drive link to your email .. let me know if you can't download it. I'm happy to get on a Google Meet if you'd like.

My edits were intended to comment out the publish loop ( not using a temp sensor ) and add a few helper functions to control a neopixel 16 ring. Here's my git status

image

@lhammond ,
Thank you for sharing the code, I was running all the demos to see if any other fail as well. However, for my other runs I disabled other demos from menuconfig and proceeded with the possible replication process of killing the internet connection.

Thank you for the file, I was able to download it and will try to replicate it today. I'll keep you posted on my progress

Best Regards,
AR

@lhammond
Can you please provide a little more context on this repo? One thing I immediately notice is that you are not using the tag you mentioned at the start of the ticket; you are on the main branch. Also, what's your esp-idf version number?

If you try to use the tagged version at commit 2dccbca with esp-idf 4.4.5 commit ac5d805d0e do you still see the issue?

Best Regards,
AR

@rawalexe
bash-3.2$ idf.py --version
ESP-IDF v5.0.3-230-g35c484324f-dirty

I will try with the versions above and let you know

@rawalexe I am preparing to test with the new versions. I did want to point out that I am using a NeoPixel ring with (RMT - Addressable LED) .. this option does not appear in menuconfig for commit 2dccbca. I'm guessing all of that functionality is implemented in the demo's .c file or app_driver.c and I can port it over.

image

@rawalexe I added component_compile_options(-Wno-error=format= -Wno-format) to the bottom of main/CMakeLists.txt, apparently needed due to espressif/esp-idf#9511. After that, I am seeing the below:

image

@rawalexe I kept the publish while(true) loop but commented out the logic and it resolved the above sensor-related error. I have the pub/sub temperature LED demo running now with the versions you requested. I started it at 3:03 PM EST .. going to monitor it long term.

About 11 minutes into the test I get the following .. I will try increasing the TLS timeout.

image

@rawalexe there is no TLS timeout in this version .. but maybe the CONNACK timeout is the equivalent .. I made these changes and am rerunning the test

image

@rawalexe @anubhavrawal @Skptak

The versions above, with a CONNACK timeout of 10000, have been running for two days. I commented out the publish logic inside the while(true) loop.

So the question is, do I backport the LED demo logic to this version, or is there a plan to fix the latest branches to address the connectivity issue?

thanks

Hello @lhammond,
I see that there are a few bugs in the repo, but it will take us some time to look into them and find a fix. If you can find the specific bug within the repo and submit a PR, the team will be happy to merge a permanent fix, and that will be the quickest route.

Best Regards,
AR

@rawalexe ok, I'll see what I can do. I need to push these production devices out asap, so will probably backport the LED control logic first and will try to find some time to look around for the connectivity issue. Do you have an idea of which repo to look in? Is it a submodule or in this repo?

Would you guys be looking to apply any fixes to version 5.x?

The repo is aimed to work with the latest esp-idf. But after observing the issues for a while, it looks like using a single submoduled esp-idf might be a better idea, with support for the latest esp-idf on a best-effort basis. The next fixes will be to ensure full compatibility with v5.x.

txf- commented

I can confirm that commit d25036b is definitely the cause of it not reconnecting. I had previously reported this in #34, when it was still a patch.

After reverting the changes to the previous version of the agent manager, the device reconnected on any timeout or disconnection.

Hello @lhammond @txf-,
I have created this repo and tested it on my local device. Could either of you test it against your use cases?

https://github.com/rawalexe/iot-reference-esp32c3/tree/newEsp

It has a few updated instructions and an esp-idf v5.1.1 submodule. There are some build warnings, but these will be improved further.

Best Regards,
AR

txf- commented

The reconnection issues were fixed by the reversion of the optimizations in core_mqtt_agent_manager.c.

I can't actually tell what changes were made in newEsp that affects this. Is this repo just adjustments to make it work with idf 5.x?

Yes, the changes are the documentation on using Amazon's version of FreeRTOS and a submodule pointing to the latest esp-idf. I did not have any problems building or running the project, so I am looking for verification that this works in the previously problematic scenarios.

Best Regards,
AR

Hello @lhammond ,
You would need esp-idf 5.0+ to be able to switch the kernel version. It should be under idf.py menuconfig > Component config > FreeRTOS > Kernel > Run the Amazon SMP FreeRTOS kernel instead (FEATURE UNDER DEVELOPMENT)

I would recommend using the submoduled esp-idf for standardization purposes, but any 5.0+ esp-idf should behave mostly the same.

I am sorry, but I was not able to see the attached image in the comments, as it only shows up as [image: image.png]

Thank you

Best Regards,
AR

@rawalexe ok, finally got it sorted and just started a long running test. stay tuned!

This happened after about 20 seconds. I increased the CONNACK timeout to 20000 and am trying again.

image

That's quite unfortunate. Give me some time to see if we can fix this.

Best Regards,
AR

Hi @rawalexe .. I'm back on this project again. Have you made any progress?

Hello @lhammond,
We forwarded this issue to espressif as they wanted this to be compatible with all the esp-idf versions 4.4+. I'll keep you posted once we hear back from them.

Hello @lhammond ,
After talking with Espressif, they mentioned that adding the process loop was indeed a known problem and noticed that the file you sent us didn't contain the commit reverting the process loop (commit ID f4fe11e27a7d686b7a2f22de278ece570b692ce9). I apologize for asking you to run these tests under different conditions, but I just want to make sure that the known issues aren't creating any problems. Can you please make sure that you are using the latest changes in main and are still facing the issue? I tried these changes on my device; I didn't repeat the test over long hours, but the demo ran as expected for 30 minutes or so.

Best Regards,
AR

I provided the commit ID to make sure that it's included in the repo you are testing with. If that commit ID is in your git log history, your code should run without any problem. Now that you are actually running the demo, please let us know how it goes. If it fails, can you also make sure that your internet connection isn't down by visiting a website, just in case?

@rawalexe The LED/temperature demo at commit hash f4fe11e has been running for about 3.5 days. I have not yet tried pulling the AP's network cable to check behavior, but this is encouraging. I will try that test sometime this weekend. If I understand your message, I need to make sure that commit hash is in git log. I will try with main now.

Hello @lhammond,
That's great news, and yes, you understand it correctly. Please let us know if this resolves your problem so that we can close the issue appropriately.

Best Regards,
AR

As there is no further concern from you, I am closing this issue as resolved. If the problem persists, please feel free to reopen the issue or open a new one.