[BUG] core_mqtt_agent_manager retry loop after wifi connection
lhammond opened this issue · 16 comments
Describe the bug
Wifi connects, but then core_mqtt_agent_manager loops on retry attempts.
System information
- Hardware board: [ esp32s3 (Seeed Xiao) ]
- IDE used: [ VSCode ]
- Operating System: [ MacOS ]
- Code version: [ v202212.00-23-gd25036b ]
- Project/Demo: [ Temperature LED demo with AWS pub/sub ]
Expected behavior
Log entries that confirm that the TLS connection was successful and that MQTT messages are published as shown in the GettingStarted.md
Steps to reproduce bug
idf.py -p /dev/cu.usbmodem14101 flash monitor
Hey @lhammond, thanks for reaching out about this issue.
This loop you are stuck in is generally caused by the TLS connection succeeding, but then not being able to make a proper MQTT connection to the endpoint. Have you verified that you can reach the endpoint from your host machine using the same certs as on the device?
I have the same problem. Even creating a new thing and installing its certificates does not get the device out of that loop.
Hey, I managed to acquire a board and have tried out this demo. I'm able to successfully connect to my AWS IoT endpoint and reconnect. However, I added the Retain Publishes policy to my device's policy:
{
"Effect": "Allow",
"Action": "iot:RetainPublish",
"Resource": "arn:aws:iot:<REGION>:*"
},
This is in addition to the policies mentioned in the "Creating an AWS IoT Policy" section of the AWSSetup.md file.
Would you mind also adding this entry and seeing if you can then connect to the server correctly?
I tried an old version of this repository and it succeeds in sending the message to AWS, although I haven't tried OTA.
For iot-reference-esp32c3:
git checkout e0cda47 .
And for esp-aws-iot:
git checkout 1fc7681778bc271960a4e3db514a209df0380917 .
I guess I'll try main again with this policy.
I'm having the same issue. Has anyone solved this?
Hi guys. After digging into the history of the repository, I was able to connect to MQTT.
To make it work I reverted PR #43, which updates the esp-aws-iot submodule to point to a version that seems to have a bug with TLS authentication (I haven't looked closely enough to be sure).
I also encountered the same problem after directly cloning this version and running it. When I swap in the previous esp-aws-iot component instead, compilation fails.
I am not sure, but my issue may have been caused by the board getting an IP address without actually having network reachability. I think this because, when I was having this issue, I was a guest on someone else's wifi, and when I moved to my own network it started working. Unfortunately, I have no way to know for sure, but I wanted to follow up.
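The distinction described above, between having an IP address and actually reaching the broker, can be checked with a plain TCP connect. The following is a minimal host-side sketch assuming POSIX sockets. The function name canReachTcp is hypothetical and not part of this project, and the check covers TCP reachability only: it performs no TLS handshake and no MQTT connect.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose getaddrinfo() under strict -std modes */
#endif

#include <assert.h>
#include <netdb.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns true if a TCP connection to host:port can be
 * opened. It does not attempt TLS or MQTT. */
static bool canReachTcp( const char * host, const char * port )
{
    struct addrinfo hints;
    struct addrinfo * results = NULL;
    struct addrinfo * entry;
    bool reachable = false;

    if( ( host == NULL ) || ( port == NULL ) )
    {
        return false;
    }

    memset( &hints, 0, sizeof( hints ) );
    hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    /* A DNS failure already means the endpoint is not usable from here. */
    if( getaddrinfo( host, port, &hints, &results ) != 0 )
    {
        return false;
    }

    assert( results != NULL ); /* success implies at least one address */

    /* Try each resolved address until one accepts the connection. */
    for( entry = results; ( entry != NULL ) && !reachable; entry = entry->ai_next )
    {
        int fd = socket( entry->ai_family, entry->ai_socktype, entry->ai_protocol );

        if( fd < 0 )
        {
            continue;
        }

        if( connect( fd, entry->ai_addr, entry->ai_addrlen ) == 0 )
        {
            reachable = true;
        }

        close( fd );
    }

    freeaddrinfo( results );
    return reachable;
}
```

Calling canReachTcp with the AWS IoT endpoint host name and "8883" (the usual MQTT-over-TLS port) from a machine on the same network gives a quick yes or no before looking at certificates or policies. Note that the connect call blocks until the operating system gives up, so an unreachable host can take a while to report.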
Hello,
@lhammond, thank you for reporting back. Indeed, if the board is not allowed to acquire an IP address (a restrictive policy on the router?), then you would see this issue.
@gavin-hy, @joaspacce, @jonth3425 would any of you mind trying this out with additional logging statements like the following by modifying this section of the codebase:
do
{
xTlsRet = xTlsConnect( pxNetworkContext );
if( xTlsRet == TLS_TRANSPORT_SUCCESS )
{
+ ESP_LOGE( TAG, "TLS connection succeeded\n");
if( esp_tls_get_conn_sockfd( pxNetworkContext->pxTls, &lSockFd ) == ESP_OK )
{
+ ESP_LOGE( TAG, "Got sockfd\n");
eMqttRet = prvCoreMqttAgentConnect( xCleanSession );
}
else
{
+ ESP_LOGE( TAG, "Failed to get sockfd\n");
eMqttRet = MQTTBadParameter;
}
if( eMqttRet != MQTTSuccess )
{
ESP_LOGE( TAG,
"MQTT_Status: %s",
MQTT_Status_strerror( eMqttRet ) );
}
}
+ else
+ {
+ ESP_LOGE( TAG, "TLS connection failed\n");
+ }
if( eMqttRet != MQTTSuccess )
{
xTlsDisconnect( pxNetworkContext );
xBackoffRet = prvBackoffForRetry( &xReconnectParams );
}
} while( ( eMqttRet != MQTTSuccess ) && ( xBackoffRet == pdPASS ) );
This would allow us to pinpoint the issue and help us debug further. As @Skptak mentioned, this succeeds on our end and we cannot reproduce the error and would need your help in figuring out the error.
Thanks,
Aniruddha
I tried again, from cloning the code through running the program. When I clone the project on my Windows 10 computer, the clone fails immediately, as shown in the figure below.

But it was cloned successfully on Amazon's virtual machine, as shown in the figure below.

After I copied the successfully cloned code to my local Windows 10 machine, it ran in EspressifIDE, but "Retry attempt" was displayed continuously. The following is the output log:
I (3637) wifi:set rx beacon pti, rx_bcn_pti: 14, bcn_timeout: 14, mt_pti: 25000, mt_time: 10000
I (3667) wifi:AP's beacon interval = 102400 us, DTIM period = 1
I (3677) wifi:idx:0 (ifx:0, b8:3a:08:ce:f3:70), tid:0, ssn:2, winSize:64
I (4597) wifi:idx:1 (ifx:0, b8:3a:08:ce:f3:70), tid:6, ssn:0, winSize:64
I (8647) core_mqtt_agent_manager: WiFi connected.
I (8647) app_wifi: Connected with IP Address:192.168.2.225
I (8647) esp_netif_handlers: sta ip: 192.168.2.225, mask: 255.255.255.0, gw: 192.168.2.1
I (10197) core_mqtt_agent_manager: Retry attempt 1.
I (11807) core_mqtt_agent_manager: Retry attempt 2.
I (14317) core_mqtt_agent_manager: Retry attempt 3.
I (17317) core_mqtt_agent_manager: Retry attempt 4.
I (20507) core_mqtt_agent_manager: Retry attempt 5.
I (26587) core_mqtt_agent_manager: Retry attempt 6.
I (28437) core_mqtt_agent_manager: Retry attempt 7.
I (33017) core_mqtt_agent_manager: Retry attempt 8.
I (35487) core_mqtt_agent_manager: Retry attempt 9.
I (37487) core_mqtt_agent_manager: Retry attempt 10.
I (38657) core_mqtt_agent_manager: Retry attempt 11.
I (42147) core_mqtt_agent_manager: Retry attempt 12.
I (48367) core_mqtt_agent_manager: Retry attempt 13.
I (51117) core_mqtt_agent_manager: Retry attempt 14.
I (55347) core_mqtt_agent_manager: Retry attempt 15.
I (61487) core_mqtt_agent_manager: Retry attempt 16.
I (66317) core_mqtt_agent_manager: Retry attempt 17.
I (70897) core_mqtt_agent_manager: Retry attempt 18.
I (73337) core_mqtt_agent_manager: Retry attempt 19.
I (79417) core_mqtt_agent_manager: Retry attempt 20.
I (82907) core_mqtt_agent_manager: Retry attempt 21.
I (85887) core_mqtt_agent_manager: Retry attempt 22.
I (87677) core_mqtt_agent_manager: Retry attempt 23.
I (92627) core_mqtt_agent_manager: Retry attempt 24.
I (95347) core_mqtt_agent_manager: Retry attempt 25.
I (99077) core_mqtt_agent_manager: Retry attempt 26
In addition, there is another problem: this project cannot be debugged. The projects that come with ESP-IDF can be debugged in EspressifIDE, but this project cannot. The following is my configuration and error information.

Hello @gavin-hy,
Thank you for taking the time to report back.
> I tried again, from cloning the code through running the program. When I clone the project on my Windows 10 computer, the clone fails immediately, as shown in the figure below.
I am not sure why that would happen. Is it repeatable?
> it ran in EspressifIDE, but "Retry attempt" was displayed continuously. The following is the output log
Thank you for running this, but did you make the changes to the code that I suggested earlier in this thread? Those changes will make the code produce more logging that shows us exactly what is failing, which would help us figure out the issue.
> In addition, there is another problem: this project cannot be debugged. The projects that come with ESP-IDF can be debugged in EspressifIDE, but this project cannot. The following is my configuration and error information.
@ActoryOu would you mind taking a look at this? Does the information @gavin-hy provided seem correct?
Thanks,
Aniruddha
Hi, @AniruddhaKanhere
First of all, thank you very much for your reply.
Regarding the cloning failure, the result is the same after several retries, and the cloned project is only 56 MB in size.
Regarding the additional logging, I added it yesterday, but the added logs are never output, so I debugged today. Below is a GIF recorded while debugging. The TLS layer reports a failure when connecting to Amazon. I am sure there is no problem with my certificate and policy, because I can connect with the previous version.

Finally, I want to say that debugging in VSCode is really too difficult to use, and its response is very slow! Why is it so fast in Eclipse?
Hi @gavin-hy,
Thanks for reaching out! This does happen in my environment, too.
It seems like a 1-second timeout is not enough for the device to finish the TLS flow.
Could you set Featured FreeRTOS IoT Integration -> TLS Transport Send / Receive timeout in milliseconds to 10000 using idf.py menuconfig and retest?
Thanks.
Oh, I see why there is no additional logging. You added the else in the wrong location. It should pair with if( xTlsRet == TLS_TRANSPORT_SUCCESS ), but you attached it to if( eMqttRet != MQTTSuccess ) instead.
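The effect of the two else placements can be sketched in isolation. This is a simplified host-side illustration, not the project's code: the enum types and function names are hypothetical stand-ins, and it assumes the MQTT status holds a non-success value whenever the TLS connect fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real status types; only control flow matters. */
typedef enum { TLS_OK, TLS_FAIL } TlsResult;
typedef enum { MQTT_OK, MQTT_FAIL } MqttResult;

/* else paired with the TLS check: a TLS failure is always reported. */
static bool reportsTlsFailureCorrect( TlsResult tls )
{
    MqttResult mqtt = MQTT_FAIL; /* assumed: not connected until proven otherwise */
    bool logged = false;

    if( tls == TLS_OK )
    {
        mqtt = MQTT_OK; /* pretend the MQTT connect succeeded */
    }
    else
    {
        printf( "TLS connection failed\n" );
        logged = true;
    }

    if( mqtt != MQTT_OK )
    {
        /* the real loop disconnects and backs off here */
    }

    return logged;
}

/* else paired with the MQTT check: a TLS failure is never reported. */
static bool reportsTlsFailureMisplaced( TlsResult tls )
{
    MqttResult mqtt = MQTT_FAIL;
    bool logged = false;

    if( tls == TLS_OK )
    {
        mqtt = MQTT_OK;
    }

    if( mqtt != MQTT_OK )
    {
        /* disconnect and back off */
    }
    else
    {
        /* Reached only when MQTT succeeded, so never on a TLS failure. */
        printf( "TLS connection failed\n" );
        logged = true;
    }

    return logged;
}

/* Quick self-check of the behaviour on a failed TLS connect. */
static void checkElsePlacement( void )
{
    assert( reportsTlsFailureCorrect( TLS_FAIL ) );    /* failure is logged */
    assert( !reportsTlsFailureMisplaced( TLS_FAIL ) ); /* failure is silent */
}
```

With the misplaced else, a failed TLS connect produces no output at all, which matches the missing logs, and the message would instead appear only after a fully successful connection.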
But, regardless, we now know that the TLS connect is failing and the issue is not related to coreMQTT-Agent or coreMQTT.
I am not sure why that would be happening, though. Can you run Wireshark on your computer and use the computer's hotspot to provide connectivity to the esp32?
Wireshark will tell us why the TLS connection is failing, if a packet ever gets sent.
Thanks again,
Aniruddha
Hi, @AniruddhaKanhere , @ActoryOu,
Thank you very much for your reply.
After debugging in the morning, I changed the value of "Receive timeout in milliseconds" to 2000 at the suggestion of my colleagues. I have successfully connected to Amazon.
Thank you for your reply again.
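The outcome reported in this thread (failure at 1000 ms, success at 2000 ms) is consistent with a single blocking step of the TLS flow taking longer than the configured timeout. The following is an illustrative sketch only. It assumes the timeout applies to each send or receive operation rather than to the whole handshake, the function name is hypothetical, and the step durations are made-up values, not measurements from any device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when every blocking step finishes within
 * the per-operation timeout. Assumption: the timeout is applied per send or
 * receive call, not to the handshake as a whole. */
static bool handshakeFitsTimeout( const uint32_t * stepMs,
                                  size_t stepCount,
                                  uint32_t timeoutMs )
{
    size_t i;

    assert( ( stepMs != NULL ) || ( stepCount == 0U ) );

    for( i = 0U; i < stepCount; i++ )
    {
        if( stepMs[ i ] > timeoutMs )
        {
            return false; /* this step would time out and fail the connect */
        }
    }

    return true;
}

/* Made-up step durations in milliseconds, for illustration only. */
static const uint32_t exampleStepsMs[] = { 300U, 1500U, 400U };
```

With these example values, a 1000 ms timeout fails on the 1500 ms step, while 2000 ms and 10000 ms both pass. A larger value mainly trades slower failure detection for tolerance of slow networks or slow cryptographic operations on the device.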


