[BUG] TLS connections with non_block or short timeouts are getting dropped frequently

Question

[BUG] TLS connections with non_block or short timeouts are getting dropped frequently

Closed this issue 2 years ago · 12 comments

Describe the bug
TLS connections are frequently dropped. I need a relatively high data rate publish, with blocking mode the default 3000ms timeout connects, but it blocks 3 seconds waiting for receive even if nothing is being received. Trying to enable non_block one needs to increase the timeout to something of the order of 10000 ms, in order to reliably establish a connection at first try. However, non_block will lead to frequently dropped connections, at somewhat indeterministic rates, sometimes it will drop after every message, sometimes it survives hours.

System information

Hardware board: ESP32 custom board
IDE used: VSCODE with extension
Operating System: [ Windows WSL2 ]
Code version: v202212.00
Project/Demo: custom application

Expected behavior
I'd expect to be able to publish at a fairly reliable rate, without dropping the connection, I have not observed these issues with other MQTT libraries.

Screenshots or console output

W (1964209) mbedtls: ssl_msg.c:2866 0x3fff5b28: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

W (1964210) mbedtls: ssl_msg.c:5766 0x3fff5b28: mbedtls_ssl_write_record() returned -26752 (-0x6880)

W (1964228) MQTTWrap: Error or timed out waiting for ack for publish message 9827. Re-attempting publish.
I (1964259) esp-tls: connecting...

I enabled debug output for mbedtls. MQTTWrap is just a class that handles publishes and subscribes, it is heavily based on the demo functions.

Steps to reproduce bug
All that one needs to do is to have a short .timeout_ms value or .non_block enabled in network_transport. OTA with the default demo settings will trigger it, as will increasing the publish rate to something of the order of 9hz.

Answer 1 · 2023-03-21T17:14:54.000Z

Hi, I am contacting the proper internal expert and we will get back to you as soon as possible.

Answer 2 · 2023-03-21T17:58:20.000Z

@txf- Does the same thing occur with QoS0 rather than QoS1 publishes?

Answer 3 · 2023-03-21T18:04:34.000Z

same thing occur with QoS0

Sorry, I should have noted, this happens mostly with with QoS0, it is where I can get the fastest publish rate. I don't know if OTA Job messages are QoS1, if they are then it happens with that too.

Answer 4 · 2023-03-21T18:53:30.000Z

@txf- Can you share the exact changes made to the transport implementation?

Answer 5 · 2023-03-21T21:32:59.000Z

In "components/esp-aws-iot/libraries/coreMQTT/port/network_transport/network_transport.c"

in xTlsConnect() I changed the tls config struct. As posted below, with non_block=true , timeout_ms has to be set fairly large in order to have some degree of connection success on the first try (and not trigger backoff delays in the Agent connection task).

Alternatively with non_block=false, timeout_ms has to be set low <100 ms, in order for espTlsTransportRecv() not to block unreasonably long. Currently in the repo it is set to 1000, meaning it will block for 1 second regardless of whether there is anything to be received.

    esp_tls_cfg_t xEspTlsConfig = {
        .cacert_buf = (const unsigned char*) ( pxNetworkContext->pcServerRootCAPem ),
        .cacert_bytes = strlen( pxNetworkContext->pcServerRootCAPem ) + 1,
        .clientcert_buf = (const unsigned char*) ( pxNetworkContext->pcClientCertPem ),
        .clientcert_bytes = strlen( pxNetworkContext->pcClientCertPem ) + 1,
        .skip_common_name = pxNetworkContext->disableSni,
        .alpn_protos = pxNetworkContext->pAlpnProtos,
        .non_block = true,
#if CONFIG_CORE_MQTT_USE_SECURE_ELEMENT
        .use_secure_element = true,
#elif CONFIG_CORE_MQTT_USE_DS_PERIPHERAL
        .ds_data = pxNetworkContext->ds_data,
#else
        .use_secure_element = false,
        .ds_data = NULL,
        .clientkey_buf = ( const unsigned char* )( pxNetworkContext->pcClientKeyPem ),
        .clientkey_bytes = strlen( pxNetworkContext->pcClientKeyPem ) + 1,
#endif
        .timeout_ms = 15000
    };

I recall somebody mentioning that OTA in this reference implementation is configured the way it is because of some issue with ESP-TLS. I don't know if that is related.

Edit: here is some additional debug information. I turned on logging info in mbedtls and in the coremqtt agent manager, MQTTWrap is handling the publish function:

I (37440) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (37452) core_mqtt_agent_manager: TLS connection was disconnected.
I (37456) esp-tls: connecting...
I (39177) core_mqtt_agent_manager: Session present: 1

I (39188) core_mqtt_agent_manager: coreMQTT-Agent connected.
W (158706) mbedtls: ssl_msg.c:2866 0x3fff3750: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

W (158707) mbedtls: ssl_msg.c:5766 0x3fff3750: mbedtls_ssl_write_record() returned -26752 (-0x6880)

W (158725) MQTTWrap: Error or timed out waiting for ack for publish message 1321. Re-attempting publish.
I (158725) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (158737) core_mqtt_agent_manager: TLS connection was disconnected.
I (158828) esp-tls: connecting...
I (161491) core_mqtt_agent_manager: Session present: 1

I (161501) core_mqtt_agent_manager: coreMQTT-Agent connected.
W (173301) mbedtls: ssl_msg.c:2866 0x3fff2a18: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

W (173301) mbedtls: ssl_msg.c:5766 0x3fff2a18: mbedtls_ssl_write_record() returned -26752 (-0x6880)

W (173319) MQTTWrap: Error or timed out waiting for ack for publish message 1467. Re-attempting publish.
I (173319) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (173331) core_mqtt_agent_manager: TLS connection was disconnected.
I (173335) esp-tls: connecting...
I (174970) core_mqtt_agent_manager: Session present: 1

I (174981) core_mqtt_agent_manager: coreMQTT-Agent connected.

I can't seem to find what that mbedtls error code is.

Answer 6 · 2023-03-22T16:23:37.000Z

0x6880 corresponds to MBEDTLS_ERR_SSL_WANT_WRITE which indicates that the write or read operation should be retried after a delay.

Are you using esp_tls_conn_new_sync or esp_tls_conn_new_async?

esp_tls_conn_new_sync is probably the righh choice here.

Answer 7 · 2023-03-22T17:53:00.000Z

Are you using esp_tls_conn_new_sync or esp_tls_conn_new_async?

esp_tls_conn_new_sync is probably the righh choice here.

using the same as reference, esp_tls_conn_new_sync(). In that function, I did hack up a bodge for the situation of a blocking connection with a very small timeout:

size_t timeout_ticks = pdMS_TO_TICKS(cfg->timeout_ms*20);

Answer 8 · 2023-03-22T20:47:54.000Z

@txf- I've reproduced the problems you described.

There are two reasons for the poor performance you noted:

The MQTTAgent_ProcessLoop function is never called, which results in very slow receive performance.
The call to esp_tls_conn_read in blocking mode always takes the entire timeout time to return any data.

You can find a proof of concept fix for these two issues in the following branches. I'll work on optimizing and cleaning these up in the next couple days:

https://github.com/paulbartell/esp-aws-iot/tree/hack/non_blocking_transport

https://github.com/paulbartell/iot-reference-esp32c3/tree/wip/add_ProcessLoop_call

Answer 9 · 2023-03-23T09:41:24.000Z

You can find a proof of concept fix for these two issues in the following branches. I'll work on optimizing and cleaning these up in the next couple days:

https://github.com/paulbartell/esp-aws-iot/tree/hack/non_blocking_transport

I tried this particular change. It is both better and sometimes worse. Two cases:

In most situations I get a lot of these messages:

W (171487) mbedtls: ssl_msg.c:2866 0x3ffee29c: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

W (171487) mbedtls: ssl_msg.c:5766 0x3ffee29c: mbedtls_ssl_write_record() returned -26752 (-0x6880)

W (171497) mbedtls: ssl_msg.c:5749 0x3ffee29c: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

W (171507) mbedtls: ssl_msg.c:5749 0x3ffee29c: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

W (171517) mbedtls: ssl_msg.c:5749 0x3ffee29c: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

Communication and the tls connection is maintained after these messages.

Another time it spammed 50-100 of mbedtls_ssl_flush_output() returned -26752 (-0x6880) and then eventually took down wifi:

W (72610) mbedtls: ssl_msg.c:5749 0x3ffed9c0: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

I (72619) wifi:ap_probe_send over, resett wifi status to disassoc
I (72625) wifi:state: run -> init (c800)
I (72629) wifi:pm stop, total sleep time: 25124924 us / 49853590 us

I (72635) wifi:<ba-del>idx
I (72637) wifi:<ba-del>idx
I (72640) wifi:new:<5,0>, old:<5,0>, ap:<255,255>, sta:<5,0>, prof:1
I (72646) core_mqtt_agent_manager: WiFi disconnected.
W (72652) mbedtls: ssl_msg.c:5749 0x3ffed9c0: mbedtls_ssl_flush_output() returned -26752 (-0x6880)

I (72652) smartconfig: Wifi Disconnected
I (72666) smartconfig: Wifi Connecting
W (72670) mbedtls: ssl_msg.c:5749 0x3ffed9c0: mbedtls_ssl_flush_output() returned -78 (-0x004e)

E (72680) esp-tls-mbedtls: write error :-0x004E:
W (72685) MQTTWrap: Error or timed out waiting for ack for publish message 132. Re-attempting publish.

Is this second case a different problem, and the flush messages represent attempts to send until the timeout is reached?

Answer 10 · 2023-03-23T17:56:20.000Z

@txf- This is a downside of choosing a non-blocking socket: lack of back-pressure. Replacing the vTaskDelay call in espTlsTransportSend with a select loop should address this.

QoS1 publishes require a significantly larger amount of resources, so you may be better served with QoS0 publishes.

How frequently are you attempting to publish? Is you code waiting for the callback from MqttAgent before attempting to send the next message?

Answer 11 · 2023-03-24T09:02:49.000Z

QoS1 publishes require a significantly larger amount of resources, so you may be better served with QoS0 publishes.

I am actually using QoS0 messages when I need fast publish rates. I use QoS1 for more infrequent messages.

How frequently are you attempting to publish? Is you code waiting for the callback from MqttAgent before attempting to send the next message?

I currently have a sensor that has an acquisition rate of 10Hz.

There are two modes where I send sensor data, the first mode I buffer a few seconds worth of processed data before sending it, in a "large" message. I use QoS1 there and these messages present no issues whatsoever. This is the mode that will be active 99% of the time

The second mode, is a temporary mode that only runs for a short period of time, here I need to send the raw sensor data and the processed data at the acquisition rate. In this "streaming" mode I use QoS0 for the reasons you stated, but this is the mode that creates the problems that led to this bug report. For the most part I am achieving this transmission rate, especially so with now with your non_blocking changes. I do wait for the Callbacks, the publish functions are very similar to reference, in that they block and wait for the agent notification.

Edit: I should note that even when reducing the rate to about 3Hz, I still periodically get moments where I get a flood of mbedtls_ssl_flush_output() and mbedtls_ssl_write_record().

Answer 12 · 2023-04-18T13:16:28.000Z

@paulbartell

https://github.com/paulbartell/iot-reference-esp32c3/tree/wip/add_ProcessLoop_call

I have tested this change. It definitely improves performance, however it does have a downside, in that if the connection is dropped or the esp is disconnected from the network, it does not attempt to reestablish the connection.

This is a downside of choosing a non-blocking socket: lack of back-pressure. Replacing the vTaskDelay call in espTlsTransportSend with a select loop should address this.

How would I do this? My knowledge of Sockets is quite limited.

edit: I managed to make this change myself. Posting the changes, in case anybody stumbles into this issue:

int32_t espTlsTransportSend( NetworkContext_t* pxNetworkContext,
                             const void* pvData, size_t uxDataLen)
{
    int lSockFd = -1;
    fd_set write_fds;
    fd_set errorSet;

    struct timeval timeout = { .tv_usec = 10000, .tv_sec = 0 };

    int32_t lBytesSent = -1;

    if( ( pvData != NULL ) &&
        ( uxDataLen > 0 ) &&
        ( pxNetworkContext != NULL ) &&
        ( pxNetworkContext->pxTls != NULL ) )
    {
        TimeOut_t xTimeout;
        TickType_t xTicksToWait = pxNetworkContext->xTimeout;

        vTaskSetTimeOutState( &xTimeout );

        if( xSemaphoreTake(pxNetworkContext->xTlsContextSemaphore, xTicksToWait) == pdTRUE )
        {
            lBytesSent = 0;
            if( esp_tls_get_conn_sockfd( pxNetworkContext->pxTls, &lSockFd ) == ESP_OK ){
                //ESP_LOGI("transport", "got socket fd");               
            }
            do
            {
                FD_ZERO(&write_fds);
                FD_SET(lSockFd, &write_fds);

                FD_ZERO( &errorSet );
                FD_SET( lSockFd, &errorSet );
                int select_ret = select( lSockFd + 1, NULL,&write_fds, &errorSet, &timeout );
                if (select_ret == -1) {
                    // handle select() error
                    ESP_LOGI("transport", "select() err");
                    continue;
                } else if (select_ret == 0) {
                    //ESP_LOGI("transport", "select() timeout");
                    // handle select() timeout
                    continue;
                }

                ssize_t lResult = esp_tls_conn_write(pxNetworkContext->pxTls, pvData, uxDataLen);

                if( lResult > 0 )
                {
                    lBytesSent += ( int32_t ) lResult;
                }
                else if( ( lResult != MBEDTLS_ERR_SSL_WANT_WRITE ) &&
                        ( lResult != MBEDTLS_ERR_SSL_WANT_READ ) )
                {
                    lBytesSent = lResult;
                }

                if( ( lBytesSent < 0 ) ||
                    ( lBytesSent == uxDataLen ) )
                {
                    break;
                }
                //vTaskDelay( 1 );
            }
            while( xTaskCheckForTimeOut( &xTimeout, &xTicksToWait ) == pdFALSE );

            xSemaphoreGive(pxNetworkContext->xTlsContextSemaphore);
        }
    }
    return lBytesSent;
}

What I said previously still applies, so this may not be the best implementation. Feel free to correct me here.