Excessive data consumption
coratron opened this issue · 15 comments
I am using the following:
- SDK version as supported by https://github.com/espressif/esp-aws-iot/tree/release/202210.01-LTS
- Platform: ESP32
- CAT-M1 modem + PPPoS
- AWS IoT Core + MQTT on port 8883
A. Before migrating the project to this new stack I had all the comms (unencrypted) running on the Modem by using AT commands. Data consumption for the operation: 1kB
B. After migrating to the embedded SDK (using PKCS#11 for managing certificates, etc). Data consumption for the operation: 19kB. I am not talking about the initial 30kB for the TLS handshake etc, but just the ongoing MQTT transmission.
C. From B. but changing the following
#define TRANSPORT_SEND_RECV_TIMEOUT_MS 5000U
to an absurd value of:
#define TRANSPORT_SEND_RECV_TIMEOUT_MS 50000U
The data consumption for the operation : 4kB
What is going on here?
The ideal goal is to achieve the minimum amount of data consumption as possible as we are using SIM cards that are billed by the MB. I would expect a higher consumption (mostly due to the handshake), but not the ongoing 4x compared to what we used to have before migrating the project to use this sdk and certificates.
Any ideas?
The only thing I can see on the terminal at the moment is a periodic:
D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
D (690227) coreMQTT: MQTT PINGREQ packet size is 2.
D (690233) coreMQTT: sendBuffer: Bytes Sent=2, Bytes Remaining=0
D (690243) coreMQTT: Sent 2 bytes of PINGREQ packet.
I am wondering if this is causing some extra data consumption in the background.
If the cause is that PINGREQ is occurring frequently, then I think the following patch is needed:
FreeRTOS/coreMQTT#235
If the cause is that PINGREQ is occurring frequently, then I think the following patch is needed: FreeRTOS/coreMQTT#235
thanks for your prompt reply @YOSI-yoshidayuji , I was precisely looking at the keep alive conditions and the other defines that can override the values. Then following your link did find out that indeed this is missing in that LTS release which most certainly would have an impact:
if( status == MQTTSuccess )
{
pContext->lastPacketRxTime = pContext->getTime();
}
I have added that manually for now and will do some tests, but may have to look at using the master branch. Unfortunately, nobody at ESP is maintaining the esp-aws-iot repo... The issues keep piling up hence why I sought help here.
No changes unfortunately, still 4kB.
The following was appearing every 50s. Not a coincidence.
D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
D (690227) coreMQTT: MQTT PINGREQ packet size is 2.
D (690233) coreMQTT: sendBuffer: Bytes Sent=2, Bytes Remaining=0
D (690243) coreMQTT: Sent 2 bytes of PINGREQ packet.
That TRANSPORT_SEND_RECV_TIMEOUT_MS is what gets passed to
static MbedtlsPkcs11Status_t configureMbedtls( MbedtlsPkcs11Context_t * pMbedtlsPkcs11Context, const char * pHostName, const MbedtlsPkcs11Credentials_t * pMbedtlsPkcs11Credentials, uint32_t recvTimeoutMs )
I have increased the value to 120s now and the message on the terminal indeed appears roughly every 120s. The data consumption is further reduced to 3kB.
Whatever is happening in the background in terms of transactions after this odd timeout occurs is what is driving this unwanted data consumption. Seems to be TLS related.
Can you track where this log is coming from:
D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
The reason I ask it might help us in finding out if some operation is being retried and resulting in more data consumption.
It comes from mbedtls_pkcs11_posix.c:
int32_t Mbedtls_Pkcs11_Recv( NetworkContext_t * pNetworkContext,
void * pBuffer,
size_t bytesToRecv )
{
MbedtlsPkcs11Context_t * pMbedtlsPkcs11Context = NULL;
int32_t tlsStatus = 0;
assert( ( pNetworkContext != NULL ) && ( pNetworkContext->pParams != NULL ) );
pMbedtlsPkcs11Context = pNetworkContext->pParams;
tlsStatus = ( int32_t ) mbedtls_ssl_read( &( pMbedtlsPkcs11Context->context ),
pBuffer,
bytesToRecv );
if( ( tlsStatus == MBEDTLS_ERR_SSL_TIMEOUT ) ||
( tlsStatus == MBEDTLS_ERR_SSL_WANT_READ ) ||
( tlsStatus == MBEDTLS_ERR_SSL_WANT_WRITE ) )
{
LogDebug( ( "Failed to read data. However, a read can be retried on this error. "
"mbedTLSError= %s : %s.",
mbedtlsHighLevelCodeOrDefault( tlsStatus ),
mbedtlsLowLevelCodeOrDefault( tlsStatus ) ) );
/* Mark these set of errors as a timeout. The libraries may retry read
* on these errors. */
tlsStatus = 0;
}
else if( tlsStatus < 0 )
{
LogError( ( "Failed to read data: mbedTLSError= %s : %s.",
mbedtlsHighLevelCodeOrDefault( tlsStatus ),
mbedtlsLowLevelCodeOrDefault( tlsStatus ) ) );
}
else
{
/* Empty else marker. */
}
return tlsStatus;
}
which in turn is the function that I assigned to transport.recv
/* Fill in TransportInterface send and receive function pointers.
* For this demo, TCP sockets are used to send and receive data
* from the network. pNetworkContext is an SSL context for OpenSSL.*/
transport.pNetworkContext = pNetworkContext;
transport.send = Mbedtls_Pkcs11_Send;
transport.recv = Mbedtls_Pkcs11_Recv;
transport.writev = NULL;
in mqtt_operations.c, just like it is done in the fleet provisioning example:
Okay - do we know what read operation fails. In other words, do we have a call stack when the read fails? Is there any possibility of capturing network traffic? Also, are you using QoS1? If yes, can you try QoS0?
This is over CAT-M1, I do not have the tools to capture this traffic unfortunately. I may have to switch over to WiFi if it comes to that but metering would be done differently (unable to use the original reference for comparison).
In any case, the increased data consumption appears to be the symptom of this timeout error.
Here is the call stack.
0x400e2ac5: Mbedtls_Pkcs11_Recv at mbedtls_pkcs11_posix.c:965 (discriminator 34)
0x400e8d8a: receiveSingleIteration coreMQTT/coreMQTT/source/core_mqtt.c:1658
0x400e92d0: MQTT_ProcessLoop at coreMQTT/coreMQTT/source/core_mqtt.c:3071
0x400dde31: processMqtt() see below...
0x400d9432: networkingTask(void*)
void processMqtt(void)
{
MQTTStatus_t mqttStatus = MQTTBadParameter;
/* Acquire the mqtt mutex lock. */
if(pthread_mutex_lock(&mqttMutex) == 0)
{
/* Loop to receive packet from transport interface. */
mqttStatus = MQTT_ProcessLoop(&mqttContext);
pthread_mutex_unlock(&mqttMutex);
}
if(mqttStatus != MQTTSuccess)
{
LogError(("Failed to process MQTT loop with error = %u.", mqttStatus));
}
else
{
LogInfo(("Processed MQTT loop successfully."));
}
}
I call processMqtt() periodically to handle any pending MQTT traffic in my networkingTask
What happens is the system stalls for the duration TRANSPORT_SEND_RECV_TIMEOUT_MS . When that ends up in failure (timeout), the system must be sending something.
I was using QoS1 but I have tried QoS0 and it produces the same results.
I call processMqtt() periodically to handle any pending MQTT traffic in my networkingTask
What happens is the system stalls for the duration TRANSPORT_SEND_RECV_TIMEOUT_MS . When that ends up in failure (timeout), the system must be sending something.
Does it mean that there is no data to read and this receive is expected to timeout? We need to still find out why the amount of data consumption is more. Can you log the number of bytes sent and bytes received in the transport send and receive functions and compare them?
@aggarg Thanks for bearing with me.
I would assume that this timeout indeed should not be a problem if there is no data to receive. The observation is the data consumption is directly proportional to the timeout condition, I did not know what data is being exchanged here. I finally managed to enable the mbedTLS debug logging - (platform specific vs library...).
Cornering the problem a little further. Here are the new observations.
I get 5 messages like this (one message for every processMqtt() call as discussed previously).
I (418920) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (418920) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned -26624 (-0x6800)
I (418924) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_fetch_input() returned -26624 (-0x6800)
I (418933) mbedTLS: |1| 0x3ffb49e4: ssl_get_next_record() returned -26624 (-0x6800)
I (418943) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_read_record() returned -26624 (-0x6800)
D (418951) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
Backtrace: ... (same as before)
I (418985) mbedTLS: |2| 0x3ffb49e4: => read
I (418985) mbedTLS: |2| 0x3ffb49e4: => read record
I (418991) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (418992) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
And then the 6th looks like this:
I (534185) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (534185) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned -26624 (-0x6800)
I (534192) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_fetch_input() returned -26624 (-0x6800)
I (534200) mbedTLS: |1| 0x3ffb49e4: ssl_get_next_record() returned -26624 (-0x6800)
I (534208) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_read_record() returned -26624 (-0x6800)
D (534215) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
Backtrace: ... (same as before)
I (534240) mbedTLS: |2| 0x3ffb49e4: => write
I (534245) mbedTLS: |2| 0x3ffb49e4: => write record
I (534250) mbedTLS: |2| 0x3ffb49e4: => encrypt buf
I (534256) mbedTLS: |4| 0x3ffb49e4: dumping 'before encrypt: output payload' (2 bytes)
I (534265) mbedTLS: |4| 0x3ffb49e4: 0000: c0 00 ..
I (534274) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used (internal)' (12 bytes)
I (534282) mbedTLS: |4| 0x3ffb49e4: 0000: ed 28 2b 61 00 00 00 00 00 00 00 2c .(+a.......,
I (534292) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used (transmitted)' (8 bytes)
I (534300) mbedTLS: |4| 0x3ffb49e4: 0000: 00 00 00 00 00 00 00 2c .......,
I (534313) mbedTLS: |4| 0x3ffb49e4: dumping 'additional data used for AEAD' (13 bytes)
I (534321) mbedTLS: |4| 0x3ffb49e4: 0000: 00 00 00 00 00 00 00 2c 17 03 03 00 02 .......,.....
I (534330) mbedTLS: |3| 0x3ffb49e4: before encrypt: msglen = 2, including 0 bytes of padding
I (534339) mbedTLS: |4| 0x3ffb49e4: dumping 'after encrypt: tag' (16 bytes)
I (534353) mbedTLS: |4| 0x3ffb49e4: 0000: c4 51 34 23 49 fc d6 2d 2d d7 a6 d6 8f d4 8a 43 .Q4#I..--......C
I (534361) mbedTLS: |2| 0x3ffb49e4: <= encrypt buf
I (534363) mbedTLS: |3| 0x3ffb49e4: output record: msgtype = 23, version = [3:3], msglen = 26
I (534372) mbedTLS: |4| 0x3ffb49e4: dumping 'output record sent to network' (31 bytes)
I (534381) mbedTLS: |4| 0x3ffb49e4: 0000: 17 03 03 00 1a 00 00 00 00 00 00 00 2c f5 8a c4 ............,...
I (534394) mbedTLS: |4| 0x3ffb49e4: 0010: 51 34 23 49 fc d6 2d 2d d7 a6 d6 8f d4 8a 43 Q4#I..--......C
I (534402) mbedTLS: |2| 0x3ffb49e4: => flush output
I (534410) mbedTLS: |2| 0x3ffb49e4: message length: 31, out_left: 31
I (534418) mbedTLS: |2| 0x3ffb49e4: ssl->f_send() returned 31 (-0xffffffe1)
I (534423) mbedTLS: |2| 0x3ffb49e4: <= flush output
I (534434) mbedTLS: |2| 0x3ffb49e4: <= write record
I (534435) mbedTLS: |2| 0x3ffb49e4: <= write
I (534442) mbedTLS: |2| 0x3ffb49e4: => read
I (534450) mbedTLS: |2| 0x3ffb49e4: => read record
I (534452) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (534458) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (535394) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (535394) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned 5 (-0xfffffffb)
I (535401) mbedTLS: |2| 0x3ffb49e4: <= fetch input
I (535404) mbedTLS: |4| 0x3ffb49e4: dumping 'input record header' (5 bytes)
I (535411) mbedTLS: |4| 0x3ffb49e4: 0000: 17 03 03 00 1a .....
I (535421) mbedTLS: |3| 0x3ffb49e4: input record: msgtype = 23, version = [0x303], msglen = 26
I (535434) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (535442) mbedTLS: |2| 0x3ffb49e4: in_left: 5, nb_want: 31
I (535443) mbedTLS: |2| 0x3ffb49e4: in_left: 5, nb_want: 31
I (535450) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned 26 (-0xffffffe6)
I (535458) mbedTLS: |2| 0x3ffb49e4: <= fetch input
I (535463) mbedTLS: |4| 0x3ffb49e4: dumping 'input record from network' (31 bytes)
I (535474) mbedTLS: |4| 0x3ffb49e4: 0000: 17 03 03 00 1a 00 00 00 00 00 00 00 13 07 c7 bc ................
I (535482) mbedTLS: |4| 0x3ffb49e4: 0010: c1 42 da 3d de f3 a6 3c b5 ac 2a df 30 e3 a8 .B.=...<..*.0..
I (535493) mbedTLS: |2| 0x3ffb49e4: => decrypt buf
I (535499) mbedTLS: |4| 0x3ffb49e4: dumping 'additional data used for AEAD' (13 bytes)
I (535507) mbedTLS: |4| 0x3ffb49e4: 0000: 00 00 00 00 00 00 00 13 17 03 03 00 02 .............
I (535522) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used' (12 bytes)
I (535525) mbedTLS: |4| 0x3ffb49e4: 0000: ca 77 ab 33 00 00 00 00 00 00 00 13 .w.3........
I (535535) mbedTLS: |4| 0x3ffb49e4: dumping 'TAG used' (16 bytes)
I (535542) mbedTLS: |4| 0x3ffb49e4: 0000: bc c1 42 da 3d de f3 a6 3c b5 ac 2a df 30 e3 a8 ..B.=...<..*.0..
I (535553) mbedTLS: |2| 0x3ffb49e4: <= decrypt buf
I (535563) mbedTLS: |4| 0x3ffb49e4: dumping 'input payload after decrypt' (2 bytes)
I (535571) mbedTLS: |4| 0x3ffb49e4: 0000: d0 00 ..
I (535579) mbedTLS: |2| 0x3ffb49e4: <= read record
I (535582) mbedTLS: |2| 0x3ffb49e4: <= read
I (536393) mbedTLS: |2| 0x3ffb49e4: => read
I (536394) mbedTLS: |2| 0x3ffb49e4: => read record
I (536395) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (536401) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
Then it starts all over.
The observation is the data consumption is directly proportional to the timeout condition, I did not know what data is being exchanged here.
How are you measuring this data consumption?
Directly from the carrier (that's how they bill), I am using their API as well to get higher resolution - but can only see data in chunks of about 15 min.
@aggarg just found this 6th packet is happening for a ping request by enabling coreMQTT logs as well.
D (302432) coreMQTT: MQTT PINGREQ packet size is 2.
It seems I am back to @YOSI-yoshidayuji 's suggestion. I will have to check if that patch has been fully implemented in the version I am using.
Ok, I think we can put this to rest. I have removed PINGREQ for now. Data consumption is 3kB per operation which is substantially higher than what I expected but will have to settle for that. I can't see what else but TLS is responsible for the added data.
Thanks for the help @aggarg and @YOSI-yoshidayuji
I can't see what else but TLS is responsible for the added data.
I did not realize that you were not using encryption before. If you want to further dive deep, we would need some way to capture data. One way can be to setup MQTT broker on a PC and run capture on that.
Thank you for sharing your solution!