aws/aws-iot-device-sdk-python-v2

Publish future raises AWS_ERROR_MQTT_CONNECTION_DESTROYED

edcloudcycle opened this issue ยท 15 comments

Describe the bug

In our design I add a callback to the future returned by the publish method:

publish_future, pkt_id = self._mqtt_connection.publish(topic, pkt[1], mqtt.QoS.AT_LEAST_ONCE, retain=False)
publish_future.add_done_callback(lambda x: self._publish_future_callback(x, pkt, topic))   

This normally works fine but I have been testing the code for robustness using a repeating cycle in and out of connected modes and on one occasion I got this exception: AWS_ERROR_MQTT_CONNECTION_DESTROYED

After this the sdk still appears to respond correctly but never publishes any packets. It even indicates it is resuming a connection but I don't think it really is.

I have added a workaround to recreate the connection after I get this exception but I'd like to understand why it occurs.

Expected Behavior

For this exception to not get thrown and if it does for other library methods to then fail instead of appearing to work.

Current Behavior

Occasionally throws AWS_ERROR_MQTT_CONNECTION_DESTROYED and then fails to publish afterwards although the library seems to respond like it is working:

exception calling callback for <Future at 0xffff74059160 state=finished raised AwsCrtError>
Traceback (most recent call last):
  File "/bba_app/aws_iot_core.py", line 154, in _publish_future_callback
    result = publish_future.result(timeout=0.0)
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
awscrt.exceptions.AwsCrtError: AWS_ERROR_MQTT_CONNECTION_DESTROYED: Connection has started destroying process, all uncompleted requests will fail.

Reproduction Steps

  1. Create connection
  2. Start connection
  3. Publish every 10 seconds for 5 minutes
  4. Shutdown connection
  5. Idle 2 minutes
  6. Repeat

Possible Solution

Are the AWSCRT objects persistent even with create context calls?

Additional Information/Context

Can someone explain what this message means and what can cause it? If there is an error in how I am using the sdk then perhaps that will make it clear.

SDK version used

1.11.3

Environment details (OS name and version, etc.)

Debian, Python 3.8

Thank you for making this issue! A few questions:

  • When did the AWS_ERROR_MQTT_CONNECTION_DESTROYED error occur? Did it occur on the disconnection, re-connection, etc?
    • If it occurred on the disconnection, how are you shutting down the connection?
  • When you say it never publishes any more packets, do you mean packets prior to getting AWS_ERROR_MQTT_CONNECTION_DESTROYED or all packets afterwards?

Doing a quick search through the code, we only use AWS_ERROR_MQTT_CONNECTION_DESTROYED in a single location and that is when we're destroying a client connection and have pending requests. We use AWS_ERROR_MQTT_CONNECTION_DESTROYED to signify that the connection was destroyed and therefore the pending requests will never get a response from the server.

Are the AWSCRT objects persistent even with create context calls?

To the best of my knowledge, the code will keep a reference to objects and keep them alive as long as necessary to perform the desired operation. This is to prevent memory suddenly disappearing if we do something asynchronously. I'm not sure if that is what is going on in this case though.

Hi,
The message occurs during the creation of a connection. I am shutting down using:
try: self._mqtt_connection.disconnect() except: pass
Perhaps I should add some debug here just to check if it does throw.
It occurs to me that a good use of this exception would be to put the unsent packet back on the queue of packets for use on reconnection.

It looks like the not sending packets symptom might be related to my code falling out of its correct state machine path and not recovering after the exception. I have tightened that up and will start testing again now.
Thanks very much for your response.
Ed

Thank you for the additional information! If sending packets continue to fail, please let me know and I will start taking a closer look to see if I can reproduce the issue ๐Ÿ™‚

@edcloudcycle Are you getting the AWS_ERROR_MQTT_CONNECTION_DESTROYED error anymore?

Yes I do see it still in my cycle testing which involves bringing the system up and down far more frequently than when deployed, effectively this means a loop per the pseudo code in the attached file, sorry code rendering did not work well for me in this message.
Perhaps I should be destructing the connection object explicitly after I disconnect?

aws_example.zip

Hi @jmklix, just an update from me, after more than a week of running on the cycle above at an accelerated rate I get:
1048 cycles
159 times I got the destroyed message
So 15.2%
There doesn't seem to be a pattern, see the attached image showing the vscode search results and where the detected term appears in the log file. The device probably has better or worse signal strength over time so it could just reflect that combined with some timing issue I don't understand.
image

Hi @jmklix,
Having pushed the system a bit harder I can make it through this error that I don't understand, any suggestions on where to look?
Signal received: 1852270931, errno: 1914727521
################################################################################
Stack trace:
################################################################################
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(aws_backtrace_print+0x4c) [0xffff9a7cc610]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(+0x6f598) [0xffff9a747598]
linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffffa60e45c0]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(aws_backtrace_print+0x28) [0xffff9a7cc5ec]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(+0x6f598) [0xffff9a747598]
linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffffa60e45c0]
/usr/bin/python3() [0x4bed60]
/usr/bin/python3(PyUnicode_FromFormatV+0x838) [0x4b80ec]
/usr/bin/python3() [0x5995fc]
/usr/bin/python3(_PyErr_Format+0x68) [0x59edd8]
/usr/bin/python3(_PyObject_MakeTpCall+0x7d4) [0x4a5748]
/usr/bin/python3(_PyObject_CallFunction_SizeT+0x224) [0x4dce44]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(+0x70d58) [0xffff9a748d58]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(+0x806e8) [0xffff9a7586e8]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(+0xcd7c4) [0xffff9a7a57c4]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(+0xd1fd4) [0xffff9a7a9fd4]
/usr/local/lib/python3.9/dist-packages/_awscrt.cpython-39-aarch64-linux-gnu.so(+0xf4f78) [0xffff9a7ccf78]
/lib/aarch64-linux-gnu/libpthread.so.0(+0x8628) [0xffffa608b628]
/lib/aarch64-linux-gnu

Hi @jmklix,
Another example of the error from over the weekend, not sure if it is possible to understand what is happening from this:

Signal received: 1852270931, errno: 1914727521
################################################################################
Stack trace:
################################################################################
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(aws_backtrace_print+0x4c) [0xffff77ca5ec0]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(+0x718b8) [0xffff77c198b8]
linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffff834945c0]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(aws_backtrace_print+0x28) [0xffff77ca5e9c]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(+0x718b8) [0xffff77c198b8]
linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffff834945c0]
/usr/bin/python3() [0x4bf13c]
/usr/bin/python3(PyUnicode_FromFormatV+0x748) [0x4ba628]
/usr/bin/python3(PyErr_Format+0xb8) [0x4beba8]
/usr/bin/python3(_PyObject_MakeTpCall+0x694) [0x4a8304]
/usr/bin/python3(_PyObject_CallFunction_SizeT+0x254) [0x4daa04]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(+0x730a8) [0xffff77c1b0a8]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(+0x84ec0) [0xffff77c2cec0]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(+0xd68c4) [0xffff77c7e8c4]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(+0xdb0d4) [0xffff77c830d4]
/usr/local/lib/python3.8/dist-packages/_awscrt.cpython-38-aarch64-linux-gnu.so(+0xfe7b8) [0xffff77ca67b8]
/lib/aarch64-linux-gnu/libpthread.so.0(+0x8628) [0xffff8343b628]
/lib/aarch64-linux-gnu/libc.so.6(+0xd4e8c) [0xffff8325fe8c]
torizon@CB20300086:~$ [192855.878981] systemd-journald[494]: Failed to create new runtime journal: Interrupted system call
[192855.888053] systemd-journald[494]: Failed to open user journal directory '/var/log/journal/91b2c61e50bb4e7da5945ec3b1f7a0a1/': No such file or directory

Hi @jmklix and @TwistedTwigleg,
I think I have figured out the problem. The destroyed message occurs when I recreate the mqtt connection object and a packet is already in-flight. This happens sometimes as my system comes out of sleep. I think this is also the cause of crash which occurs after a number of these messages.
I have resolved the problem by persisting the mqtt connection object through sleep so I no longer recreate it when the device wakes up. I no longer get the destroyed messages or the crash.
Any thoughts on why this occurs? I did not think it would be a problem to recreate this object periodically.
Thanks

Ed

Hi @edcloudcycle
I'm glad you were able to find what was causing the issue and a solution for it!

If you don't mind me asking:

  • By the system coming out of sleep, do you mean the sleep function in Python or the OS being put in a sleep/suspended state?
  • How are you recreated the mqtt connection object?
    • Are you reassigning the original client to the new client in your code, like client = new_client for example?
    • Are you disconnecting the mqtt client prior to reassigning it?
  • Can you share the code snippet you were using for recreating the mqtt client?

Generally, it is recommended to keep the mqtt connection associated with operations (publish, subscribe, etc) alive if you have in-flight messages, as the acks you receive can then be mapped to that mqtt client and processed as needed. Taking a quick look at the relevant parts of the code, I think what might be happening is that when you reassign the mqtt client, it frees the native memory for that client (the C code stuff), and then when it tries to process the future for the in-flight message, it points to freed memory and crashes.

That is my guess right now, but I would need to try and recreate the problem to know for sure.

Hi @TwistedTwigleg,

  • The python application stops all its threads and powers down associated hardware it then calls the OS sleep and wakes up on an RTC alarm. When it wakes all threads are restarted and retain some state from before as required.
  • Reassigning, see code below, the self._mqtt_connection object persists during sleep.
  • I do disconnect before sleeping. just calling disconnect()

This function was being called when waking from sleep and caused the destroyed message when a packet was in flight. Posted as an image as the code formatting seems to mangle python.
image

Thanks @edcloudcycle! I will try and see if I can reproduce this issue on my end.

In the meantime, I would highly recommend persisting the MqttConnection until all the in-flight/pending messages are processed before reassigning the MqttConnection to avoid this issue. ๐Ÿ‘

Hey @edcloudcycle,

I've been looking into this and I think everything is working as expected, but it is a bit confusing and it took some tinkering locally to get it all sorted out on what is going on. I have attached the Python file I used (a modified version of the PubSub sample) and I'll try to do my best to explain, but if anything is unclear, please do not hesitate to ask for more clarification.

Modified version of the PubSub sample: pubsub.py.zip. Note that it needs to be run in the samples folder of the V2 SDK to work due to needing command_line_utils.py.

Something I should add right away though, is that I was not able to reproduce the crash you are seeing when you push the system a bit harder. It might be I do not have the right setup, or it might be something else, but I was not able to get it to happen even when increasing all the operations.


First, the error you are seeing on the publish callback happens because the MQTT connection is disconnected and destroyed, and so any in-flight messages will never be able to have a callback called on them, and so the publish is a "failure". Now, the publish may have actually gone out to IoT Core, but because QOS1 requires an ACK back from the server to say it has gotten the message, and the MQTT connection has not gotten that ACK yet (hence it is in-flight), the client has to assume the publish failed.
The reason it has to assume it is because if you are disconnected and the MQTT connection is completely destroyed (more on this in a moment), there is no way it can get an ACK and respond, as there is nothing to get it anymore in memory. To show this, you get the AWS_ERROR_MQTT_CONNECTION_DESTROYED callback because there is no client connection anymore at all, it is completely destroyed.

Now, the tricky and confusing bit is that you are not destroying the connection directly, you are reassigning it to a new connection. However, from Python's and the MQTT connection's perspective, it has NO IDEA there is going to be a new connection recreated, it just sees disconnect and then destroy (via the reassign) and so it goes "there is no way I can get these now, so fail them!" and emits the AWS_ERROR_MQTT_CONNECTION_DESTROYED on the callback. Python will set the "old" connection's ref-count to 0, and it will be freed and the variable that was pointing to this freed connection will now point to the new recreated one.

Now, the reason you do NOT get this on the disconnect itself, and it is instead occurring on the reassignment of the connection, is because you have clean_start set to False. When clean_start is set to False, if you reconnect to the server you have a potential to resume your session (assuming you connect soon enough, as configured by the settings when you make the connection object) and if you resume your session, the client will resend the QOS1 publish and/or get the original ACK back. In other words, with clean_start set to false, there is a possibility any in-flight messages will be properly recovered if you reconnect.

On the other hand though, if you use QOS0, where you send a publish but do NOT have to get an ACK back from the server confirming it has gotten the message, then you can disconnect and reassign (or just clear) the MQTT Connection and it will not have any exceptions because in QOS0, there is not a concept of in-flight messages because it doesn't NEED a response from the server.

Inversely, if you have clean_start set to true and then disconnect, you will get the AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION immediately as soon as you disconnect. This is because with clean_start set to true, when you reconnect it will be treated as a completely new session, and so any previously in-flight messages are canceled on this disconnect because, when you reconnect, you are not resuming/picking-up, but instead are starting completely fresh. It is a different error, but same result ultimately because in both cases (clean_start being true, or deleting the MQTT connection via reassign or other), the new connection is completely fresh and unaware of what happened in the prior connection.


Hopefully that made sense! It is confusing and I had to recreate what was going on to have a more precise idea on what was happening and why. ๐Ÿ˜…
Again, if anything does not make sense, please let me know and I'll do my best to clarify! ๐Ÿ™‚

To summarize: the AWS_ERROR_MQTT_CONNECTION_DESTROYED exception is sent in a future whenever an in-flight message is in a state where the MQTT connection believes it cannot be recovered, either because the connection is completely destroyed (therefore nowhere for it to go) or because clean_start is true (and therefore a completely new session is made on re-connect anyway).


There are a few possible solutions you can implement in your code though to help with this issue and respond to it when it occurs:

The first possible solution is to put the result.get() call in the publish callback inside a try-except block in Python and, when the error is AWS_ERROR_MQTT_CONNECTION_DESTROYED, handle the situation knowing that all the MQTT client is trying to say is that the in-flight message cannot be received.
This is the solution I would recommend, as I believe it will give you the most flexibility and allow you to respond to when in-flight messages are canceled due to the reasons I mentioned above, which can be helpful should you need to add code to respond to these situations.

A second possible solution is to call disconnect prior to putting the device to sleep, like you are currently, and then just call connect once the device is awake again, so never recreating the connection at all. Do note that clean_start needs to be false. You should not get the AWS_ERROR_MQTT_CONNECTION_DESTROYED if you just reconnect using connect rather than making a new MQTT connection via a reassign.

Hi @TwistedTwigleg,
Thanks so much for the detailed response that's very helpful.
I think I have arrived at your suggestions independently and my new build is working well. For the crash I had to get the destroyed message a lot so 100+ times over a few days before it occured.
I added a catch for the destroyed message and rescue the packet like this:
image
Although I no longer see this message because I now no longer recreate the connection when waking from sleep. I agree it just seems unnesscessary.

โš ๏ธCOMMENT VISIBILITY WARNINGโš ๏ธ

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.