eclipse/paho.mqtt.c

The crash in macOS occurs after the MQTTProtocol_emptyMessageList method.

JyHu opened this issue · 3 comments

Describe the bug
We recently integrated the paho.c framework into our macOS project, but after going live, we encountered multiple crashes in production. The specific stack trace is as follows:

Thread 6 name:  MQTTAsync_send
Thread 6 Crashed:
0   libsystem_kernel.dylib          0x00007ff81f40400e __pthread_kill + 10
1   libsystem_c.dylib               0x00007ff81f385d24 abort + 123
2   libsystem_malloc.dylib          0x00007ff81f263357 malloc_vreport + 551
3   libsystem_malloc.dylib          0x00007ff81f26652b malloc_report + 151
4   libPahoC                        0x00000001124782e4 MQTTProtocol_emptyMessageList + 164
5   libPahoC                        0x0000000112474316 MQTTAsync_cleanSession + 454
6   libPahoC                        0x0000000112474bd9 MQTTAsync_checkDisconnect + 121
7   libPahoC                        0x000000011247125f MQTTAsync_sendThread + 3039
8   libsystem_pthread.dylib         0x00007ff81f43a4e1 _pthread_start + 125
9   libsystem_pthread.dylib         0x00007ff81f435f6b thread_start + 15

The last log message appears after the callback into the client's business code at this line: onPublishFaile. However, because no further log information is available, it is hard to tell whether this is actually relevant. We have added more logging in the upcoming version and will have to wait for its release to collect information from the production environment.

We hope you can help.

Log files
backtrace.log

Environment

  • OS: macOS
  • OS version: macOS 12.0 to macOS 14.1
  • paho version: main branch at d7b8c17

icraggs commented

When you go into production, I would advise using a release rather than a snapshot of the develop branch taken at some point in time. You are missing an important fix, and that area should be better in the 1.3.12 and 1.3.13 releases. I don't know whether this affects your particular situation, but I would still advise against using a snapshot.

You could try provoking a publish failure in your testing to see whether the crash occurs, and whether your application is doing something at that point to cause it. The broker you are connecting to might have more information about the publish failure and about what you would need to do to reproduce the situation in a test environment. Exceeding the maximum number of inflight messages, or trying to publish to a topic for which the application is not authorized, are some examples. You'll probably want some publishes to succeed first.

JyHu commented

Thank you for your advice. We will add more detailed logging in the upcoming version to help identify the issue, and then continue debugging based on your suggestions. We greatly appreciate your help, and we will report back promptly if we gather any useful information.

Thanks again.

JyHu commented

@icraggs
Because we compile the Paho project ourselves and embed it in our macOS app's main project, we cannot obtain more detailed stack information for the Paho framework when a crash occurs. We have therefore set the Paho trace level to the most verbose setting (MQTTASYNC_TRACE_MAXIMUM), hoping to generate more log information to assist in troubleshooting. The most recent entries in the logs we received are as follows:

2023/11/21 03:48:36:411  --> trace 20231121 114836.409 26 macos_default_mqtt_CA83FE22-B921-48B1-BA67-CFEF34F69634 <- PUBLISH msgid: 0 qos: 0 retained: 0 payload len(2928): [{"market":"HK","ind
2023/11/21 03:48:36:411  --> trace 20231121 114836.409 Calling messageArrived for client macos_default_mqtt_CA83FE22-B921-48B1-BA67-CFEF34F69634, queue depth 0
2023/11/21 03:48:37:412  --> trace 20231121 114836.409 Return code 0 from poll
2023/11/21 03:48:38:515  --> trace 20231121 114836.409 Return code 0 from poll
2023/11/21 03:48:39:618  --> trace 20231121 114836.409 Return code 0 from poll
2023/11/21 03:48:40:735  --> trace 20231121 114840.734 Return code 0 from poll
2023/11/21 03:48:49:098  --> trace 20231121 114840.734 sent 771 256 buflen 5
2023/11/21 03:48:49:098  --> trace 20231121 114840.734 sent 771 257 buflen 1
2023/11/21 03:48:49:098  --> trace 20231121 114840.734 27 macos_default_http_041B50E9-6E23-423A-B23E-6650F1C8ED5F -> PUBLISH msgid: 2 qos: 1 retained: 0 rc 0 payload len(74): {"hkDelay":true,"lit
2023/11/21 03:48:49:108  --> trace 20231121 114840.734 Return code 1 from poll
2023/11/21 03:48:49:109  --> trace 20231121 114840.734 sent 771 256 buflen 5
2023/11/21 03:48:49:109  --> trace 20231121 114840.734 sent 771 257 buflen 1
2023/11/21 03:48:49:110  --> trace 20231121 114840.734 SSLSocket error (5) in SSL_write for socket 27 rc -1 errno 32 Broken pipe
2023/11/21 03:48:49:110  --> trace 20231121 114840.734 27 macos_default_http_041B50E9-6E23-423A-B23E-6650F1C8ED5F -> PUBLISH msgid: 3 qos: 1 retained: 0 rc -1 payload len(74): {"hkDelay":true,"lit
2023/11/21 03:48:49:110  --> trace 20231121 114840.734 Calling command failure for client macos_default_http_041B50E9-6E23-423A-B23E-6650F1C8ED5F
2023/11/21 03:48:49:164  --> trace 20231121 114840.734 SSLSocket error (1) in SSL_write for socket 27 rc -1 errno 0 Undefined error: 0
2023/11/21 03:48:49:164  --> trace 20231121 114840.734 27 macos_default_http_041B50E9-6E23-423A-B23E-6650F1C8ED5F -> DISCONNECT (-1)
2023/11/21 03:48:49:164  --> trace 20231121 114840.734 Removed socket 27

The app then crashed.
We have examined all of the collected Paho logs, and the final segment is the same in every one of them.
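
For anyone comparing many collected logs, the SSLSocket error lines in the excerpt above can be pulled out mechanically instead of by eye. The following is a minimal sketch in C, assuming the trace line format is exactly the one shown in the excerpt; SslTraceError and parse_ssl_write_error are hypothetical names used for illustration and are not part of the Paho API. If the parenthesized value comes from OpenSSL's SSL_get_error (an assumption, not confirmed in this thread), the values 5 and 1 seen above would correspond to SSL_ERROR_SYSCALL and SSL_ERROR_SSL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields taken from one "SSLSocket error" trace line (hypothetical type). */
typedef struct {
    int ssl_error;  /* value in parentheses after "SSLSocket error" */
    int socket;     /* socket number the write was attempted on */
    int rc;         /* return code logged for SSL_write */
    int sys_errno;  /* errno value logged on the same line */
} SslTraceError;

/*
 * Parse a trace line shaped like
 *   "... SSLSocket error (5) in SSL_write for socket 27 rc -1 errno 32 Broken pipe"
 * Returns 1 and fills *out when all four numbers are found, 0 otherwise.
 * On a 0 return, *out may have been partially written and should be ignored.
 */
int parse_ssl_write_error(const char *line, SslTraceError *out)
{
    const char *p;

    assert(line != NULL && out != NULL);

    /* Skip the timestamp and logger prefix, whatever they look like. */
    p = strstr(line, "SSLSocket error");
    if (p == NULL)
        return 0;

    /* The numeric error code starts at the first '(' after the marker. */
    p = strchr(p, '(');
    if (p == NULL)
        return 0;

    return sscanf(p, "(%d) in SSL_write for socket %d rc %d errno %d",
                  &out->ssl_error, &out->socket, &out->rc, &out->sys_errno) == 4;
}
```

Applied to the line logged at 03:48:49:110 above, this fills in ssl_error 5, socket 27, rc -1 and errno 32, while lines such as "Return code 0 from poll" are rejected. Running it over the tail of each collected log gives a quick way to confirm that every crash was preceded by the same pair of write failures.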

Additionally, we suspect the crash is related to the user's network conditions, because the received logs also show numerous HTTP requests timing out or failing outright. For example:

>> [E] 11/21 11:48:38.8070 -[TBBaseRequestEngine innerExecute:responseLevel:]_block_invoke[179] Request failed: (null), Error: Error Domain=NSURLErrorDomain Code=-1009 "The Internet connection appears to be offline." UserInfo={_kCFStreamErrorCodeKey=50, NSUnderlyingError=0x600000057480 {Error Domain=kCFErrorDomainCFNetwork Code=-1009 "(null)" UserInfo={_kCFStreamErrorDomainKey=1, _kCFStreamErrorCodeKey=50, _NSURLErrorNWResolutionReportKey=Resolved 0 endpoints in 0ms using unknown from cache, _NSURLErrorNWPathKey=unsatisfied (No network route)}}, _NSURLErrorFailingURLSessionTaskErrorKey=LocalDataTask <7C0FE3AA-D75D-4795-A798-D49A6D76F89C>.<4560>, _NSURLErrorRelatedURLSessionTaskErrorKey=(
    "LocalDataTask <7C0FE3AA-D75D-4795-A798-D49A6D76F89C>.<4560>"
), NSLocalizedDescription=The Internet connection appears to be offline., NSErrorFailingURLStringKey=<request url>, NSErrorFailingURLKey=<request url>, _kCFStreamErrorDomainKey=1}

We hope to receive more assistance or hints from you.
Thank you very much.