commaai/openpilot

athenad memory leak


athenad was using 15% of memory, causing low-memory events. I had not rebooted the device in a few days iirc and had uploaded many rlogs.

I tried to attach memray to the process, but it just crashed, so I didn't get any information out of it. @deanlee, could this be similar to the leak you found in athena previously?

It's hard to identify the exact issue, as the code is unclear and difficult to follow. However, multiple factors could be contributing to the resource leak. One possibility is that athenad's implementation causes the garbage collector to accumulate objects over time, leading to memory growth during long runs.
Additionally, thread race conditions around the upload queue and other global variables (like cur_upload_items, cancelled_uploads, ...), as well as bad practices like not calling task_done() after queue.get(), could also contribute.
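As a side note on the task_done() point, here is a minimal sketch of the usual get()/task_done() pairing, with generic names rather than athenad's actual code:

```python
import queue
import threading

q: queue.Queue = queue.Queue()

def process(item) -> None:
  pass  # placeholder for the real per-item work (e.g. an upload attempt)

def worker() -> None:
  while True:
    item = q.get()
    try:
      process(item)
    finally:
      q.task_done()  # without this, q.join() never unblocks and the queue's
                     # unfinished-task counter only ever grows

threading.Thread(target=worker, daemon=True).start()
q.put("example-item")
q.join()  # returns only because task_done() was called for every get()
```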

The following issues and fixes address potential resource leaks and long-term accumulation risks:

  1. Memory Leaks in Python requests: a common issue is keeping references to response objects longer than necessary, which prevents them from being garbage collected and leads to memory growth. #34101
  2. Unclosed WebSocket: Not closing the WebSocket after a main loop restart can cause resource leaks by keeping connections open. #34085
  3. Thread Safety in Upload Handling: Lack of proper synchronization around upload_queue, cancelled_uploads, and cur_upload_items can lead to race conditions, resulting in thread termination, main-loop restarts, inconsistent or corrupted data, and queue bloat (see the locking sketch after this list). #34084
  4. Thread Safety in ws_manage: The Python socket object is not thread-safe. A mutex should protect the setsockopt() calls while ws_send or ws_recv are sending or receiving messages, preventing race conditions and keeping multithreaded access to the socket safe.
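For item 3, a minimal locking sketch; the names below are placeholders modeled on the globals mentioned above, not athenad's actual implementation:

```python
import threading

upload_lock = threading.Lock()
cur_upload_items: dict[str, dict] = {}
cancelled_uploads: set[str] = set()

def cancel_upload(upload_id: str) -> None:
  # Every read/modify of the shared dict/set happens under the same lock, so the
  # RPC handler threads and upload_handler cannot interleave mid-update.
  with upload_lock:
    cancelled_uploads.add(upload_id)
    cur_upload_items.pop(upload_id, None)

def start_upload(upload_id: str, item: dict) -> bool:
  with upload_lock:
    if upload_id in cancelled_uploads:
      cancelled_uploads.discard(upload_id)
      return False
    cur_upload_items[upload_id] = item
    return True
```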

This is still happening after uploading files

I see this after uploading rlogs:

lock                            120       +47
cell                            828       +47
builtin_function_or_method     1685       +43
list                           1776       +38
tuple                          4715       +30
ReferenceType                  2736       +30
Event                            75       +15
socket                           24       +15
Condition                        54       +14
deque                            62       +14
method                          109       +13
function                       7974       +13
SSLContext                       14        +8
WebSocket                        13        +8
sock_opt                         13        +8
frame_buffer                     13        +8
continuous_frame                 13        +8
SplitResult                      14        +8
SSLSocket                        14        +8
_SSLSocket                       14        +8
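For reference, a snapshot like this can be produced with something along these lines; a sketch assuming the third-party objgraph package, whose show_growth() output has the same type/count/delta columns:

```python
import gc
import objgraph

gc.collect()                     # drop anything that is already collectable
objgraph.show_growth(limit=25)   # first call records a baseline
# ...trigger an upload cycle here...
gc.collect()
objgraph.show_growth(limit=25)   # second call prints only the types that grew
```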

The issue may not just be with the upload process—WebSocket instance growth could indicate deeper resource management problems.

@sshane: Do we really need the ws_manage thread? It doesn't handle exceptions, and changing socket options while the WebSocket is sending or receiving could cause race conditions or undefined behavior. If socket options need to be set, a better approach would be to close the connection, reconnect, and configure the socket before restarting the ws_recv/ws_send threads (see the sketch below). Also, storing sock = ws.sock at the start of the thread and assuming it stays valid throughout the loop may not be safe.

It may not be directly related to this issue, but I recommend removing this thread until we have a more robust solution.
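If the keepalive tuning is still wanted, a minimal sketch of that ordering; the URL, option values, and the Linux-only TCP_KEEP* constants are illustrative, not athenad's actual settings:

```python
import socket
from websocket import WebSocket, create_connection  # websocket-client, used by athenad

def connect_and_configure(url: str) -> WebSocket:
  # Connect, set keepalive options once, and only then hand the socket to the
  # ws_send/ws_recv threads; no setsockopt calls after other threads use it.
  ws = create_connection(url, enable_multithread=True)
  sock = ws.sock
  sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
  return ws
```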

The current athenad implementation lacks protections for concurrent data access in a multi-threaded environment. Besides PR #34084, which fixes race conditions for global variables, the UploadQueueCache class has multi-threading issues. It is accessed in both uploadFilesToUrls and upload_handler, which run in different threads, but UploadQueueCache is not thread-safe.
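One way to serialize that access is a single lock around every cache operation, sketched here with placeholder class and method names rather than athenad's real API:

```python
import threading
from functools import wraps

_queue_cache_lock = threading.Lock()

def synchronized(fn):
  # Serialize every call behind one lock; wrapping whichever cache methods touch
  # shared state keeps uploadFilesToUrls and upload_handler from racing each other.
  @wraps(fn)
  def wrapper(*args, **kwargs):
    with _queue_cache_lock:
      return fn(*args, **kwargs)
  return wrapper

class QueueCacheExample:
  # Placeholder names, for illustration only.
  _items: dict[str, dict] = {}

  @classmethod
  @synchronized
  def store(cls, item_id: str, item: dict) -> None:
    cls._items[item_id] = item

  @classmethod
  @synchronized
  def remove(cls, item_id: str) -> None:
    cls._items.pop(item_id, None)
```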

@sshane: The current implementation does not preserve the order between received commands and their responses. Responses may be sent out of order.

Commands are placed in the recv_queue and processed by multiple jsonrpc_handler threads, with responses added to the send_queue after handling. Due to the multithreaded nature, the order of responses may differ from the order of received commands.

Is this behavior by design, or should it be considered an issue?
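A toy illustration of the behavior, with generic queue names rather than athenad's actual code:

```python
import queue
import threading
import time

recv_queue: queue.Queue = queue.Queue()
send_queue: queue.Queue = queue.Queue()

def handler() -> None:
  while True:
    req = recv_queue.get()
    time.sleep(req["work"])              # simulate per-request handling time
    send_queue.put({"id": req["id"]})
    recv_queue.task_done()

for _ in range(2):
  threading.Thread(target=handler, daemon=True).start()

recv_queue.put({"id": 1, "work": 0.2})   # slow request arrives first
recv_queue.put({"id": 2, "work": 0.0})   # fast request arrives second
recv_queue.join()
print([send_queue.get()["id"] for _ in range(2)])  # typically [2, 1]
```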

The order of responses does not matter if the backend server handles it correctly, since the JSON-RPC spec covers this. The 'requester' sends a request with a unique ID, which could just be a nanosecond timestamp, for example. When the 'responder' (athena in this case) responds, it includes the same ID in the payload: https://www.jsonrpc.org/specification#response_object. This is done automatically, so it's opaque if you just look at the athena code.
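For illustration, the pairing looks like this; the method name, result, and ID scheme here are made up:

```python
import time

request = {
  "jsonrpc": "2.0",
  "method": "getVersion",
  "params": {},
  "id": time.monotonic_ns(),   # any unique value works, e.g. a nanosecond timestamp
}

response = {
  "jsonrpc": "2.0",
  "result": "0.9.x",
  "id": request["id"],         # echoed back unchanged by the responder
}

assert response["id"] == request["id"]
```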

Comma's connect backend server should be generating pseudo-unique IDs and buffering the responses in a hashmap. However, as seen here: https://github.com/commaai/connect/blob/79880f1203269eb145bc87863808e6cfca797ed1/src/actions/files.js#L147
it looks like the id is always zero, so it might be worth double-checking that the backend handles this properly.
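A sketch of the requester-side bookkeeping described above, with purely illustrative names rather than connect's actual code:

```python
import itertools

_next_id = itertools.count(1)
_pending: dict[int, dict] = {}

def send_request(method: str, params: dict) -> int:
  req_id = next(_next_id)              # never reuse the same id (e.g. always 0)
  _pending[req_id] = {"method": method, "params": params}
  # ...send {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}...
  return req_id

def on_response(msg: dict) -> None:
  req = _pending.pop(msg["id"], None)  # pair by id, regardless of arrival order
  if req is None:
    return                             # unknown or duplicate id, drop it
  # ...deliver msg["result"] to whoever issued req...
```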