athenad memory leak
It was using 15% of memory, causing low memory events. I had not rebooted the device in a few days iirc and had uploaded many rlogs.
I tried to attach with memray, but the process just crashed instead, so I didn't get any information about this. @deanlee could this be similar to the leak you found in athena previously?
It's hard to identify the exact issue, as the code is unclear and difficult to understand. However, multiple factors could be causing the resource leak. One possibility is that athenad's implementation might cause the garbage collector to accumulate objects over time, leading to memory issues during long runs.
Additionally, thread race conditions around the queue and other global variables (like cur_upload_items, cancelled_uploads, ...), as well as bad practices like not calling task_done() after queue.get(), could also contribute to these issues.
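A minimal sketch of the task_done() point, assuming a worker loop roughly like athenad's upload_handler (the function below and the do_upload placeholder are illustrative, not the actual code):

```python
import queue

upload_queue: queue.Queue = queue.Queue()

def do_upload(item) -> None:
    # placeholder for the real upload step (hypothetical)
    print("uploading", item)

def upload_handler_sketch() -> None:
    # Always pair queue.get() with task_done(); otherwise the queue's
    # unfinished-task count never drops and queue.join() would block forever.
    while True:
        item = upload_queue.get()
        try:
            do_upload(item)
        finally:
            upload_queue.task_done()
```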
The following issues and fixes address potential resource leaks and long-term accumulation risks:
- Memory Leaks in Python requests: A common issue occurs when requests keep references to response objects in memory unnecessarily, leading to memory leaks by preventing garbage collection. #34101
- Unclosed WebSocket: Not closing the WebSocket after a main loop restart can cause resource leaks by keeping connections open. #34085
- Thread Safety in Upload Handling: Lack of proper synchronization for upload_queue, cancelled_uploads, and cur_upload_items can lead to race conditions, resulting in thread termination, main loop restarts, and inconsistent or corrupted data, as well as queue bloat. #34084 (a locking sketch follows this list)
- Thread Safety in ws_manage: The Python socket object is not thread-safe. A mutex should be used to protect socket.setsockopt while ws_send or ws_recv is sending or receiving messages, preventing race conditions and ensuring safe multithreaded operations.
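As a minimal sketch of the kind of synchronization PR #34084 is about (the lock and helper functions here are illustrative, not the actual patch):

```python
import threading

# Shared state touched by several threads in athenad-like code (names taken
# from the discussion above; the lock itself is illustrative, not PR #34084).
upload_lock = threading.Lock()
cur_upload_items: dict[int, object] = {}
cancelled_uploads: set[str] = set()

def cancel_upload(upload_id: str) -> None:
    # Mutate shared containers only while holding the lock.
    with upload_lock:
        cancelled_uploads.add(upload_id)

def pop_upload_item(thread_id: int):
    # Reads/writes that can race with other threads take the same lock.
    with upload_lock:
        return cur_upload_items.pop(thread_id, None)
```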
This is still happening after uploading files
I see this after uploading rlogs:
lock 120 +47
cell 828 +47
builtin_function_or_method 1685 +43
list 1776 +38
tuple 4715 +30
ReferenceType 2736 +30
Event 75 +15
socket 24 +15
Condition 54 +14
deque 62 +14
method 109 +13
function 7974 +13
SSLContext 14 +8
WebSocket 13 +8
sock_opt 13 +8
frame_buffer 13 +8
continuous_frame 13 +8
SplitResult 14 +8
SSLSocket 14 +8
_SSLSocket 14 +8
The issue may not just be with the upload process—WebSocket instance growth could indicate deeper resource management problems.
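The counts above look like the kind of per-type growth report produced by something such as objgraph.show_growth(); for anyone wanting to reproduce this, a sketch (objgraph is a third-party package, and the workload below is a hypothetical stand-in for the upload path):

```python
import gc
import objgraph  # third-party: pip install objgraph

def measure_growth(work) -> None:
    # First call records a baseline of per-type object counts.
    objgraph.show_growth(limit=30)
    work()          # exercise the suspected code path
    gc.collect()    # drop collectable garbage before re-counting
    # Second call prints the types that grew: type name, total count, +delta.
    objgraph.show_growth(limit=30)

leaked = []

if __name__ == "__main__":
    # hypothetical workload that retains objects so growth is visible;
    # in athenad this would be the rlog upload path
    measure_growth(lambda: leaked.extend(bytearray(16) for _ in range(1000)))
```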
@sshane: Do we really need the ws_manage thread? It doesn't handle exceptions, and changing socket options while the WebSocket is sending or receiving could cause race conditions or undefined behavior. If socket options need to be set, a better approach would be to close the connection, reconnect, and configure the socket before restarting the ws_recv/ws_send threads. Also, storing sock = ws.sock at the start of the thread and assuming it stays valid throughout the loop may not be safe.
It may not be directly related to this issue, but I recommend removing this thread until we have a more robust solution.
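For reference, a rough sketch of the alternative described above (reconnect and configure the socket before the I/O threads start); websocket.create_connection is from the websocket-client package athenad already uses, while the thread wiring and option choice here are illustrative:

```python
import socket
import threading
import websocket  # websocket-client, already used by athenad

def connect_and_configure(url: str) -> websocket.WebSocket:
    # Connect, then set socket options before any other thread touches the
    # socket, so no separate ws_manage thread needs to poke at ws.sock later.
    ws = websocket.create_connection(url, enable_multithread=True)
    ws.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return ws

def restart_connection(old_ws, url: str, ws_recv, ws_send, end_event):
    # Tear down the old connection, reconnect, configure, then start fresh
    # ws_recv/ws_send threads against the new socket.
    old_ws.close()
    ws = connect_and_configure(url)
    threads = [
        threading.Thread(target=ws_recv, args=(ws, end_event)),
        threading.Thread(target=ws_send, args=(ws, end_event)),
    ]
    for t in threads:
        t.start()
    return ws, threads
```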
The current athenad implementation lacks protections for concurrent data access in a multi-threaded environment. Besides PR #34084, which fixes race conditions for global variables, the UploadQueueCache class has multi-threading issues. It is accessed by both uploadFilesToUrls and upload_handler, which run in different threads, but UploadQueueCache is not thread-safe.
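One minimal way to make that access safe is to serialize it behind a single lock; a sketch (this wrapper and its method names are illustrative, not the real UploadQueueCache API):

```python
import threading

class SafeUploadQueueCache:
    # Illustrative wrapper: one lock serializes every read/write of the cached
    # upload queue, so upload_handler and uploadFilesToUrls cannot interleave
    # partial updates.
    _lock = threading.Lock()
    _items: dict[str, dict] = {}

    @classmethod
    def cache(cls, item_id: str, item: dict) -> None:
        with cls._lock:
            cls._items[item_id] = item

    @classmethod
    def remove(cls, item_id: str) -> None:
        with cls._lock:
            cls._items.pop(item_id, None)

    @classmethod
    def snapshot(cls) -> dict:
        with cls._lock:
            return dict(cls._items)
```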
@sshane: The current implementation does not preserve the order between received commands and their responses. Responses may be sent out of order.
Commands are placed in the recv_queue and processed by multiple jsonrpc_handler threads, with responses added to the send_queue after handling. Due to the multithreaded nature, the order of responses may differ from the order of received commands.
Is this behavior by design, or should it be considered an issue?
The order of responses does not matter if the backend server handles them correctly, since the JSON-RPC spec covers this. The 'requester' sends a request with a unique ID, which could just be a ns timestamp for example. When the 'responder' (athena in this case) responds, it includes the same ID in the payload: https://www.jsonrpc.org/specification#response_object. This is done automatically, so it's opaque if you just look at the athena code.
Comma's connect backend server should be generating the pseudo-unique IDs and buffering the responses in a hashmap. Although, as seen here https://github.com/commaai/connect/blob/79880f1203269eb145bc87863808e6cfca797ed1/src/actions/files.js#L147 it looks like the id is always zero, so it might be worth double-checking that the backend code handles this properly.
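To illustrate the spec's point, a small sketch of correlating out-of-order responses by id on the requester side (plain dicts stand in for the JSON-RPC payloads; this is not connect's actual code):

```python
import time

pending: dict[int, dict] = {}

def make_request(method: str, params: dict) -> dict:
    # Per JSON-RPC 2.0 the requester picks a unique id (a ns timestamp works);
    # the responder must echo the same id back in its response object.
    req = {"jsonrpc": "2.0", "id": time.monotonic_ns(), "method": method, "params": params}
    pending[req["id"]] = req
    return req

def on_response(resp: dict) -> None:
    # Responses can arrive in any order; match them to requests by id.
    original = pending.pop(resp.get("id"), None)
    if original is None:
        return  # unknown or duplicate id
    print(original["method"], "->", resp.get("result"))

# usage: send make_request(...) over the WebSocket, call on_response() per reply
req = make_request("uploadFilesToUrls", {"files_data": []})
on_response({"jsonrpc": "2.0", "id": req["id"], "result": "ok"})
```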