SupRsync docker on satp2 smurf servers crashes due to QueuePool limit
jlashner commented
We're seeing these failure messages on both smurf-srv19 and smurf-srv21.
2024-08-28T13:41:12+0000 run:0 CRASH: [Failure instance: Traceback: <class 'sqlalchemy.exc.TimeoutError'>: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/20/3o7r)
/usr/lib/python3.10/threading.py:1016:_bootstrap_inner
/usr/lib/python3.10/threading.py:953:run
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_threadworker.py:49:work
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:269:inContext
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:285:<lambda>
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:117:callWithContext
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:82:callWithContext
/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py:984:_running_wrapper
/opt/venv/lib/python3.10/site-packages/socs/agents/suprsync/agent.py:198:run
/opt/venv/lib/python3.10/site-packages/socs/db/suprsync.py:707:delete_files
/opt/venv/lib/python3.10/site-packages/socs/db/suprsync.py:391:get_deletable_files
/opt/venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py:2673:all
/opt/venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py:2827:_iter
/opt/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:2351:execute
/opt/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:2226:_execute_internal
/opt/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:2095:_connection_for_bind
<string>:2:_connection_for_bind
/opt/venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py:139:_go
/opt/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py:1189:_connection_for_bind
/opt/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:3276:connect
/opt/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:146:__init__
/opt/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py:3300:raw_connection
/opt/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py:449:connect
/opt/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py:1263:_checkout
/opt/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py:712:checkout
/opt/venv/lib/python3.10/site-packages/sqlalchemy/pool/impl.py:168:_do_get
]
2024-08-28T13:41:12+0000 run:0 Status is now "done".
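For context, the limits in the error message match SQLAlchemy's QueuePool defaults (pool_size=5, max_overflow=10, pool_timeout=30 s): once 15 connections are checked out at the same time, the next checkout waits 30 s and then raises the TimeoutError shown above. A minimal sketch of where those knobs live; the URL and values are illustrative only, not the actual socs engine setup:

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Hypothetical database URL; pool values shown are the QueuePool defaults,
# which are exactly the numbers reported in the error message.
engine = create_engine(
    "sqlite:///suprsync.db",
    poolclass=QueuePool,
    pool_size=5,      # persistent connections kept open in the pool
    max_overflow=10,  # extra connections allowed beyond pool_size
    pool_timeout=30,  # seconds to wait for a free connection before raising
)
```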
jlashner commented
After investigating, I'm thinking this may be due to the sleep time of the main loop in the SupRsync agent. Right now most agents are running with a 5 second loop time, and every 5 seconds each one runs something like 15 SQL queries. The 5 second sleep time was mainly useful for testing and debugging; in operation it can be increased to something like 60 seconds without any issues. I've tried increasing the sleep time for the sat-uhf agents, so I'm hoping that alleviates the issue.
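For illustration, here is a minimal sketch of the loop pattern being described. The `delete_files` name is taken from the traceback; `srfm`, the `sleep_time` parameter, and everything else here are stand-ins, not the actual agent code:

```python
import time

def run(srfm, sleep_time=60.0):
    """Hypothetical main-loop sketch: one pass of SQL-heavy work per iteration.

    With sleep_time=5 each agent issues its ~15 queries every 5 seconds;
    raising it to 60 seconds cuts the query (and connection-checkout) rate
    by a factor of 12.
    """
    while True:
        srfm.delete_files()     # several SQL queries per call (see traceback)
        time.sleep(sleep_time)  # ~5 s was handy for testing; ~60 s is fine in operation
```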