Celery daemon crash due to Oracle error: DPI-1010 no connection
andreasgou opened this issue · 13 comments
There is a serious issue in release 2.2.0 that causes the Celery daemon to stop unexpectedly.
From the stack trace, it seems to be triggered during an update fired from the django_celery_results backend.
Check out the bold words below.
Thank you
2022-07-01 03:16:23,239 [CRITICAL] celery.worker: Unrecoverable error: DatabaseError(<cx_Oracle._Error object at 0x7f298520c370>)
Traceback (most recent call last):
File "/../.virtualenvs/../python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/../.virtualenvs/../python3.9/site-packages/django/db/backends/oracle/base.py", line 523, in execute
return self.cursor.execute(query, self._param_generator(params))
cx_Oracle.DatabaseError: DPI-1010: not connected
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/../.virtualenvs/../python3.9/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/../.virtualenvs/../python3.9/site-packages/celery/bootsteps.py", line 365, in start
return self.obj.start()
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/consumer/consumer.py", line 326, in start
blueprint.start(self)
File "/../.virtualenvs/../python3.9/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/consumer/consumer.py", line 618, in start
c.loop(*c.loop_args())
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/loops.py", line 97, in asynloop
next(loop)
File "/../.virtualenvs/../python3.9/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
cb(*cbargs)
File "/../.virtualenvs/../python3.9/site-packages/kombu/transport/base.py", line 235, in on_readable
reader(loop)
File "/../.virtualenvs/../python3.9/site-packages/kombu/transport/base.py", line 217, in _read
drain_events(timeout=0)
File "/../.virtualenvs/../python3.9/site-packages/amqp/connection.py", line 523, in drain_events
while not self.blocking_read(timeout):
File "/../.virtualenvs/../python3.9/site-packages/amqp/connection.py", line 529, in blocking_read
return self.on_inbound_frame(frame)
File "/../.virtualenvs/../python3.9/site-packages/amqp/method_framing.py", line 77, in on_frame
callback(channel, msg.frame_method, msg.frame_args, msg)
File "/../.virtualenvs/../python3.9/site-packages/amqp/connection.py", line 535, in on_inbound_method
return self.channels[channel_id].dispatch_method(
File "/../.virtualenvs/../python3.9/site-packages/amqp/abstract_channel.py", line 143, in dispatch_method
listener(*args)
File "/../.virtualenvs/../python3.9/site-packages/amqp/channel.py", line 1613, in _on_basic_deliver
fun(msg)
File "/../.virtualenvs/../python3.9/site-packages/kombu/messaging.py", line 626, in _receive_callback
return on_m(message) if on_m else self.receive(decoded, message)
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/consumer/consumer.py", line 586, in on_task_received
strategy(
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/strategy.py", line 162, in task_message_handler
if (req.expires or req.id in revoked_tasks) and req.revoked():
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/request.py", line 456, in revoked
self._announce_revoked(
File "/../.virtualenvs/../python3.9/site-packages/celery/worker/request.py", line 438, in _announce_revoked
self.task.backend.mark_as_revoked(
File "/../.virtualenvs/../python3.9/site-packages/celery/backends/base.py", line 272, in mark_as_revoked
self.store_result(task_id, exc, state,
File "/../.virtualenvs/../python3.9/site-packages/celery/backends/base.py", line 528, in store_result
self._store_result(task_id, result, state, traceback,
File "/../.virtualenvs/../python3.9/site-packages/django_celery_results/backends/database.py", line 66, in _store_result
self.TaskModel._default_manager.store_result(
File "/../.virtualenvs/../python3.9/site-packages/django_celery_results/managers.py", line 46, in _inner
return fun(*args, **kwargs)
File "/../.virtualenvs/../python3.9/site-packages/django_celery_results/managers.py", line 168, in store_result
obj, created = self.using(using).get_or_create(task_id=task_id,
File "/../.virtualenvs/../python3.9/site-packages/django/db/models/query.py", line 581, in get_or_create
return self.get(**kwargs), False
File "/../.virtualenvs/../python3.9/site-packages/django/db/models/query.py", line 431, in get
num = len(clone)
File "/../.virtualenvs/../python3.9/site-packages/django/db/models/query.py", line 262, in __len__
self._fetch_all()
File "/../.virtualenvs/../python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/../.virtualenvs/../python3.9/site-packages/django/db/models/query.py", line 51, in __iter__
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/../.virtualenvs/../python3.9/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
cursor.execute(sql, params)
File "/../.virtualenvs/../python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/../.virtualenvs/../python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/../.virtualenvs/../python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/../.virtualenvs/../python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/../.virtualenvs/../python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/../.virtualenvs/../python3.9/site-packages/django/db/backends/oracle/base.py", line 523, in execute
return self.cursor.execute(query, self._param_generator(params))
django.db.utils.DatabaseError: DPI-1010: not connected
The recent release probably fixes this: https://github.com/celery/django-celery-results/releases/tag/v2.4.0
That was my hope too. Unfortunately, bug #326 in task_results leaves fields missing (task_name, worker, etc.), so I can't deploy it and had to roll back to 2.3.1.
If this fix is included in 2.3.1, then I'm lucky!
Is it?
You will have to check the releases yourself.
Sorry about that. I tried, but I only see the commit number, not the release tag (I'm not very experienced with GitHub).
#326 (comment): this does the trick.
Lifesaver!
I will push a bug-fix release soon.
Hello again,
Unfortunately, the issue with cx_Oracle.DatabaseError: DPI-1010: not connected has not been resolved in version 2.4.0.
I can't reproduce the error in my lab, and it is critical because it causes the Celery worker daemon to stop in production environments. It is probably due to scheduled DB maintenance that IT performs (out of our hands).
According to the docs, DPI-1010 is raised when the system tries to use a closed connection. I believe the module could handle this. The point of handling it is to keep the Celery daemon from shutting down, since these errors can happen at any time.
What are your thoughts on this?
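To make the suggestion above concrete, here is a minimal sketch of what "handling it" could look like: catch the dropped-connection error, discard the stale connection, and retry once. This is not code from django_celery_results. The names `retry_on_disconnect`, `reset`, `is_disconnect`, and `looks_like_dpi_1010` are hypothetical, and matching on the "DPI-1010" text is only an assumption based on the message in the traceback.

```python
import functools
import time


def retry_on_disconnect(reset, is_disconnect, retries=1, delay=0.0):
    """Retry a call after a dropped-connection error (illustrative only).

    reset:         hypothetical hook that discards the stale connection
    is_disconnect: callable(exc) -> bool, decides whether the error is retryable
    retries:       number of extra attempts after the first failure
    delay:         seconds to wait before each retry
    """
    def decorator(fun):
        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fun(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on unrelated errors.
                    if attempt == retries or not is_disconnect(exc):
                        raise
                    reset()
                    if delay:
                        time.sleep(delay)
        return wrapper
    return decorator


def looks_like_dpi_1010(exc):
    # Assumption: the driver error text contains the code seen in the traceback.
    return "DPI-1010" in str(exc)
```

In a Django project, `reset` could plausibly be `django.db.connection.close`, since Django opens a new connection on the next query. Whether that is safe inside the result backend (for example, within a transaction) is something the maintainers would need to confirm.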
A thorough inspection of the firewall logs revealed a broken TCP connection at a specific time of day (midnight UTC) while trying to access the DB, and this causes the DB connection to fail.
This is probably due to scheduled maintenance (log rotation, or whatever happens at that time in the network).
However, this error propagates to the Celery worker daemon, causing an unexpected failure and an abrupt stop of the service, with no error or other indication in syslog except this DPI-1010.
As a workaround, we use supervisord to restart the Celery worker automatically. Do you think the patch you applied could make any difference?
Thanks
The patch was applied for a different purpose. So this is more of a network and operational issue? In that case, we have to figure out a more resilient approach.