Celery task_postrun is called only after the span has finished in chains (due to a regression)
#10676 introduced a regression which causes the task_postrun signal to never be able to attach tags to a span when working with Celery canvases (chains, callbacks, chords, etc.).
When working with a chain, for instance, the task always calls apply_async on its callback just before it finishes (i.e. before task_postrun is fired). With the changes introduced in #10676, the span is already finished at that point (and has been sent to the tracer), so adding new tags does nothing, leaving the span degenerate (missing its tags).
This means that the following happens (a sketch for verifying this ordering yourself follows below):
- task_prerun is called, creating a new span and registering it under prerun_span
- The task is run (either successfully or not)
- Celery sees that there's a callback registered, and fires the next task in the chain
- Task.apply_async is invoked, but it's wrapped with _traced_apply_async_inner. Inside its finally block it sees that there's a prerun_span belonging to the worker span and therefore forcefully finishes it
- task_postrun is fired and tries to attach tags to the span, but at this point it's doomed, since the span has already finished
All in all, I would recommend reverting #10676 and coming up with a better implementation, or at least removing the task_postrun part of it, since it's utterly broken.
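If you want to confirm this ordering yourself without digging into ddtrace internals, plain Celery signal handlers are enough. The sketch below is mine (not part of ddtrace or of any fix) and only relies on Celery's documented task_prerun, before_task_publish and task_postrun signals; drop it into tasks.py and the worker output will show the callback being published before the parent task's task_postrun fires:
import time
from celery import signals

# Print a monotonic timestamp per signal so the relative ordering is visible.
@signals.task_prerun.connect
def log_prerun(task_id=None, task=None, **_):
    print(f"{time.monotonic():.6f} task_prerun            {task.name}[{task_id}]")

@signals.before_task_publish.connect
def log_publish(sender=None, **_):
    # In a chain, this fires from inside the parent task while it schedules its callback.
    print(f"{time.monotonic():.6f} before_task_publish -> {sender}")

@signals.task_postrun.connect
def log_postrun(task_id=None, task=None, **_):
    print(f"{time.monotonic():.6f} task_postrun           {task.name}[{task_id}]")
With the reproduction below, the before_task_publish line for tasks.b shows up before the task_postrun line for tasks.a, which is exactly the window in which the span gets force-finished.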
Reproduce
tasks.py:
from celery import Celery, chain

app = Celery("tasks")

@app.task
def a():
    print("a")

@app.task
def b():
    print("b")
fire.py:
from tasks import a, b

(a.si() | b.si()).delay()
Run Celery with:
DD_TRACE_ENABLED=1 ddtrace-run celery -A tasks.app worker -l debug -P solo -c 1
Then fire the chain with:
python3 fire.py
You'll see the following debug logs in the Celery worker:
[2024-11-21 14:43:11,621: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] received
...
prerun signal start task_id=8d17b2e1-421d-410f-a81b-a49663fba490
...
The task_postrun signal was not called, so manually closing span: name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=None duration=None error=0 tags={'component': 'celery', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, 'process_id': 48412} links='' events=''
...
finishing span name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=1732192991.624561 duration=0.002615 error=0 tags={'_dd.base_service': '', '_dd.p.dm': '-0', '_dd.p.tid': '673f2adf00000000', 'component': 'celery', 'language': 'python', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, '_dd.top_level': 1, '_dd.tracer_kr': 1.0, '_sampling_priority_v1': 1, 'process_id': 48412} links='' events='' (enabled:True)
...
[2024-11-21 14:43:11,625: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] succeeded in 0.003730249998625368s: None
...
postrun signal task_id=8d17b2e1-421d-410f-a81b-a49663fba490
And you'll see that the first span doesn't have any celery.* tags, for example celery.state.
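If you'd rather verify the missing tag programmatically than scan the debug logs, a trace filter can log it right before spans are flushed. This is only a sketch of mine (not part of the original report); it assumes ddtrace's documented TraceFilter / tracer.configure(settings={"FILTERS": ...}) API and can also live in tasks.py:
import logging

from ddtrace import tracer
from ddtrace.filters import TraceFilter

log = logging.getLogger(__name__)

class LogCeleryState(TraceFilter):
    # Log celery.state for every celery.run span before the trace is sent out.
    def process_trace(self, trace):
        for span in trace:
            if span.name == "celery.run":
                log.warning("celery.run %s celery.state=%r", span.resource, span.get_tag("celery.state"))
        return trace

tracer.configure(settings={"FILTERS": [LogCeleryState()]})
On affected versions, the span for tasks.a should log celery.state=None, while the final task in the chain still carries its state.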
Thank you for the report and sorry for the behavior that the change had on your spans. The update was to fix a bug where we weren’t closing spans. I really appreciate the level of detail you provided on how to reproduce and your suggestion!
I will be taking a look to discuss with my team the best way to safely get the experience you’re looking for.
As a workaround - are you ok with pinning your version for now?
Apologies for the inconvenience!
@wantsui Yes, we have pinned the version to 2.14.1 for the time being. Thanks for the quick response!
@iherasymenko - Thanks for your report! This is something I am working on!
As an update, I was able to get the same behavior when running the code snippets in their own apps as described in this specific thread (11479), but I had some issues trying to get the dd-trace-py tests to replicate this exact behavior. That has more to do with how the Celery tests are set up, though.
I'm reviewing your report in #11624 as part of the consideration for the fix. I'll report back once I have confidence that our tests correctly capture this scenario to avoid any issues.
I'll update both issues once I have another update.
The following is a workaround for this problem:
from celery import signals
from wrapt import wrap_function_wrapper

@signals.worker_init.connect
def workaround_ddtrace_bug(**_):
    def set_item_wrapper(wrapped, instance, args, kwargs):
        # Drop ddtrace's attempt to register the worker span under "prerun_span",
        # so _traced_apply_async_inner never finds it (and can't prematurely finish it).
        if args[0] == "prerun_span":
            return
        return wrapped(*args, **kwargs)

    wrap_function_wrapper(
        "ddtrace.internal.core",
        "set_item",
        set_item_wrapper,
    )
It neutralizes the registering of the current span under prerun_span, so _traced_apply_async_inner won't prematurely close it.
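Note that, assuming a standard worker setup, this snippet has to be imported before the worker starts consuming tasks (for example by placing it in the same module as the Celery app, or in a module listed in the imports setting), otherwise the worker_init handler won't be connected in time.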
As an update, I got the approvals I needed to get this merged. I'm also working to backport this fix to 2.18.0, 2.17.0, and 2.16.0. Would any of these versions work for the use case you reported here, @oranav?
I need it on 2.18.
Thanks for letting me know, @esauser! I'm working on it over the next two days and will reply to this issue once it's ready.
As an update, we got 2.18.1 out with the celery update: https://github.com/DataDog/dd-trace-py/releases/tag/v2.18.1 . Can you give that a try?
Unfortunately, we can't release patches for 2.17.x or 2.16.x due to the holidays and code freezes, so those will be out in January instead. I'll update this thread once we have those other versions out. Apologies for the inconvenience!
Closing due to inactivity. Feel free to reopen.