DataDog/dd-trace-py

Celery task_postrun is called only after the span has finished in chains (due to a regression)


#10676 introduced a regression that prevents the task_postrun signal from ever attaching tags to a span when working with Celery canvases (chains, callbacks, chords, etc.).

When working with a chain, for instance, the task always calls apply_async on its callback just before it finishes (i.e. before task_postrun is fired). With the changes introduced in #10676, the span is already finished at this point (and sent to the tracer), so adding new tags has no effect, leaving the span degenerate.

This means that the following happens:

  1. task_prerun is called, creating a new span, and registering it under prerun_span
  2. The task is run (either successfully or not)
  3. Celery sees that there's a callback registered, and fires the next task in the chain
  4. Task.apply_async is invoked, but it's wrapped by _traced_apply_async_inner. Inside its finally block, the wrapper sees that there's a prerun_span registered for the worker span, and therefore forcefully finishes it
  5. task_postrun is fired and tries to attach tags to the span, but by then it's too late: the span has already finished, so the tags never make it onto the reported span (see the sketch after this list).
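
To make step 5 concrete, here is a minimal sketch using the plain ddtrace API (simplified, not the actual integration code) of why tags applied after the forced finish never show up on the reported span:

from ddtrace import tracer

# Step 4 (simplified): the traced apply_async wrapper force-finishes the worker span.
span = tracer.trace("celery.run", service="celery-worker", resource="tasks.a")
span.finish()

# Step 5 (simplified): task_postrun then tries to tag the span, but the span has
# already been finished and handed off, so the tag never reaches what was reported.
span.set_tag("celery.state", "SUCCESS")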

All in all, I would recommend reverting #10676 and coming up with a better implementation, or at least removing the task_postrun part of it, since it's utterly broken.

Reproduce

tasks.py:

from celery import Celery, chain

app = Celery("tasks")


@app.task
def a():
    print("a")


@app.task
def b():
    print("b")

fire.py:

from tasks import a, b

# Build a chain of two immutable signatures (a -> b) and enqueue it
(a.si() | b.si()).delay()

Run Celery with:

DD_TRACE_ENABLED=1 ddtrace-run celery -A tasks.app worker -l debug -P solo -c 1

Then fire the chain with:

python3 fire.py

You'll see the following debug logs in the Celery worker:

[2024-11-21 14:43:11,621: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] received
...
prerun signal start task_id=8d17b2e1-421d-410f-a81b-a49663fba490
...
The task_postrun signal was not called, so manually closing span: name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=None duration=None error=0 tags={'component': 'celery', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, 'process_id': 48412} links='' events=''
...
finishing span name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=1732192991.624561 duration=0.002615 error=0 tags={'_dd.base_service': '', '_dd.p.dm': '-0', '_dd.p.tid': '673f2adf00000000', 'component': 'celery', 'language': 'python', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, '_dd.top_level': 1, '_dd.tracer_kr': 1.0, '_sampling_priority_v1': 1, 'process_id': 48412} links='' events='' (enabled:True)
...
[2024-11-21 14:43:11,625: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] succeeded in 0.003730249998625368s: None
...
postrun signal task_id=8d17b2e1-421d-410f-a81b-a49663fba490

And you'll see that the first span doesn't have any celery.* tags, for example celery.state.

Thank you for the report, and sorry for the effect the change had on your spans. The update was meant to fix a bug where we weren’t closing spans. I really appreciate the level of detail you provided on how to reproduce, and your suggestion!

I will be taking a look to discuss with my team the best way to safely get the experience you’re looking for.

As a workaround - are you ok with pinning your version for now?

Apologies for the inconvenience!

@wantsui Yes, we have pinned the version to 2.14.1 for the time being. Thanks for the quick response!
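
(For readers looking for the same stopgap: the pin above can be expressed as a plain pip requirement; 2.14.1 is simply the version the reporter found unaffected, not necessarily the only safe one.)

pip install "ddtrace==2.14.1"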

Hello @wantsui. I'm wondering if there's any update on this. This issue also affects our organization.

@iherasymenko - Thanks for your report! This is something I am working on!

As an update, I was able to reproduce the same behavior by running the code snippet in its own app as described in this specific thread (11479), but I had some issues getting the dd-trace-py tests to replicate it. That's more a matter of how the Celery tests are set up, though.

I'm reviewing your report in #11624 as part of the consideration for the fix. I'll report back once I have confidence that our tests correctly capture this scenario to avoid any issues.

I'll update both issues once I have another update.

The following is a workaround for this problem:

from celery import signals
from wrapt import wrap_function_wrapper

@signals.worker_init.connect
def workaround_ddtrace_bug(**_):
    def set_item_wrapper(wrapped, instance, args, kwargs):
        # Drop the "prerun_span" registration so the traced apply_async
        # wrapper never sees a span it could prematurely finish.
        if args[0] == "prerun_span":
            return
        return wrapped(*args, **kwargs)

    # Patch ddtrace's internal core API once the worker starts.
    wrap_function_wrapper(
        "ddtrace.internal.core",
        "set_item",
        set_item_wrapper,
    )

It prevents the current span from being registered under prerun_span, so _traced_apply_async_inner won't prematurely close it.
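
One note on wiring this up: the module containing the handler has to be imported by the worker process so the @signals.worker_init.connect registration actually runs. A minimal sketch using the reproduce case above (the module name ddtrace_workaround is hypothetical):

# tasks.py
from celery import Celery

import ddtrace_workaround  # noqa: F401  # hypothetical module holding the worker_init handler above

app = Celery("tasks")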

As an update, I got the approvals I needed to get this merged. I'm also working to backport this fix to 2.18.0, 2.17.0, and 2.16.0; would any of these versions work for the use case you reported here, @oranav?

I need it on 2.18

Thanks for letting me know, @esauser! I'm working on it over the next two days and will reply to this issue once it's ready.

@wantsui Thanks. We're currently on 2.17.

As an update, we got 2.18.1 out with the celery update: https://github.com/DataDog/dd-trace-py/releases/tag/v2.18.1 . Can you give that a try?
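
(For anyone upgrading to pick up the fix, assuming a pip-based install:)

pip install --upgrade "ddtrace>=2.18.1"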

Unfortunately, we can't release patches for 2.17.x or 2.16.x due to the holidays and code freezes, so that'll be out in January instead. I'll update this thread once we have those other versions out in January. Apologies for the inconvenience!

Closing due to inactivity. Feel free to reopen.