Dynatrace/OneAgent-SDK-for-Python

Incomplete traces with diagnostic message "Some data could not be collected or transmitted." with asynchronous calls

o-khytrov opened this issue · 1 comment

Description:
After switching a FastAPI service to asynchronous calls, we ran into an issue with Dynatrace tracing. In synchronous mode, traces were complete. Since the switch, some traces contain the diagnostic message:

Some data could not be collected or transmitted. This is most likely due to a resource congestion on network, host or process level in your monitored environment. (Error code: C1)

These traces lack information such as custom request attributes, status code, and response time.

Details:

  • The service is instrumented with the OneAgent SDK for Dynatrace.
  • The same issue occurs with the autodynatrace package (https://github.com/dynatrace-oss/OneAgent-SDK-Python-AutoInstrumentation), which internally uses the OneAgent SDK.
  • The Dynatrace tracing code runs inside a middleware.
  • We tried the in_process_link approach suggested in the documentation (see the code snippet in this issue), without success.
  • With lower request loads, tracing behaves as expected. Under higher loads, where a single process handles multiple asynchronous requests concurrently, the issue appears.

Code Snippet:

async def handle_post_async(self, data, request, handler):

    app_info = self.__get_app_info(request)
    tag = request.headers.get(oneagent.common.DYNATRACE_HTTP_HEADER_NAME)
    sdk = oneagent.get_sdk()
    link = sdk.create_in_process_link()
    with sdk.trace_in_process_link(link):
        with sdk.trace_incoming_web_request(app_info,
                                            str(request.url),
                                            request.method,
                                            str_tag=tag) as tracer:

            try:
                def trace_params(params: dict):
                    tracer.add_parameters(params)

                request.state.tracer = trace_params

                result = await handler(data, request)
                response = make_response(result)
                tracer.set_status_code(response.status_code)
            except Exception as e:
                # Avoid shadowing the built-in `type`.
                exc_name = e.__class__.__name__
                if exc_name == 'BadRequest':
                    response = self.create_error_response(str(e), 400, request, logging.WARNING)
                elif exc_name == 'ValueError':
                    response = self.create_error_response(str(e), 400, request, logging.ERROR)
                else:
                    response = self.create_error_response(str(e), 500, request, logging.ERROR)

                tracer.set_status_code(response.status_code)

        return response

Expected Behavior:
Traces in Dynatrace should contain all necessary custom request attributes and response time information consistently, regardless of the request load or asynchronous nature of the service.

Hello, the OneAgent SDK is known not to work well with these async patterns. The README explains why at https://github.com/Dynatrace/OneAgent-SDK-for-Python/blob/master/README.md#tracers and suggests trying OpenTelemetry instead if you need to trace applications that rely on async patterns:

A Tracer instance can only be used from the thread on which it was created.
Whenever you start a tracer, the tracer becomes a child of the previously active tracer
on this thread and the new tracer then becomes the active tracer. You may only end the active tracer.
If you do, the tracer that was active before it (its parent) becomes active again.
Put another way, tracers must be ended in reverse order of starting them
(you can think of this being like HTML tags where you must also close the child tag before you can close the parent tag).
While the tracer's automatic parent-child relationship works very intuitively in most cases,
it does not work with asynchronous patterns, where the same thread handles multiple logically
separate operations in an interleaved way on the same thread. If you need to instrument
such patterns with the SDK, you need to end your tracer before the thread is potentially reused
by any other operation (e.g., before yielding to the event loop). To later continue the trace,
capture an in-process link before and later resume using the in-process link tracer, as explained in
Trace in-process asynchronous execution. This approach is rather awkward and
may lead to complex and difficult to interpret traces. If your application makes extensive use of
asynchronous patterns of the kind that is difficult to instrument with the SDK, consider using
the OpenTelemetry support of Dynatrace instead.
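
To make the quoted explanation more concrete, here is a small toy model of a per-thread tracer stack. It does not use the SDK at all: `ThreadTracerStack`, `naive_request`, and `segmented_request` are hypothetical names invented for this illustration, and the real agent's bookkeeping may well differ. It only sketches why keeping a tracer open across an `await` can break the "end in reverse order" rule when two requests interleave on one thread, and why ending the tracer before yielding to the event loop avoids that.

```python
import asyncio


class ThreadTracerStack:
    """Toy stand-in for per-thread tracer bookkeeping (NOT the SDK's implementation)."""

    def __init__(self):
        self.active = []      # started tracers, innermost last
        self.parents = {}     # tracer name -> parent that was active at start
        self.violations = []  # tracers ended while they were not the active one

    def start(self, name):
        # A new tracer becomes a child of whatever is active on this thread.
        self.parents[name] = self.active[-1] if self.active else None
        self.active.append(name)

    def end(self, name):
        if self.active and self.active[-1] == name:
            self.active.pop()
        else:
            # Ended out of order: something else was started after it.
            self.violations.append(name)
            if name in self.active:
                self.active.remove(name)


async def naive_request(stack, name, delay):
    # The tracer stays open across the await, so another request can
    # start its own tracer on the same thread in the meantime.
    stack.start(name)
    await asyncio.sleep(delay)
    stack.end(name)


async def segmented_request(stack, name, delay):
    # The tracer is ended before yielding to the event loop. The second
    # segment stands in for resuming the trace via an in-process link.
    stack.start(name + ":before-await")
    stack.end(name + ":before-await")
    await asyncio.sleep(delay)
    stack.start(name + ":after-await")
    stack.end(name + ":after-await")


async def run(request_fn):
    stack = ThreadTracerStack()
    # Request A starts first and also finishes first, while B is still open.
    await asyncio.gather(
        request_fn(stack, "A", 0.01),
        request_fn(stack, "B", 0.05),
    )
    return stack


naive = asyncio.run(run(naive_request))
segmented = asyncio.run(run(segmented_request))

print("naive violations:", naive.violations)
print("naive parent of B:", naive.parents["B"])
print("segmented violations:", segmented.violations)
```

In the naive variant, request B is recorded as a child of the unrelated request A, and A is ended while B is the active tracer. In the segmented variant no tracer is open while the event loop runs other work, so neither problem shows up in this model. This matches the load dependence described in the issue: with only one request in flight at a time, the naive variant would never interleave.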