DataDog/dd-trace-cpp

The Datadog Agent encountered this error: traces_dropped(payload_too_large:466059)


nqf commented

What methods do we have to control the size of the payloads being sent?

Hi @nqf

Could you provide more details, such as the runtime environment, how your application integrates with dd-trace-cpp, and how frequently this issue occurs? Additionally, to ensure we're on the same page, could you explain what the Datadog proxy is?

My guess is the Datadog proxy is the Datadog Agent, and "payload too large" refers to this behavior in the Agent.

Looks like the default limit is 25 MB, which is an awful lot of traces.

nqf commented

Yes, by Datadog proxy I meant the Datadog Agent. We have a service that processes approximately 19,000 requests per second. Right now I am using a global tracer. I understand that it creates only one HTTP client to send spans to the Agent, right?

nqf commented

Now, when our load reaches 10,000 requests/sec, this error occurs.

Damien still needs to know which integration you're using, e.g. NGINX, Envoy, or Istio.

As a workaround, you can tell the tracing library to send payloads to the Agent more often, but that option does not have a corresponding environment variable. So, that would apply only if you're using dd-trace-cpp manually in C++ code.
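
If you construct the tracer yourself, it would look something like this. A minimal sketch modeled on the project README; the service name is a placeholder, and I'm assuming a version of the library where the option lives on TracerConfig::agent:

```cpp
#include <datadog/tracer.h>
#include <datadog/tracer_config.h>

#include <iostream>

int main() {
  namespace dd = datadog::tracing;

  dd::TracerConfig config;
  config.service = "my-service";  // placeholder
  // Flush buffered traces to the Agent every 200 ms instead of the
  // default 2000 ms, so each payload batches roughly 10x fewer traces.
  config.agent.flush_interval_milliseconds = 200;

  const auto validated = dd::finalize_config(config);
  if (!validated) {
    std::cerr << validated.error().message << '\n';
    return 1;
  }

  dd::Tracer tracer{*validated};
  // ... create spans as usual ...
  return 0;
}
```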

nqf commented

Our application is implemented based on this example; the only difference is that our HTTP framework is not httplib. By the way, we have already set flush_interval_milliseconds in the program, but the error still happens. I am now planning to use multiple dd::Tracer instances.
https://github.com/DataDog/dd-trace-cpp/blob/main/examples/http-server/server/server.cpp

Somebody actually used the example! That's good to hear.

If the large payloads are due to many different traces being included in a flush interval, then reducing flush_interval_milliseconds will help. For example, set it to 200 to send payloads ten times faster than the default (which is 2000). Then payloads will be, on average, ten times smaller. It depends on the traffic pattern, of course.
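
For a rough sense of scale (back-of-the-envelope, counting only serialized trace bytes): at 10,000 traces per second, the default 2000 ms interval batches about 20,000 traces per payload, so an average serialized trace of only ~1.25 KB is enough to reach the Agent's 25 MB limit. At 200 ms, the same traffic yields ~2,000 traces per payload, roughly 2.5 MB.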

On the other hand, if the large payloads are due to individual traces that have many spans, then there is nothing you can configure to remedy this. dd-trace-cpp would have to be modified to break up its payloads, which is possible but not implemented.

nqf commented

Would it help to use multiple dd::Tracer instances? I think multiple tracers would spread the pressure.

> We have already set flush_interval_milliseconds in the program, but it still happens.

What value for flush_interval_milliseconds did you use?

> Would it help to use multiple dd::Tracer instances? I think multiple tracers would spread the pressure.

I doubt it. It depends on the statistical distributions your application has for "traces per second" and for "spans per trace." If the issue is "traces per second," then decreasing flush_interval_milliseconds is the workaround. If the issue is "spans per trace," then decreasing flush_interval_milliseconds may help, but if your application has individual traces that are each on the order of 25 MB when serialized, there is no present workaround.

Multiple Tracer objects would imply multiple clients sending HTTP requests to the Datadog Agent. I don't see how that would be any better than decreasing flush_interval_milliseconds, and then additionally you'd have to manage which Tracer object to use for a particular service request.
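
To make that bookkeeping concrete, here is a hypothetical sketch (TracerPool is my invention, not a library type): every tracer still points at the same Agent, and you must make sure that all spans of a given trace come from the tracer that started it.

```cpp
#include <datadog/tracer.h>
#include <datadog/tracer_config.h>

#include <atomic>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

namespace dd = datadog::tracing;

// Hypothetical round-robin pool of tracers. Each Tracer owns its own
// HTTP client and flush schedule, but they all send to the same Agent.
class TracerPool {
  std::vector<std::unique_ptr<dd::Tracer>> tracers_;
  std::atomic<std::size_t> next_{0};

 public:
  explicit TracerPool(std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
      dd::TracerConfig config;
      config.service = "my-service";  // placeholder
      auto validated = dd::finalize_config(config);
      if (!validated) {
        throw std::runtime_error(validated.error().message);
      }
      tracers_.push_back(std::make_unique<dd::Tracer>(*validated));
    }
  }

  // The caller must keep using the returned Tracer for the entire
  // trace (root span and all descendants); that is the extra
  // per-request bookkeeping mentioned above.
  dd::Tracer& next() {
    return *tracers_[next_++ % tracers_.size()];
  }
};
```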

The tracing library keeps track of certain telemetry metrics, but I'm not sure they can be used to infer the "distributions" I referred to above.