spandex-project/spandex

Implement trace sampling, and perhaps emergency load shedding

We want to make sure that Spandex can be used in large-scale deployments, and to do so we need to ensure that sampling is implemented natively.

The Datadog agent samples before sending to Datadog, but it should be possible to configure sampling in the application as well.

There could also be a configurable latency threshold, so that only requests slower than the threshold are sent to the agent.
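As a rough illustration of that idea (not an existing Spandex option; the module and names here are hypothetical):

```elixir
# Hypothetical latency-threshold check, applied when a trace finishes.
# Nothing here is existing Spandex API; it only sketches the idea above.
defmodule LatencySampler do
  @default_threshold_ms 250

  @doc "Decide whether a finished trace is slow enough to send to the agent."
  def keep?(duration_ms, threshold_ms \\ @default_threshold_ms) do
    duration_ms >= threshold_ms
  end
end

# Usage at flush time (send_to_agent/1 is a stand-in for the real sender):
# if LatencySampler.keep?(duration_ms), do: send_to_agent(trace)
```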

Quick feedback on https://medium.com/@gmefford/diving-into-distributed-tracing-ce9638025576.

Datadog APM uses the traces to measure aggregate service metrics, which would be incorrectly counted when using priority sampling with distributed traces.

I think this is not quite accurate. The tracer can count the total number of traces (pre-sampling) and send it as a header, which allows the agent to scale the metrics generated from the sampled traces accordingly.

When possible, I'd suggest alternate solutions like sending traces to the agent more frequently instead of lowering the sampling rate. (In several tracers, we opt to serialize the traces individually and monitor the aggregate size. When the total exceeds a threshold we flush it, even if that means sending more than once per second.)
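To make the flush-by-size idea concrete, here is a rough Elixir sketch; the buffer shape, the 1 MB limit, and `term_to_binary` as the serializer are all placeholders rather than how any particular tracer does it:

```elixir
# Sketch: serialize traces as they arrive, track the running payload size,
# and flush early when the total crosses a threshold instead of waiting for
# the next timed flush. All names and limits here are illustrative.
defmodule SizeBoundedBuffer do
  @max_payload_bytes 1_000_000

  def new, do: %{chunks: [], bytes: 0}

  # `flush_fun` is whatever actually ships the payload to the agent.
  def add(%{chunks: chunks, bytes: bytes} = buffer, trace, flush_fun) do
    encoded = :erlang.term_to_binary(trace)
    new_bytes = bytes + byte_size(encoded)

    if new_bytes >= @max_payload_bytes do
      flush_fun.([encoded | chunks])
      new()
    else
      %{buffer | chunks: [encoded | chunks], bytes: new_bytes}
    end
  end
end
```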

Thanks for pointing out that header, @tylerbenson! I was aware that there are some things you can pass on to the DD Agent to tell it about your sampling rate, but I haven't invested the time yet to figure out how to make it reliable. It's definitely something I'd like to have working in the future for our Datadog exporter (either in Spandex or in the upcoming OpenTelemetry integrations with the OpenCensus project).

My concern with what you're describing, though, is that in order to count the number of traces we didn't send, we have to somehow still do part of the work of tracing so that we have enough of a trace to propagate a trace context from upstream callers to downstream calls, and also maintain (per-service-per-resource?) metrics about unsampled traces so that we can pass those counts to the agent when we eventually do sample and send a trace. That seems possible, but complicated to implement properly. In general, I feel like it's important to be able to sample your traces if you need to, but still get accurate time-series metrics. Ideally, there would be a way to do that "for free," but there are always other solutions like Statix or Telemetry.Metrics to do this on your own, in whatever way works for your team.

The issue that we've seen in practice is that the way Spandex sends traces by default (as I mentioned in the blog post) is with calls to HTTPoison, which end up using Hackney's default pool. Under heavy load, it's easy to fill that pool up with trace-flushing requests so that the system can't get any other HTTP calls through. That actually might not be a big problem for a "standard web application" that is mostly serving database requests, but it quickly becomes problematic for an "API gateway" type service, whose job is mostly to transform an HTTP request into a few downstream HTTP calls and process the results before returning them back upstream.

In this case, I think making more small HTTP calls more often would actually make things worse. In the end, it's possible to work around the issue by not using Hackney's default pool, or by substituting your own HTTP client that doesn't use Hackney at all. I'm also interested to see whether using Unix domain sockets could help out in this area.
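For reference, a dedicated Hackney pool for trace flushing looks roughly like this (general HTTPoison/Hackney usage, not Spandex configuration; the pool name and limits are placeholders):

```elixir
# Start a pool reserved for trace flushing, e.g. in your application's
# start/2 callback, so trace traffic can't exhaust Hackney's default pool.
:hackney_pool.start_pool(:trace_pool, timeout: 15_000, max_connections: 50)

# Any HTTPoison call can then opt into that pool:
# HTTPoison.put(url, body, headers, hackney: [pool: :trace_pool])
```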

@GregMefford I don't think that you have to keep track of how many traces you don't sample. I think you just send along the chance that a trace would be sampled whenever you send a trace. I think you can do it per resource or globally; I forget the details. Let's say 10% is your sampling rate. When you start a trace, you do a 1-in-10 probability check, and if it succeeds then you continue with the trace, but include a piece of metadata or a header saying "there was a 10% chance for this," and the collector can extrapolate that you skipped 9 for each one you sent, which will correctly scale the metrics.
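A minimal sketch of that check in Elixir (the module, the callback, and the way the rate is attached are assumptions for illustration, not Datadog's actual wire format):

```elixir
# Sketch: head-based probability sampling, recording the rate on the trace
# so the collector can scale counts back up. Names are hypothetical.
defmodule ProbabilitySampler do
  @sample_rate 0.1

  def maybe_send(trace, send_to_agent) when is_function(send_to_agent, 1) do
    if :rand.uniform() <= @sample_rate do
      # Each kept trace stands in for 1 / 0.1 = 10 real requests.
      trace
      |> Map.put(:sample_rate, @sample_rate)
      |> send_to_agent.()
    else
      # Dropped entirely; nothing else is tracked locally.
      :dropped
    end
  end
end
```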

Right, I think what you're describing is one option in the API where you can hint at the sampling rate, but I suspect there's also a counter mechanism like the one @tylerbenson is describing. The reason is that with a distributed trace, you can't control your own sampling rate as a service: upstream services can (via propagated context) force you to sample things that you normally wouldn't have, according to your local sampling probability. In that case, you can't simply multiply your trace counts by your inverse sampling rate, because that would inflate them.

But wouldn't the calling service also pass along the sample rate for that specific request? That's what you would propagate, instead of the called service's sampling rate. Look at it this way:

Service A has a sampling rate of 50%.
Service B has a sampling rate of 10%.

Both services can be called independently, but service A also calls service B in its operation.

10 requests go to service B, and 1 gets to the collector, with a sampling rate of 10%, so the collector can extrapolate that it was 10 requests.

10 requests go to service A, and 5 get to the collector. Because service A called service B, service B is forced to trace all 5 of the requests that it got as a downstream effect. But service A passed along its sampling rate, so service B sends all 5 to the collector, along with the sampling rate provided by service A. So the collector sees 5 requests to each service, with a sampling rate of 50%.

Using this information, it can correctly extrapolate that 20 requests went to service B, and 10 requests went to service A.
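Spelling out the arithmetic the collector would do in that example:

```elixir
# estimated_requests = observed_traces / sample_rate
service_a = 5 / 0.5            # => 10.0 requests to service A
service_b = 1 / 0.1 + 5 / 0.5  # => 20.0 requests to service B (direct + via A)
```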

If it doesn't work that way, then I think it should ¯\_(ツ)_/¯

I don't think there's any way to propagate sampling rates to downstream services; only whether or not to sample a particular trace (the priority). Each service would send its sampling rate to the collector along with each batch of traces. Even if there were a way to propagate sampling rates downstream, it would get complicated really fast if you had a chain of calls through a dozen services that may or may not make certain downstream calls based on the data payloads of the requests.

Maybe the trace collector could work it out statistically, based on the traces it actually sees and the sampling rate each service reports, but what I've seen in practice is that it struggles to even stitch traces together reliably when you're sending a lot of them, so my guess is that any time the inter-service trace linkage gets broken, it would throw off these counts pretty significantly.

Yeah, I see your point. Like maybe in my example service A only calls service B some of the time. Makes sense.

@GregMefford I would strongly encourage separating the trace reporting into a different pool to avoid possible problems. I'd also suggest setting a very limited timeout on the connection/request when sending data to the agent (some of our language clients set a 1-2 second limit).
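Concretely, with HTTPoison that advice could look something like this; the pool name, the agent URL, and the exact timeout values are assumptions to adapt to your setup:

```elixir
# Sketch: flush traces through a dedicated pool with tight timeouts so a
# slow or unreachable agent can't tie up connections for long.
payload = ""  # placeholder; the real body would be the msgpack-encoded traces

HTTPoison.put(
  "http://localhost:8126/v0.3/traces",  # Datadog agent trace endpoint
  payload,
  [{"Content-Type", "application/msgpack"}],
  hackney: [pool: :trace_pool],  # the dedicated pool from earlier
  timeout: 1_000,                # connect timeout (ms)
  recv_timeout: 2_000            # response timeout (ms)
)
```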