spandex-project/spandex

Spandex processes dying

Closed this issue · 5 comments

I am a bit at a loss as to where to start debugging this, so this is a broad ask for help. When we put some load on the system, the Spandex GenServer seems to crash from time to time, causing spans to fail hundreds of thousands of times with messages like:

GenServer.call(#PID<0.934.1>, {:update, #Function<18.39421655/1 in SpandexDatadog.ApiServer.handle_call/3>}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
GenServer terminating:
exited in: GenServer.call(SpandexDatadog.ApiServer, {:send_trace, ... 30000)
    ** (EXIT) exited in: Task.await(%Task{owner: #PID<0.2917.0>, pid: #PID<0.9250.0>, ref: #Reference<0.698663207.2437152781.202107>}, 5000)
        ** (EXIT) time out

I cannot pinpoint a specific reason why the Spandex GenServer itself crashes, but perhaps the timeout is simply being hit when there is a queue or something?

Any help/directions on how to debug this would be greatly appreciated.

It looks like you're hitting the limit of sync_threshold and Spandex is trying to apply back-pressure on the rest of the system, but the load spike isn't resolving itself within the 5-second default timeout in this Task.await/2 call: https://github.com/spandex-project/spandex_datadog/blob/master/lib/spandex_datadog/api_server.ex#L178

What do you have your batch_size and sync_threshold set to? Those are the main things to tune so that your system can push a lot of traces through without letting the work of sending them to the Datadog Agent bog down the whole system.
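For reference, those options are passed to SpandexDatadog.ApiServer where it is started in your supervision tree. A rough sketch (the values here are just starting points to tune, MyApp is a placeholder, and you should check the option names against the SpandexDatadog version you're on):

    # in lib/my_app/application.ex
    children = [
      # ... your other children ...
      {SpandexDatadog.ApiServer,
       host: "localhost",
       port: 8126,
       batch_size: 100,      # traces sent per request to the Datadog Agent
       sync_threshold: 100,  # concurrent send tasks allowed before callers block
       http: HTTPoison}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)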

I'm also curious what kind of request rate you're seeing, because we've pushed this pretty far with batch_size and sync_threshold each set to 100. Then again, we don't use very large instances, so each of our hosts is not handling that much traffic. If you have one huge instance handling a ton of traffic, I think you'd need to tune things differently (and probably do some tuning on your HTTP client as well).

Thank you so much for taking the time to respond; honestly, I wasn't expecting much help with such a vague issue, but this helps me greatly. Currently, we are using the defaults. I tried searching for config options but somehow totally missed these (they might warrant a separate heading in the docs somewhere).

This is actually not that big of an app. I dug into some logs and we are only tracing a couple of requests per second; the culprit seems to be the Ecto tracer in combination with Exq, since we do quite a bit of Ecto work in the background. Lots of tiny queries: the ones that do arrive in Datadog peak at around 700 Ecto spans per second, but that might be cut off due to the dying GenServer. Each Ecto span is probably also part of a larger decorated span, so if those count separately it's probably double that.

Would this already be in the range that causes the issue? What kind of options should I be trying? 100/100? Either way, I will play with some numbers; this gives me more things to try :)

Ah cool! Yeah, if you weren't aware of these options and you're pushing something like 700 Ecto spans per second in addition to various other decorated spans, I think you're getting into the territory where it makes sense to at least think about tuning them.

Another thing you might hit, which I've seen before with these larger traces, is that Datadog's Agent will semi-silently drop traces where the trace payload is over 10 MB (which isn't that hard to do when you have a few dozen traces that each contain hundreds of Ecto calls traced with their full SQL queries). You can see it in the DD Agent logs, but it doesn't throw an error, IIRC. The way I've solved this is to decrease batch_size so that each batch of traces fits under 10 MB at least most of the time (this is after being msgpack-encoded, I think). The trade-off with a smaller batch size (maybe 10 or 20 instead of 100) is that you need to make more HTTP calls, but if oversized payloads are part of the problem, that's probably fine: each HTTP transfer is going to be quite large anyway, and it makes sense to split it into smaller ones.
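If you want a rough feel for whether a given batch size will clear that limit, you can msgpack-encode a sample of your own trace data and measure it. A sketch (sample_traces is a placeholder for whatever representative data you capture; I believe SpandexDatadog encodes with Msgpax, but the number is only an approximation of the real payload):

    # Rough size estimate for one batch of trace data (sample_traces is hypothetical).
    encoded = Msgpax.pack!(sample_traces)
    size_mb = IO.iodata_length(encoded) / 1_000_000
    IO.puts("approximate batch payload: #{Float.round(size_mb, 2)} MB")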

The sync_threshold option controls how many parallel HTTP send tasks can be in flight before calls to finish_trace start blocking. Since you seem to be hitting this threshold, it's hard to say whether you should raise or lower it. If you lower it, you might just end up with more callers blocked and waiting for the queue to flush, while Phoenix keeps accepting new connections, so new work keeps entering the system anyway. If you raise it, that will probably help with the SpandexDatadog GenServer crashing (because fewer tasks will be blocked waiting for a turn), but you might end up exhausting other resources, like Hackney pool connections or OS-level TCP sockets. In our case, we set up a dedicated Hackney connection pool for tracing and configured SpandexDatadog to use that customized HTTP client instead of the default, which is HTTPoison with the default Hackney pool.
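Roughly, that setup looks like the sketch below. :spandex_pool and MyApp.TracingHTTP are placeholder names, the pool sizing is illustrative, and you should double-check which function(s) your SpandexDatadog version expects the http module to expose (I believe it calls put/3 in the same shape as HTTPoison's):

    # A thin HTTPoison wrapper that pins trace uploads to a dedicated Hackney pool,
    # so sending traces can't starve the pool used by your regular outbound HTTP calls.
    defmodule MyApp.TracingHTTP do
      @opts [hackney: [pool: :spandex_pool]]

      def put(url, body, headers), do: HTTPoison.put(url, body, headers, @opts)
      def post(url, body, headers), do: HTTPoison.post(url, body, headers, @opts)
    end

    # In the supervision tree: start the pool, then point the ApiServer at the wrapper.
    children = [
      :hackney_pool.child_spec(:spandex_pool, timeout: 15_000, max_connections: 50),
      {SpandexDatadog.ApiServer, host: "localhost", port: 8126, http: MyApp.TracingHTTP}
    ]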

This might be more information than you were hoping for, but maybe it will be helpful to you as you experiment, or to someone else who comes along later and sees this. 😁 🚀

I increased the batch_size to 100 (from the default of 10) and this seems to have fixed the issue! I initially tried this using the provided environment variables like in the example, but that did not work: environment variables are strings, and the app fails to boot because it expects integers.

Thank you so much for your help!

Awesome! For our apps, we just have some boilerplate config that parses values out of environment variables (if present) and passes them in to configure the server in the application supervision tree.
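Something along these lines (a sketch; the env var names and defaults are just examples):

    # in application.ex: parse optional env vars into integers before starting the ApiServer
    batch_size = String.to_integer(System.get_env("SPANDEX_BATCH_SIZE") || "100")
    sync_threshold = String.to_integer(System.get_env("SPANDEX_SYNC_THRESHOLD") || "100")

    children = [
      {SpandexDatadog.ApiServer,
       host: System.get_env("DATADOG_HOST") || "localhost",
       port: 8126,
       batch_size: batch_size,
       sync_threshold: sync_threshold}
    ]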