spandex-project/spandex_datadog

UnicodeConversionError causes ApiServer crash

Closed this issue · 2 comments

We are seeing this issue in testing locally and when deployed to production on some of our services that just adopted using Spandex Datadog.

Forgive me if this is something simple and not a bug but I am not the one that implemented it and am having a heck of a time trying to debug what exactly is happening when we see this error.

[error] [18:32:05.271] Task #PID<0.9958.0> started from SpandexDatadog.ApiServer terminating
** (UnicodeConversionError) invalid encoding starting at <<207, 9, 19, 82, 112, 251, 94, 245, 40>>
    (elixir 1.10.4) lib/list.ex:944: List.to_string/1
    lib/exvcr/adapter/hackney/converter.ex:6: ExVCR.Adapter.Hackney.Converter."parse_request_body (overridable 1)"/1
    lib/exvcr/adapter/hackney.ex:42: ExVCR.Adapter.Hackney.generate_keys_for_request/1
    lib/exvcr/handler.ex:27: ExVCR.Handler.get_response_from_cache/2
    lib/exvcr/handler.ex:17: ExVCR.Handler.get_response/2
    (httpoison 1.8.0) lib/httpoison/base.ex:846: HTTPoison.Base.request/6
    (spandex_datadog 1.1.0) lib/spandex_datadog/api_server.ex:160: SpandexDatadog.ApiServer.send_and_log/2
    (spandex_datadog 1.1.0) lib/spandex_datadog/api_server.ex:231: anonymous fn/2 in SpandexDatadog.ApiServer.maybe_flush_traces/1
    (elixir 1.10.4) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
    (stdlib 3.14.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
    (hackney 1.17.4) :hackney.request(:put, "http://localhost:8126/v0.3/traces", [{"Content-Type", "application/msgpack"}, {"X-Datadog-Trace-Count", 1}], [145, [151, [140, [164 | "type"], [162 | "db"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 26, 253, 246>>, [167 | "span_id"], <<207, 81, 215, 254, 176, 217, 250, 251, 255>>, [167 | "service"], [164 | "ecto"], [168 | "resource"], [[217, 83] | "SELECT ....... from ......... where ......."], [166 | "sql.db"], [173 | "repo_database"], [171 | "param_count"], [161 | "2"]], [165 | "error"], [0], [168 | "duration"], <<206, 0, 1, 56, 121>>], [140, [164 | "type"], [163 | "web"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 23, 154, 50>>, [167 | "span_id"], <<207, 25, 16, 150, 89, 227, 57, 154, 221>>, [167 | "service"], [163 | "oms"], [168 | "resource"], [[217, 67] | "MyAppWeb.ParentContext.V1.ChildController.phoenix_controller_pipeline/2"], [169 | "parent_id"], <<207, 69, 184, 136, 169, 18, 144, 97, 53>>, [164 | "name"], [[217, 67] | "MyAppWeb.ParentContext.V1.ChildController.phoenix_controller_pipeline/2"], [167 | "metrics"], [129, [181 | "_sampling_priority_v1"], [1]], [164 | "meta"], [128], [165 | "error"], [0], [168 | "duration"], <<206, 0, 19, 230, 135>>], [140, [164 | "type"], [163 | "web"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 22, 102, 254>>, [167 | "span_id"], <<207, 69, 184, 136, 169, 18, 144, 97, 53>>, [167 | "service"], [163 | "oms"], [168 | "resource"], [[217, 51] | "Elixir.MyAppWeb.ParentContext.V1.ChildController.update"], [169 | "parent_id"], <<207, 57, 43, 142, 229, 191, 162, 23, 10>>, [164 | "name"], [179 | "phx.router_dispatch"], [167 | "metrics"], [129, [181 | "_sampling_priority_v1"], [1]], [164 | "meta"], [128], [165 | "error"], [0], [168 | "duration"], <<206, 0, 21, 66, 161>>], [139, [164 | "type"], [163 | "web"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"], <<207, 22, 130, 36, 169, 71, 20, 106, 53>>, [167 | "span_id"], <<207, 57, 43, 142, 229, 191, 162, 23, 10>>, [167 | "service"], [163 | "oms"], [168 | "resource"], [[217, 32] | "PATCH /parent_route/v1/child_route/:id"], [164 | "name"], [167 | "request"], [167 | "metrics"], [129, [181 | "_sampling_priority_v1"], [1]], [164 | "meta"], [131, [168 | "http.url"], [190 | "/parent_route/v1/child_route/1234567"], [176 | "http.status_code"], [163 | "400"], [171 | "http.method"], [165 | "PATCH"]], [165 | "error"], [0], [168 | "duration"], <<206, 0, 23, 93, 159>>]]], [])

Seems to be the first bitstring in the hackney request.
I am not sure where that comes from, we do not make use of bitstrings in this service so I am not sure where it originates.
Could be a utf-8 encoding issue with something that is expected to be parsable to a string, I have seen that in Nerves projects where we make heavy use of bitstrings and have used the wrong number to represent the character.

 (hackney 1.17.4) :hackney.request(:put, "http://localhost:8126/v0.3/traces", [{"Content-Type", "application/msgpack"}, {"X-Datadog-Trace-Count", 1}], [145, [151, [140, [164 | "type"], [162 | "db"], [168 | "trace_id"], <<207, 9, 19, 82, 112, 251, 94, 245, 40>>, [165 | "start"]..., 

I do not really understand the format of the trace either, this may be something simple I am unaware of b/c I have yet to do much with telemetry stuff.

Any help would be greatly appreciated in figuring out what is causing this to error and blow up.

Hey there, thanks for opening an issue! I think the problem might be that ExVCR doesn’t work with the Msgpack protocol that Datadog uses. I’ve seen that before, and the solution for me was to have ExVCR ignore those since there’s not much sense in mocking those calls anyway unless you wanted to assert that certain telemetry was sent (but in that case, I’d recommend using your own replacement for the Adapter or ApiServer module in test, or something like that). I believe ExVCR has an option to ignore certain hosts IIRC, but I haven’t used it in a while.

Please let me know if that solves it for you! 🚀

Thank you @GregMefford for all your help. It is a testing issue with ExVcr.
This solves the issue thank you