spandex-project/spandex_datadog

msgpack error when sending to agent

Closed this issue · 13 comments

We're getting a fairly opaque error. Something that Spandex serializes is failing when it reaches the agent. The agent isn't giving us any information on what is failing, and verbose logging in Spandex doesn't provide any detail on what's happening.

13:11:30.839 [info] Sending 10 traces, 160 spans.
2019-03-19 13:11:30 ERROR (api.go:404) - cannot decode v0.3 traces payload: msgp: attempted to decode type "uint" with method for "str"

Not all traces seem to be affected, and we can't quite figure out what piece of data is causing it.

Interestingly, we have an infinite-scroll screen which makes the same call to the server repeatedly. As we scroll, we can see that some of the spans are sent successfully and others are not.

You can also set the datadog API server to verbose which should print all spans. How are you setting it to verbose currently?

SPANDEX_VERBOSE=true

datadog API server to verbose

Not sure what you mean. We have this set at DD_LOG_LEVEL=DEBUG currently and there is no additional information.

Sorry, I mean the spandex genserver that sends the information to the datadog agent. Somewhere you'll have something like this:

# Example configuration
# System.get_env/1 returns strings, so numeric options are parsed to integers.
opts =
  [
    host: System.get_env("DATADOG_HOST") || "localhost",
    port: String.to_integer(System.get_env("DATADOG_PORT") || "8126"),
    batch_size: String.to_integer(System.get_env("SPANDEX_BATCH_SIZE") || "10"),
    sync_threshold: String.to_integer(System.get_env("SPANDEX_SYNC_THRESHOLD") || "100"),
    http: HTTPoison
  ]

# in your supervision tree

worker(SpandexDatadog.ApiServer, [opts])

And you can add verbose?: true to that as well. In retrospect, that should honor the SPANDEX_VERBOSE env var as well :)

OK. We have that set as a build-time ENV var right now. Will update to make it runtime-configurable and follow up.

# Configure tracing verbosity
verbose_trace = System.get_env("SPANDEX_VERBOSE") == "true"
...
# Start Datadog APM worker
worker(SpandexDatadog.ApiServer, [Keyword.put(spandex_opts(), :verbose?, verbose_trace)]),

Verified it's setting verbose? to true, but we're still only seeing the info log. No additional information for spans.

Looks like it will only print all the spans if the app's Logger level is set to debug: https://github.com/spandex-project/spandex_datadog/blob/master/lib/spandex_datadog/api_server.ex#L153

I can't run this at debug level at the moment due to other time constraints, but if you're OK to hold this open, I'll follow up with additional info. 🙇
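For anyone trying to reproduce this, the Logger level can also be raised at runtime rather than through build-time config. A minimal sketch, assuming nothing about the application beyond Elixir's standard `Logger`; the equivalent compile-time setting would be `config :logger, level: :debug`:

```elixir
require Logger

# Raise the Logger level at runtime so that debug-level messages are emitted.
# This affects the whole application, so it can be noisy in production.
Logger.configure(level: :debug)

Logger.debug("debug logging is enabled")
```

With the level at debug and `verbose?: true` passed to the API server, the span output described above should become visible.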

Ah, yeah, good call, sorry about that. We could probably make do with only debug-level logs. Either way, let's leave it open, and just let me know when you're ready :D

Just in case the error message itself isn't clear, I think what it's saying is that the Datadog Agent is expecting a field to be a string, but Spandex has encoded it as an unsigned integer. Since you're seeing some traces fail and not others, my guess would be that maybe you have some issue with trace ID conversion when continuing a distributed trace between services. It would be very helpful to get a sample of the debug trace output if possible, but I understand that's not trivial to do. 😅

This is probably a bug in spandex_datadog where we're not enforcing the same specifications that Datadog requires (but doesn't officially document), so sorry about that!
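One way to narrow down which field is affected, without full debug output, is to check the data attached to spans for values that are not strings. A minimal sketch, assuming the values in question are available as a map or keyword list; `MetaCheck` and `non_string_values/1` are hypothetical names and not part of Spandex or spandex_datadog:

```elixir
defmodule MetaCheck do
  # Hypothetical helper, not part of Spandex or spandex_datadog.
  # Returns the {key, value} pairs whose value is not a binary, i.e. the
  # entries that MessagePack would encode as something other than "str".
  def non_string_values(tags) do
    Enum.reject(tags, fn {_key, value} -> is_binary(value) end)
  end
end

# An integer value would be encoded as an integer rather than a string,
# so it is reported here; the string value is not.
MetaCheck.non_string_values(user_id: 42, env: "prod")
```

Logging the result of a check like this just before spans are sent would show which keys carry integers where the agent expects strings.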

Thanks for the follow-up. We should have time this week to look. Appreciate the patience.

I was just reviewing old issues and wondered whether you ever got to the bottom of this issue, @erikreedstrom, or if you found any more clues that might help someone else?

@erikreedstrom @GregMefford we ran into a similar issue, and in our case it was the tags. Since "tag values should be strings" is a limitation specific to Datadog, I would argue this should be handled in this adapter. I have created #16, which auto-converts tags to strings. What do you think?
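As an illustration of the idea, here is a minimal sketch of converting tag values to strings before sending. This is not the code from #16, which may differ; `TagStringifier` and `stringify/1` are hypothetical names, and falling back to `inspect/1` for complex terms is an assumption made for this example:

```elixir
defmodule TagStringifier do
  # Hypothetical helper illustrating the idea; the actual change in #16 may differ.
  # Accepts a map or keyword list and returns a map of string keys to string values.
  def stringify(tags) do
    Map.new(tags, fn {key, value} -> {to_string(key), stringify_value(value)} end)
  end

  # Binaries pass through untouched.
  defp stringify_value(value) when is_binary(value), do: value

  # Atoms and numbers implement String.Chars, so to_string/1 is safe.
  defp stringify_value(value) when is_atom(value) or is_number(value), do: to_string(value)

  # Anything else (maps, tuples, pids, ...) falls back to inspect/1,
  # because to_string/1 would raise for terms without String.Chars.
  defp stringify_value(value), do: inspect(value)
end

TagStringifier.stringify(user_id: 42, admin: true, env: "prod")
```

Doing the conversion in the adapter keeps the Datadog-specific restriction out of application code, which is the argument made above.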

I believe this is resolved by #16