DataDog/datadog-agent

under reporting of count metrics when using a sidecar in aws fargate with DogStatsD and multiple tasks per service

Opened this issue · 27 comments

danbf commented
sorry can't paste agent info here as it's a fargate sidecar and i can't ssh into it.

Describe what happened:
under-reporting of count metrics is observed when using a sidecar in aws fargate with metrics sent via DogStatsD and multiple task instances per service. when using a Service Type of replica and Number of tasks > 1, count metrics are under-reported by a factor of 1/<Number of tasks>. this only occurs for Number of tasks > 1.

this happens as a result of two behaviors.

  1. metrics of type count only accept one count per sample interval from a single source. any additional counts received from that source during the interval are considered duplicates and dropped. this is normal behavior.
  2. aws fargate assigns each task instance running for a single service and task definition the same hostname parameter value. this is the current aws fargate behavior.
they seem to get a hostname of the format:
`fargate_task:arn:aws:ecs:<region>:<account>:task/prod/<task identifier>`

but the `<task identifier>` is not set to be unique.

as a result the count metrics from each of the service's task instances are considered as coming from the same source (hostname), so only one count metric is processed per sample interval and the rest are discarded. this reduces the summed count per interval to a single count rather than the sum of the counts from all instances. for example, with two task instances each reporting a count of 5 in an interval, the summed value shows 5 instead of 10. if each of the service's task instances had a unique hostname set by aws fargate then all the count metrics would be processed and summed together as the summed count for that sample interval.

while the hostname is not set uniquely per task instance for a service, there is a parameter that is unique: the TaskARN, and it's available to the container via the Task Metadata Endpoint https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v2.html .

so incorporating something like the below, which leverages the uniqueness of the TaskARN, into the ecs entrypoint for the datadog agent https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/entrypoint/50-ecs.sh#L13 would fix this by setting DD_HOSTNAME to something unique per task instance.

if [[ -n "${ECS_FARGATE}" ]]; then
  taskid=$(curl 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk  -F\" '{print $1}')
  export DD_HOSTNAME=$taskid
fi

this is based off of #2288 (comment) and https://github.com/aws/amazon-ecs-agent/issues/3#issuecomment-437643239 and we have confirmed this is working by setting our dockerfile to:

FROM datadog/agent:6.10.1

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]

and our /entrypoint.sh to:

#!/bin/bash

if [[ -n "${ECS_FARGATE}" ]]; then
  # grab the task id (the last segment of the TaskARN) from the task metadata endpoint
  taskid=$(curl 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk -F\" '{print $1}')
  export DD_HOSTNAME=$taskid
fi

/init

Describe what you expected:
count metrics should be counted from each task instance running in fargate.

Steps to reproduce the issue:

  1. set up an aws fargate service that uses a Service Type of replica and Number of tasks > 1 and runs the datadog container as a sidecar per https://www.datadoghq.com/blog/monitor-aws-fargate/
  2. have that service's container produce a count metric and send it to datadog via the DogStatsD interface (a minimal example datagram is sketched below this list)
  3. that count metric should be under-reported
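
for illustration, the count in step 2 can be sent as a raw DogStatsD datagram over UDP, e.g. using bash's /dev/udp redirection. a minimal sketch (the metric name is a placeholder, and it assumes the agent sidecar is listening on the default port 8125 and is reachable on localhost inside the task):

# send a single count increment to the local DogStatsD listener
echo -n "example.requests.count:1|c" > /dev/udp/127.0.0.1/8125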

Additional environment details (Operating System, Cloud provider, etc):
AWS Fargate
DataDog agents datadog/agent:6.5.2 and datadog/agent:6.10.1 from dockerhub

danbf commented

on a regular aws ecs host the hostname seems to be set to the docker container id, which you can get via the docker ps command on that host
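
for reference, a quick way to list those container ids on the ecs ec2 host is the standard docker cli:

# list running containers with their short ids and names
docker ps --format '{{.ID}}  {{.Names}}'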

danbf commented

@mikezvi i'm suspecting this is related to #2288 .

seems that without a host object, several task instances all somehow get assigned the same host object.

danbf commented

@mfpierre i'm wondering if this commit 5867aae has the unintended consequence of having DogStatsD-relayed metrics show up with no host set for them, and as a result count metrics under-reporting when a service runs several task instances?

we have confirmed that if DD_HOSTNAME is set uniquely on a per-task-instance basis in fargate, the metric under-reporting goes away. could we keep the hostname detection, but still leave out the host checks maybe?

danbf commented

@mfpierre Please see the text below, which describes the condition i think commit 5867aae inadvertently triggered, especially for count metrics:

Note: When removing the host tag, you are removing a unique identifier for the submission of custom metrics. When two datapoints are submitted with the same timestamp/metric/tag combination and do not have unique identifiers, the last received/processed value overwrites the value stored. To avoid this edge case, ensure that no host is submitting the same exact metric/tag combination at any given timestamp.

from: https://docs.datadoghq.com/developers/faq/how-to-remove-the-host-tag-when-submitting-metrics-via-dogstatsd/#pagetitle

Hey @danbf thanks for the report, indeed the removal of the hostname looks problematic in your use case.

One thing you could do instead of setting DD_HOSTNAME is using DD_DOGSTATSD_TAGS to inject the task_arn tag into all the dogstatsd metrics.
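
As a rough sketch (the ARN value below is just a placeholder; in practice you would read it from the task metadata endpoint, as in the entrypoint scripts in this thread), that could look like:

# hypothetical example: add the task ARN as a tag to every dogstatsd metric
export DD_DOGSTATSD_TAGS="task_arn:arn:aws:ecs:us-east-1:123456789012:task/prod/abc123"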

The other solution would be to set the agent dogstatsd tagger in orchestrator mode to be able to automatically inject the task_arn tag into dogstatsd metrics; for this to work you'll need to set up dogstatsd origin detection (via UDS) (see https://docs.datadoghq.com/developers/dogstatsd/unix_socket/) + set up the dogstatsd cardinality in the agent via DD_DOGSTATSD_TAG_CARDINALITY (with the warning it implies)

// The cardinality of tags to send for checks and dogstatsd respectively.
// Choices are: low, orchestrator, high.
// WARNING: sending orchestrator, or high tags for dogstatsd metrics may create more metrics
// (one per container instead of one per host).
// Changing this setting may impact your custom metrics billing.
config.BindEnvAndSetDefault("checks_tag_cardinality", "low")
config.BindEnvAndSetDefault("dogstatsd_tag_cardinality", "low")

danbf commented

@mfpierre here's the thing. it seems the default behavior of the datadog agent in fargate is to drop metrics that would get reported by the same version datadog agent running on an ecs ec2 cluster. and that is exactly what we saw when we moved a service from an ecs ec2 cluster to fargate. what other things are affected by, say, backing out 5867aae and letting the host get set in fargate as well?

hkaj commented

Hi @danbf
In Fargate, we purposely strip the hostname. We're not able to pull any of the host data to the DD agent because AWS hides it from us (which is exactly the point, you don't have to worry about the host here).

That includes core Datadog agent metrics like system.*, host metadata payloads, as well as the host tags. Traditionally with ECS, the host tag and its aliases were pulled from the existing EC2 + metadata endpoints, neither of which is available in Fargate.

Another reason why we disable the hostname and host tags is to avoid having tasks show up on your bill as hosts. I assure you it wouldn't be in your favor 😄

The solution @mfpierre recommends is our solution for this use case. Please let us know if it doesn't work for you, we would be interested to understand why.

danbf commented

@hkaj is this documented anywhere in the fargate datadog agent deployment guide? at the least i think it should be. i'm not seeing it here: https://www.datadoghq.com/blog/monitor-aws-fargate/ or here: https://docs.datadoghq.com/integrations/ecs_fargate/

but i fundamentally disagree. implementing PR #1182 did a bunch of things in disabling the host level checks. i think it went just one step too far and disabled the host tag. from what i see, the host level checks and metrics could be disabled and yet the setting of the host tag for DogStatsD could have been maintained. i'm also hopeful that any billing issues could be worked out.

hkaj commented

The point of Fargate is to abstract the host away, and focus on the task. We respected this when building the integration, and removed the host tag.

Even if we had kept the host tag, there's no information about the EC2 instance available via the API or otherwise. os.Hostname() returns the task name; looking up the fqdn would give either the task name or a name that contains the IP address of the ENI, which, again, is tied to a task, not the host.

It also doesn't make sense to surface anything host-related about a Fargate workload, even if we could retrofit the task name into the host field (which we could, you're right). It would make the product more confusing, since Fargate users don't expect to have to care about hosts.

The recommended way to differentiate tasks is to use the solution that mfpierre suggested. If this is not satisfying, we're open to other suggestions on how we can make the user experience better here, but using the host tag for a task is not it.

@hkaj Will the proposed solution of @mfpierre be a problem if we deploy multiple times a day, yielding a lot of different task-ids and therefore lots of different tags? The documentation [1] states that the number of metrics is limited, so we tried to avoid tags like that until now and also used the hostname tag (which works for our use cases so far).

[1] https://docs.datadoghq.com/developers/metrics/custom_metrics/#how-many-custom-metrics-am-i-allowed

hkaj commented

@tom-mi you're right, it will impact billing, because it creates one time series per task instance. It really depends on what level of granularity you need. If you don't need visibility about your custom metrics per task, i'd suggest not setting any task-level tag, to reduce the # of time series. If you need to aggregate them by task, you will need to add the task arn in there.

Frustrated DataDog customer here. 👋 Between this issue and #2288, I'd say the current DataDog agent behavior is going to be problematic for the large majority of Fargate users. It's unintuitive, confusing, and pretty much undocumented. Basic stuff like making sure there isn't any one task instance that's low on memory or counting number of requests served by all tasks isn't possible without custom configuration!

Hi @jfirebaugh

... making sure there isn't any one task instance that's low on memory or counting number of requests served by all tasks isn't possible without custom configuration!
You can still scope the ecs.fargate.* metrics by the container_id, container_name and ecs_container_name tags to do this, in addition to the task_arn (which is unique).

The only caveat with the current setup is with dogstatsd and using multiple instances of the same task.
We have a feature request opened on our side to add the task_arn as a tag when sending custom metrics with dogstatsd (this would be the same task_arn as the agent's, since both containers are running in the same task). It should resolve the issue by giving a unique tag (with a higher cardinality) without adding a hostname to the agent, which could cause billing issues.

Please reach out to our support team (support@datadoghq.com) if you'd like to open another feature request that you think is relevant.

Simon

The only caveat with the current setup is with dogstatsd and using multiple instances of the same task.

Sure, but using multiple instances is what everyone who wants redundancy or to scale horizontally will be doing. It's one of the main attractions of containerization.

We have a feature request opened on our side to add the task_arn as a tag when sending custom metrics with dogstatsd (this would be the same task_arn as the agent's, since both containers are running in the same task). It should resolve the issue by giving a unique tag (with a higher cardinality) without adding a hostname to the agent, which could cause billing issues.

That's great to hear! I think it will resolve the issue to everyone's satisfaction. It's almost exactly what I've implemented manually as a workaround, only I send just the task ID (last part of the ARN). My variant of @danbf's entrypoint script:

#!/bin/bash

if [[ -n "${ECS_FARGATE}" ]]; then
  task_id=$(curl --silent 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk -F\" '{print $1}')
  export DD_TAGS="$DD_TAGS task_id:$task_id"
  export DD_DOGSTATSD_TAGS="$DD_DOGSTATSD_TAGS task_id:$task_id"
fi

/init

Update for those who may be using the above workaround themselves: I found that the grep/awk pipeline did not reliably extract the correct value. I replaced it with jq:

  task_id=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')

and added jq installation to the Dockerfile:

RUN apt-get update && apt-get install -y jq && rm -rf /var/lib/apt/lists/*

@Simwar do you have an ETA for when the feature request will be worked on and released?

Can this issue be solved by using DD_DOGSTATSD_TAG_CARDINALITY=orchestrator, which seems to append the task ARN automatically? (possible billing surcharges still being an issue)

Can we get a bit more clarification on the costs you mention? I somewhat understand the DD_DOGSTATSD_TAGS cost already since that just seems to be an extra tag attached to each metric.

But how does adding DD_HOSTNAME affect my cost? Is it just the cost of increased metrics or do I get charged for 1000 unique hosts if I had 1000 unique task instances in a month?
https://www.datadoghq.com/pricing/#section-infrastructure says $1 per Fargate Task. Are the costs mentioned here with using DD_HOSTNAME in addition to the $1 I'd already be paying? A breakdown would be helpful to get clarity on which solution to pick - DD_HOSTNAME or DD_DOGSTATSD_TAGS.

For context, we deploy our application multiple times a day and probably have around 70-80 task instances per deploy.

danbf commented

we've switched away from bash for this, but will be looking at #5324 shortly. also would like to know the cost trade-off here.

our latest entrypoint.sh here:

#!/bin/bash

set -e
set -o pipefail

if [[ -n "${ECS_FARGATE}" ]]; then

  # poll the task metadata endpoint until it returns this task's private IP
  # (the endpoint may not be available immediately at container start)
  until [ -n "${private_ip}" ]; do
    private_ip=$(curl --silent 169.254.170.2/v2/metadata | python -c "import json, sys; print(json.loads(sys.stdin.read())['Containers'][0]['Networks'][0]['IPv4Addresses'][0])")
  done

  # the private IP is unique per task instance, so use it as the hostname
  export DD_HOSTNAME=fargate-$private_ip

fi

/init

We've tried #5324 but it seems like this does not always work. Multiple times, when AWS autoscales and starts a new task, the task does not send its task_arn.

This can cause the mentioned under-reporting as soon as two tasks are started that do not send the task_arn. This has already happened to us despite having DD_DOGSTATSD_TAG_CARDINALITY=orchestrator.

Hi @SteffenDE, thanks for looking into the DD_DOGSTATSD_TAG_CARDINALITY=orchestrator setting and for reporting this. Would you be able to open up a support ticket so that our team can look into why the task_arn isn't being added as a tag?

Just opened a ticket (#389042). I hope this helps you to find the cause of this issue.

I would like to +1 @SteffenDE's problem; we are encountering the same issue where task-arn is N/A.
We have also opened a ticket (#388***) but still without resolution.

Appreciate if you could bump up the priority of this since multiple people are getting the same issue.

Datadog is currently investigating this. For now we're using this workaround adapted from the comments above:

#!/bin/bash

set -e
set -o pipefail

if [[ -n "${ECS_FARGATE}" ]]; then
  echo "datadog agent starting up in ecs!"
  echo "trying to get task_arn from metadata endpoint..."

  until [ -n "${task_arn}" ]; do
    task_arn=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
  done

  echo "got it. starting up with task_arn $task_arn"
  export DD_HOSTNAME=task-$task_arn

fi

/init

and the corresponding Dockerfile:

FROM datadog/agent:7

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

RUN apt-get update \
    && apt-get install --no-install-recommends -y jq \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENTRYPOINT ["/entrypoint.sh"]

I found this issue by accident while trying to find the documentation on the "correct" way to work with custom metrics in Fargate, and whether sidecars were still the preferred option.

Since reading through it I admit I'm now very glad that I'm aware of the problem before any of our teams run up against it the hard way, and disappointed that I don't seem to see it mentioned in the blog posts or docs about using DD with Fargate.

I'm also not entirely clear, at this stage, on what's needed to have reliable counts for multiple tasks that are part of the same service.

Do I need to now supply DD_DOGSTATSD_TAG_CARDINALITY=orchestrator as part of the DD agent sidecar in my task definition?

If I do this, will it add cardinality for any custom metric reported by that task, similar to how metrics from an individual EC2 host would add cardinality?

Is this in the docs now, and I just missed it?

While the workaround DD_DOGSTATSD_TAG_CARDINALITY=orchestrator worked up to agent version 7.25.1, since agent version 7.26.0 metrics have reverted to under-reporting.

I've tried a number of environment variables with the latest version but none resolve the issue. Here are some of the settings I've tried:

  • DD_DOGSTATSD_TAG_CARDINALITY=high
  • removing DD_DOGSTATSD_TAG_CARDINALITY
  • AUTCONFIG_FROM_ENVIRONMENT=false

What has changed in 7.26.0?

While the workaround DD_DOGSTATSD_TAG_CARDINALITY=orchestrator worked up to agent version 7.25.1, since agent version 7.26.0 metrics have reverted to under-reporting.

I recently filed #7602 for this.