[BUG] Extra samples added when Cloudwatch metric stops
qcattez opened this issue · 6 comments
Is there an existing issue for this?
- I have searched the existing issues
YACE version
v0.61.2
Config file
apiVersion: v1alpha1
sts-region: us-west-2
discovery:
  exportedTagsOnMetrics:
    AWS/Firehose:
      - Environment
      - Platform
  jobs:
    - type: AWS/Firehose
      regions:
        - us-west-2
      addCloudwatchTimestamp: true
      delay: 600
      period: 300
      length: 1200
      nilToZero: true
      statistics:
        - Sum
      metrics:
        - name: DeliveryToRedshift.Records
        - name: DeliveryToRedshift.Bytes
        - name: DeliveryToRedshift.Success
        - name: DeliveryToS3.Records
        - name: DeliveryToS3.Bytes
        - name: DeliveryToS3.Success
Current Behavior
When configured with a period of 5m and a matching scraping-interval, the last sample of a metric is duplicated by YACE.
Here is the Cloudwatch graph:
And here is the Grafana graph:
Querying the metric with the AWS CLI shows that no extra samples are present (the duplicated value is 143170.0 on the Grafana graph):
{
    "MetricDataResults": [
        {
            "Id": "m1",
            "Label": "DeliveryToRedshift.Records",
            "Timestamps": [
                "2024-07-29T11:15:00+00:00",
                "2024-07-29T11:10:00+00:00",
                "2024-07-29T11:05:00+00:00",
                "2024-07-29T11:00:00+00:00",
                "2024-07-29T10:55:00+00:00",
                "2024-07-29T10:50:00+00:00",
                "2024-07-29T10:45:00+00:00",
                "2024-07-29T10:40:00+00:00",
                "2024-07-29T10:35:00+00:00",
                "2024-07-29T10:30:00+00:00",
                "2024-07-29T10:25:00+00:00",
                "2024-07-29T10:20:00+00:00",
                "2024-07-29T10:15:00+00:00"
            ],
            "Values": [
                143170.0,
                4615248.0,
                2366956.0,
                3730344.0,
                3370700.0,
                3448014.0,
                3160632.0,
                2367084.0,
                3018033.0,
                2441417.0,
                2579474.0,
                1866146.0,
                2005916.0
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}
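For reference, a get-metric-data call along these lines produces output like the above (the delivery stream name is a placeholder, not taken from our setup):

# Query the raw data points for one metric directly from CloudWatch
# (replace <delivery-stream-name> with the actual Firehose stream name).
aws cloudwatch get-metric-data \
  --region us-west-2 \
  --start-time 2024-07-29T10:15:00Z \
  --end-time 2024-07-29T11:20:00Z \
  --metric-data-queries '[
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Firehose",
          "MetricName": "DeliveryToRedshift.Records",
          "Dimensions": [
            {"Name": "DeliveryStreamName", "Value": "<delivery-stream-name>"}
          ]
        },
        "Period": 300,
        "Stat": "Sum"
      }
    }
  ]'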
Note: 1 extra sample is added with a period of 5m. When the period is 1m, 4 extra samples are generated.
Expected Behavior
No extra samples are generated.
Steps To Reproduce
We deploy YACE with kustomize.
kustomization.yaml:
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cloudwatch-exporter
helmCharts:
  - name: yet-another-cloudwatch-exporter
    repo: https://nerdswords.github.io/helm-charts
    version: 0.38.0
    releaseName: cloudwatch-exporter
    namespace: cloudwatch-exporter
    valuesFile: values.yaml
    apiVersions:
      - monitoring.coreos.com/v1
values.yaml:
fullnameOverride: cloudwatch-exporter
serviceAccount:
  create: false
  name: cloudwatch-exporter
securityContext:
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
testConnection: false
resources:
  requests:
    cpu: 10m
    memory: 128Mi
  limits:
    memory: 128Mi
extraArgs:
  scraping-interval: 300
serviceMonitor:
  enabled: true
  interval: 300s
config: |-
  apiVersion: v1alpha1
  sts-region: us-west-2
  discovery:
    exportedTagsOnMetrics:
      AWS/Firehose:
        - Environment
        - Platform
    jobs:
      - type: AWS/Firehose
        regions:
          - us-west-2
        addCloudwatchTimestamp: true
        delay: 600
        period: 300
        length: 1200
        nilToZero: true
        statistics:
          - Sum
        metrics:
          - name: DeliveryToRedshift.Records
          - name: DeliveryToRedshift.Bytes
          - name: DeliveryToRedshift.Success
          - name: DeliveryToS3.Records
          - name: DeliveryToS3.Bytes
          - name: DeliveryToS3.Success
To generate the manifests: kustomize build . --enable-helm
Don't mind the serviceAccount.create: false, we create it elsewhere with the IRSA annotation.
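For completeness, one way to apply the generated manifests (the kubectl step is how we typically do it, adjust to your workflow):

# Render the chart with kustomize and pipe the manifests to the cluster
kustomize build . --enable-helm | kubectl apply -f -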
Anything else?
Hope someone can help 🙏
👋 @qcattez, the behavior you are seeing is due to the usage of nilToZero: true in your configuration. Removing it or explicitly setting it to false (the default is false) should fix your issue.
What's happening is that CloudWatch keeps returning resources from ListMetrics for a variable length of time even when they no longer have data points. When the exporter calls GetMetricData and no data point is found in the time window, nilToZero dictates whether a zero or a nil is produced. Producing a zero gives you continuity in your graph panels, while a nil leaves gaps, much like CloudWatch does.
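A minimal sketch of that change against your job definition (everything except the nilToZero line is copied from your config):

discovery:
  jobs:
    - type: AWS/Firehose
      regions:
        - us-west-2
      # Leave a gap (nil) when CloudWatch has no data point in the window
      # instead of emitting a zero; false is also the default when omitted.
      nilToZero: false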
Hi @kgeckhart 👋
Thanks for helping out ;)
I tried setting nilToZero: false but it doesn't change the behavior I mentioned: there is still an extra sample with the same value (not zero).
Furthermore, as we can see in the screenshots and graphs, even with nilToZero: true, I didn't get continuity in my graph :/
Do you have another idea?
Just commenting to get some help on this 🙏
You say you want continuity in your graphs; for that to happen you need consistent sampling, or Prometheus will mark the value as stale. Nils can be connected through panel settings if you are using Grafana. Continuity cannot be achieved if the sample is dropped, so which is more important to you?
I think there's a misunderstanding: my problem is about a wrong extra sample, not about continuity.
I annotated the screenshots to be more explicit.
Here are the samples from Cloudwatch:
And here are the samples retrieved by YACE and presented in Grafana:
We see that the data is wrong, with a sample that shouldn't exist.
nilToZero has no impact on this behavior.
Can someone reproduce it?
Is it expected looking at my configuration?
For anyone stumbling upon the same issue, I finally got my Cloudwatch metrics without errors.
Updated fields in my values.yaml:
extraArgs:
  scraping-interval: 60
serviceMonitor:
  enabled: true
  interval: 300s
config: |-
  apiVersion: v1alpha1
  sts-region: us-west-2
  discovery:
    exportedTagsOnMetrics:
      AWS/Firehose:
        - Environment
        - Platform
    jobs:
      - type: AWS/Firehose
        regions:
          - us-west-2
        delay: 600
        period: 300
        length: 300
        nilToZero: true
        statistics:
          - Sum
        metrics:
          - name: DeliveryToRedshift.Records
          - name: DeliveryToRedshift.Bytes
          - name: DeliveryToRedshift.Success
          - name: DeliveryToS3.Records
          - name: DeliveryToS3.Bytes
          - name: DeliveryToS3.Success
A scraping-interval of 60 instead of 300 removed the wrong extra samples at the end.
A length of 300 was enough.
addCloudwatchTimestamp: true was introducing wrong samples in the middle of the time series.
nilToZero: true now performs as expected.