[metrics]: Exporter randomly detached from service-account
b-lancaster opened this issue · 2 comments
Context information
- AWS service: OpenSearch and Lambda
- CloudWatch namespace: AWS/ES and AWS/Lambda
- Link to metrics documentation for this service: OpenSearch and Lambda
- AWS region of the exporter: us-east-1
- AWS region of the service: us-east-1
- Exporter version: 0.15.5
Exporter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cloudwatch-monitoring-general
  namespace: monitoring
data:
  config.yml: |
    ---
    region: us-east-1
    delay_seconds: 0
    set_timestamp: false
    use_get_metric_data: true
    metrics:
      - aws_namespace: AWS/Lambda
        aws_metric_name: Errors
        aws_dimensions: [FunctionName]
        aws_statistics: [Sum]
      - aws_namespace: AWS/Lambda
        aws_metric_name: Invocations
        aws_dimensions: [FunctionName]
        aws_statistics: [Sum]
      - aws_namespace: AWS/Lambda
        aws_metric_name: Duration
        aws_dimensions: [FunctionName]
        aws_statistics: [Average]
      - aws_namespace: AWS/Lambda
        aws_metric_name: Throttles
        aws_dimensions: [FunctionName]
        aws_statistics: [Sum]
      - aws_namespace: AWS/Lambda
        aws_metric_name: OffsetLag
        aws_dimensions: [FunctionName]
        aws_statistics: [Maximum]
      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolIndexQueue
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolWriteQueue
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolSearchQueue
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolIndexQueue
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolWriteQueue
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolSearchQueue
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: WriteLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: WriteLatency
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: ReadLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: SearchLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: SearchLatency
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: IndexingLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: IndexingLatency
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: IndexingRate
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: IndexingRate
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: SearchRate
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: SearchRate
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]
      - aws_namespace: AWS/ES
        aws_metric_name: 5xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]
      - aws_namespace: AWS/ES
        aws_metric_name: 2xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]
      - aws_namespace: AWS/ES
        aws_metric_name: 3xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]
      - aws_namespace: AWS/ES
        aws_metric_name: 4xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]
      - aws_namespace: AWS/ES
        aws_metric_name: ClusterStatus.red
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]
      - aws_namespace: AWS/ES
        aws_metric_name: ClusterStatus.yellow
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]
      - aws_namespace: AWS/ES
        aws_metric_name: ClusterIndexWritesBlocked
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: Nodes
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Minimum]
      - aws_namespace: AWS/ES
        aws_metric_name: AutomatedSnapshotFailure
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]
      - aws_namespace: AWS/ES
        aws_metric_name: KibanaHealthyNodes
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Minimum]
      - aws_namespace: AWS/ES
        aws_metric_name: CPUUtilization
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]
      - aws_namespace: AWS/ES
        aws_metric_name: FreeStorageSpace
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Minimum]
      - aws_namespace: AWS/ES
        aws_metric_name: JVMMemoryPressure
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]
Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: cloudwatch-exporter
  name: cloudwatch-exporter
  namespace: monitoring
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<REDACTED_ACCOUNT_NUMBER>:role/CloudWatchMetricsReadOnlyRole
IAM Role
# Role for cloudwatch metrics exporter
# Role for cloudwatch metrics exporter
rCloudWatchMetricsReadOnlyRole:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument: !Sub
      - |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Federated": "${IamOidcProviderArn}"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                "StringEquals": {
                  "${OidcProviderEndpoint}:sub": "system:serviceaccount:monitoring:cloudwatch-exporter"
                }
              }
            }
          ]
        }
      - IamOidcProviderArn: !Ref pOidcProviderArn
        OidcProviderEndpoint: !Ref pIssuerHostPath
    Path: /
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/CloudWatchReadOnlyAccess
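For anyone debugging a setup like this: the `StringEquals` condition above must match the `sub` claim of the token projected into the pod exactly. This is a hypothetical helper (not part of the exporter) that decodes that claim so it can be compared against the trust policy:

```shell
# Hypothetical helper: print the "sub" claim of a projected service-account
# token. The JWT payload is the second dot-separated, base64url-encoded
# segment; tr converts base64url characters back to standard base64.
print_token_sub() {
  cut -d. -f2 "$1" | tr '_-' '/+' | base64 -d 2>/dev/null | grep -o '"sub"[^,}]*'
}

# Inside the exporter pod, the token is normally projected at:
#   print_token_sub /var/run/secrets/eks.amazonaws.com/serviceaccount/token
```

The output should read `"sub": "system:serviceaccount:monitoring:cloudwatch-exporter"`; any mismatch in namespace or service-account name makes `sts:AssumeRoleWithWebIdentity` fail silently from the exporter's point of view.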
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudwatch-metrics-exporter-general
  labels:
    app.kubernetes.io/name: cloudwatch-metrics-exporter
    app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
  namespace: monitoring
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cloudwatch-metrics-exporter
      app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloudwatch-metrics-exporter
        app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: k8s.swacorp.com/instancegroup
                    operator: In
                    values:
                      - operations-job-nodes
                      - arm-operations-job-nodes
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
                      - amd64
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
      tolerations:
        - effect: NoExecute
          key: k8s.swacorp.com/dedicated
          operator: Equal
          value: operations-server
        - effect: NoSchedule
          key: kubernetes.io/arch
          operator: Equal
          value: arm64
      serviceAccountName: cloudwatch-exporter
      containers:
        - name: cloudwatch-metrics-exporter
          image: quay.io/prometheus/cloudwatch-exporter:v0.15.5
          ports:
            - containerPort: 9106
          resources:
            requests:
              cpu: 100m
              memory: 600Mi
          volumeMounts:
            - mountPath: /config
              name: cloudwatch-metric-general
      volumes:
        - configMap:
            name: cloudwatch-monitoring-general
          name: cloudwatch-metric-general
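When IRSA is working, the EKS pod identity webhook injects `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` into the container; if they are missing, the AWS SDK falls back to the node's instance profile, which is exactly the failure mode in the logs below. A hypothetical check (the exporter image may not ship a shell, in which case `kubectl get pod -o yaml` shows the injected env instead):

```shell
# Hypothetical helper: run inside the pod (or source into a debug container
# that shares the pod's service account) to confirm the IRSA variables that
# the EKS pod identity webhook should have injected.
check_irsa_env() {
  if [ -z "${AWS_ROLE_ARN:-}" ] || [ -z "${AWS_WEB_IDENTITY_TOKEN_FILE:-}" ]; then
    echo "IRSA env missing: SDK will fall back to the node role" >&2
    return 1
  fi
  echo "IRSA env present: ${AWS_ROLE_ARN}"
}
```

Note that these variables are set once at pod admission; if the token file becomes unreadable or credentials fail to refresh later, the env vars can be present while the SDK still falls through the credential chain to the node role.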
Exporter logs
Mar 15, 2024 4:15:46 PM io.prometheus.cloudwatch.CloudWatchCollector collect
WARNING: CloudWatch scrape failed
software.amazon.awssdk.services.cloudwatch.model.CloudWatchException: User: arn:aws:sts::<REDACTED_ACCOUNT_NUMBER>:assumed-role/<REDACTED_NODE_IAM_ROLE_NAME>/<REDACTED_INSTANCE_ID> is not authorized to perform: cloudwatch:GetMetricData because no identity-based policy allows the cloudwatch:GetMetricData action (Service: CloudWatch, Status Code: 403, Request ID: 7079164c-7404-48cf-98e2-8b13d5ccf27a)
at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:41)
at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
at software.amazon.awssdk.services.cloudwatch.DefaultCloudWatchClient.getMetricData(DefaultCloudWatchClient.java:1249)
at io.prometheus.cloudwatch.GetMetricDataDataGetter.fetchAllDataPoints(GetMetricDataDataGetter.java:138)
at io.prometheus.cloudwatch.GetMetricDataDataGetter.<init>(GetMetricDataDataGetter.java:185)
at io.prometheus.cloudwatch.CloudWatchCollector.scrape(CloudWatchCollector.java:486)
at io.prometheus.cloudwatch.CloudWatchCollector.collect(CloudWatchCollector.java:642)
at io.prometheus.client.Collector.collect(Collector.java:45)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:204)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:162)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:190)
at io.prometheus.client.CollectorRegistry.metricFamilySamples(CollectorRegistry.java:129)
at io.prometheus.client.servlet.common.exporter.Exporter.doGet(Exporter.java:75)
at io.prometheus.client.servlet.jakarta.exporter.MetricsServlet.doGet(MetricsServlet.java:52)
at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:500)
at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:529)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1381)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1303)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
at org.eclipse.jetty.server.Server.handle(Server.java:563)
at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
at java.base/java.lang.Thread.run(Unknown Source)
What do you expect to happen?
I expected the cloudwatch-exporter to use the attached service account, which has the permissions necessary to retrieve metric data.
What happened instead?
What actually happened was that the cloudwatch-exporter stopped using the service account and fell back to the Kubernetes node's IAM role. Nothing in our setup changed; we simply stopped receiving the metrics in Prometheus and then found the logs above.
Restarting the Deployment fixed the problem and the exporter started using the service account again. The concern is that if this had happened in a production environment, the Prometheus alerts we've set up to monitor these metrics wouldn't have met the threshold needed to fire.
Also, without looking at the logs, the pod appeared to be running normally.
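One way to catch this silent failure is a guard alert on the exporter's own health. The sketch below assumes the exporter's `cloudwatch_exporter_scrape_error` metric (reported as nonzero when a scrape fails) and a hypothetical rule-file layout; the per-metric series names follow the exporter's naming convention (e.g. `aws_lambda_invocations_sum`), so adjust them to your config:

```yaml
groups:
  - name: cloudwatch-exporter-health   # hypothetical group name
    rules:
      # Fires when the exporter itself reports scrape failures, e.g. the
      # AccessDenied errors shown in the logs above.
      - alert: CloudWatchExporterScrapeError
        expr: max_over_time(cloudwatch_exporter_scrape_error[10m]) > 0
        for: 10m
      # Fires when an expected series disappears entirely, which threshold
      # alerts on the series itself cannot detect.
      - alert: CloudWatchExporterNoData
        expr: absent(aws_lambda_invocations_sum)
        for: 15m
```

The `absent()` rule is the important one here: a threshold alert on a missing series never evaluates, which is why the existing alerts stayed quiet.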
Same here; I don't know why it stops using the service account and prefers the node role instead.
Same here, on a fresh deployment using https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-cloudwatch-exporter
Restarting the Deployment does not fix it; the exporter keeps using the Karpenter role.