Expose metrics and collect with CloudWatch
miltodorov opened this issue · 15 comments
Is your feature request related to a problem? Please describe.
I have provisioned an EKS Cluster (EC2-backed) with EBS csi driver, CloudWatch addons enabled.
What bothers me is that I don't seem to be getting metrics such as kubelet_volume_stats_available_bytes into CloudWatch (Container Insights has been enabled).
Describe the solution you'd like in detail
It would be very helpful to have instructions on how to get these metrics.
I haven't even encountered an official resource that states what kind of metrics can be collected.
Describe alternatives you've considered
I tried setting alarms based on CloudWatch EC2 disk metrics, but that does not work well because the mount points are dynamic (the EBS CSI driver provisions volumes dynamically to satisfy pod requirements).
Additional context
I can see that the metrics are indeed available. When I run:
kubectl get --raw /api/v1/nodes/HOST/proxy/metrics | grep 'kubelet_volume_stats_available_bytes'
I do get results, but I expected these to be available in CloudWatch.
Hi @miltodorov thank you for raising this issue.
Regarding:
I haven't even encountered an official resource that states what kind of metrics can be collected.
I will make sure this work gets prioritized. Our team should provide an overview of what metrics we expose. Thank you for this suggestion; this is something we can improve on.
It would be very helpful to have instructions on how to get these metrics. ... I do get results, but I expected these to be available in CloudWatch.
Our team makes sure the CSI driver metrics are exposed at the appropriate endpoints, but we do not have insight into how the Amazon CloudWatch Observability add-on scrapes those metrics. You are able to scrape these metrics with Prometheus.
You may also want to file an EKS support ticket or open an issue on aws/containers-roadmap about either adding Prometheus support to the observability add-on, or built-in observability for persistent volumes and CSI drivers.
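For anyone landing here, a minimal sketch of what scraping the kubelet with plain Prometheus could look like; the TLS and token paths assume in-cluster service-account credentials and default kubelet settings, so adjust them to your environment:

# Sketch: a standard Prometheus scrape_config for the kubelet's /metrics endpoint,
# which is where the kubelet_volume_stats_* series come from. Paths assume the
# in-cluster service-account token and CA; adjust for your setup.
scrape_configs:
  - job_name: kubelet
    scheme: https
    kubernetes_sd_configs:
      - role: node            # discover every node's kubelet
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token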
@AndrewSirenko Thank you for your support on this!
To be honest, I find it strange that the data is there but it's not being collected by CloudWatch.
Hopefully, the situation will change in the near future.
Hi @miltodorov, I found this documentation (apologies for not finding it earlier); does it suit your needs? aws-ebs-csi-driver/docs/metrics.md at master · kubernetes-sigs/aws-ebs-csi-driver
If not, what else would you like to see added?
Thank you!
Hi @AndrewSirenko - Thank you for your time and dedication!
There is one problem with that solution: it relies on Prometheus and, if I am not mistaken, that means more configuration is needed to get Prometheus data into CloudWatch.
I did install the driver via Helm (we use Terraform) and we have no issues provisioning it this way instead of as an add-on, but it would be best if you could provide me with a way to get those metrics ingested into CloudWatch.
In fact, I think it would be best if CloudWatch and the EBS driver worked together by default and collected volume stats metrics with no user input required, apart from enabling both add-ons.
Kindest regards!
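For context on the extra configuration involved: the CloudWatch agent does ship a Prometheus scraping mode (Container Insights Prometheus metrics), driven by an agent configuration roughly like the sketch below. The config path, label matcher, and dimensions here are illustrative assumptions only; check the exact schema against the CloudWatch agent documentation.

{
  "logs": {
    "metrics_collected": {
      "prometheus": {
        "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
        "emf_processor": {
          "metric_declaration": [
            {
              "source_labels": ["job"],
              "label_matcher": "kubelet",
              "dimensions": [["ClusterName", "NodeName"]],
              "metric_selectors": ["^kubelet_volume_stats_"]
            }
          ]
        }
      }
    }
  }
}

The file referenced by prometheus_config_path is a regular Prometheus scrape configuration (similar to the kubelet sketch earlier in this thread), so this is indeed a second layer of configuration on top of the driver itself.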
Can you elaborate on this, @miltodorov? I am hitting a problem where I cannot find kubelet_volume_stats_capacity_bytes at all, not even when querying the metric with kubectl.
I have EKS 1.30 with the EBS CSI driver v1.32.0-eksbuild.1 deployed via the addon. Is it necessary to deploy it with Helm instead?
/close
I haven't even encountered an official resource that states what kind of metrics can be collected.
For metrics provided by the kubelet, Kubernetes documents the available metrics here: https://kubernetes.io/docs/reference/instrumentation/metrics/ (look for the metrics that begin with kubelet_volume_).
Metrics exposed directly by the driver itself are documented here: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md
For CloudWatch ingestion, I am closing this issue in favor of the containers-roadmap issue, because it is a feature request for EKS and not for the EBS CSI driver itself: aws/containers-roadmap#2377. I would also recommend that any customers with AWS support contracts and/or a TAM/SA assigned to their account reach out to those sources to request this feature and increase its priority on EKS's side.
@ConnorJC3: Closing this issue.
@irizzant Sorry for the late reply!
In my case, I'm using EKS with CloudWatch and EBS Addons enabled!
I believe that, in order to be able to scrape that metric, the CloudWatch add-on needs to be enabled.
@miltodorov thanks for the reply
I think there's some confusion here.
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/metrics.md shows that kubelet_volume_stats_capacity_bytes is available as a metric, and it should be emitted by the EBS CSI controller.
It also says that enableMetrics should be set to true:
Enable metrics by setting enableMetrics: true in values.yaml.
I haven't found a way to set enableMetrics using the EBS addon; please let me know @ConnorJC3 @miltodorov if I'm wrong here.
Consequently, I think the only way is to install the EBS CSI driver via the Helm chart.
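For reference, a minimal sketch of that Helm-based route, taking the enableMetrics key from the line quoted above; depending on the chart version the key may live under the controller block instead, so verify it against the chart's values.yaml before applying:

# values.yaml sketch for the aws-ebs-csi-driver Helm chart (key name taken from
# the docs/metrics.md quote above; placement may differ between chart versions).
enableMetrics: true

Passing this via -f values.yaml on helm upgrade/install only helps Helm installs; as discussed here, the EKS addon does not currently expose this value.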
@irizzant
Yes, you are indeed right; now I remember I did have a custom CloudWatch config.
Quite some time has passed since I last used this, but I can confirm I was experimenting with a custom approach at one point.
This was one of the reasons I submitted this issue: in my opinion, this metric needs to be collected by default.
That document is misleading; I'll get it fixed. Any metric beginning with kubelet_volume_ is emitted by the kubelet, not the CSI driver itself.
I'll also open an issue internally to expose enableMetrics in the addon version of the driver. In the meantime, you can enable the metrics manually by passing --http-endpoint=0.0.0.0:3301 to controller.additionalArgs (which should be available in the addon). You'll additionally need to deploy your own Service and ServiceMonitor if using Prometheus; example here:
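To make that concrete, here is a rough sketch of what the pieces could look like. The addon configurationValues follow the Helm chart's value names per the comment above, while the Service/ServiceMonitor names, labels, and selectors below are assumptions that must be matched to your actual controller pod labels:

# Sketch 1: EKS addon configurationValues (same shape as the Helm chart values),
# enabling the controller metrics endpoint via controller.additionalArgs.
controller:
  additionalArgs:
    - "--http-endpoint=0.0.0.0:3301"
---
# Sketch 2: a Service exposing the metrics port. The selector label below is an
# assumption; match it to the labels on your ebs-csi-controller pods.
apiVersion: v1
kind: Service
metadata:
  name: ebs-csi-controller-metrics
  namespace: kube-system
  labels:
    app: ebs-csi-controller
spec:
  selector:
    app: ebs-csi-controller
  ports:
    - name: metrics
      port: 3301
      targetPort: 3301
---
# Sketch 3: a Prometheus Operator ServiceMonitor scraping that Service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ebs-csi-controller
  namespace: kube-system
spec:
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app: ebs-csi-controller
  endpoints:
    - port: metrics
      interval: 30s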
That document is misleading, I'll get it fixed. Any metric beginning with kubelet_volume_ is emitted by the kubelet and not the CSI driver itself.
@ConnorJC3 this makes a lot more sense.
I've tried what you suggested:
And I have also deployed a Service as shown here, but when I point the browser to http://localhost:3301/metrics after enabling port forwarding, all I get is a blank page.
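(For anyone reproducing this, the port-forward being described might look roughly like the following; the Service name here is an assumption standing in for whatever Service was actually created.)

# Hypothetical Service name; substitute the Service you actually deployed.
kubectl -n kube-system port-forward service/ebs-csi-controller-metrics 3301:3301
curl http://localhost:3301/metrics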
@ConnorJC3 Could it be that exposing the endpoint at the controller level is not enough? Maybe there is another arg that must be passed to the controller to enable metrics?
@irizzant Go's Prometheus client library doesn't return metrics on the HTTP endpoint until they have been observed at least once, so a blank endpoint would be expected immediately after enabling metrics. Please mount/create/etc. some volumes and recheck the metrics after that. (Also, if you are running multiple replicas of the driver, as is the default, you may need to hit the service multiple times until you reach the replica that is currently operating as the leader.)
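As an illustration of "create some volumes", a throwaway PVC like the sketch below would drive a CreateVolume call once a pod consumes it; the StorageClass name here is an assumption and must be one backed by the EBS CSI driver:

# Hypothetical test PVC; with a WaitForFirstConsumer StorageClass the volume (and
# therefore the CreateVolume metric samples) only appears once a pod mounts it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-metrics-test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3   # assumption: an EBS-CSI-backed StorageClass
  resources:
    requests:
      storage: 1Gi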
@ConnorJC3 thanks for the updates.
I created the Service again, did the port-forward, and now I get these metrics:
# HELP cloudprovider_aws_api_request_duration_seconds [ALPHA] ebs_csi_aws_com metric
# TYPE cloudprovider_aws_api_request_duration_seconds histogram
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.1"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.25"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="1"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="2.5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="10"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="+Inf"} 2
cloudprovider_aws_api_request_duration_seconds_sum{request="AttachVolume"} 0.998204049
cloudprovider_aws_api_request_duration_seconds_count{request="AttachVolume"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="0.1"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="0.25"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="0.5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="1"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="2.5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="10"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="CreateVolume",le="+Inf"} 2
cloudprovider_aws_api_request_duration_seconds_sum{request="CreateVolume"} 0.5779328530000001
cloudprovider_aws_api_request_duration_seconds_count{request="CreateVolume"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.1"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.25"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="1"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="2.5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="5"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="10"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="+Inf"} 2
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeInstances"} 0.153815237
cloudprovider_aws_api_request_duration_seconds_count{request="DescribeInstances"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.1"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.25"} 5
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.5"} 5
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="1"} 5
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="2.5"} 5
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="5"} 5
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="10"} 5
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="+Inf"} 5
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeVolumes"} 0.522737136
cloudprovider_aws_api_request_duration_seconds_count{request="DescribeVolumes"} 5
If I hit the kubelet metrics:
kubectl get --raw /api/v1/nodes/<node>/proxy/metrics | grep 'kubelet_volume_stats_available_bytes'
# HELP kubelet_volume_stats_available_bytes [ALPHA] Number of available bytes in the volume
# TYPE kubelet_volume_stats_available_bytes gauge
kubelet_volume_stats_available_bytes{namespace="kubecost",persistentvolumeclaim="kubecost-cost-analyzer"} 3.3413926912e+10