kubernetes-sigs/aws-ebs-csi-driver

Failed to retrieve instance data from ec2 metadata

brizaldi opened this issue · 12 comments

We're currently using the CIS Amazon Linux 2 image on Kubernetes version 1.29 and getting this error:

ebs-csi-node

I0508 04:49:41.864583       1 ec2.go:40] "Retrieving EC2 instance identity metadata" regionFromSession=""
I0508 04:49:41.864780       1 metadata.go:52] "failed to retrieve instance data from ec2 metadata; retrieving instance data from kubernetes api" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, request canceled, context deadline exceeded"
E0508 04:50:11.868297       1 main.go:154] "Could not determine region from any metadata service. The region can be manually supplied via the AWS_REGION environment variable." err="error getting instance data from ec2 metadata or kubernetes api"
panic: error getting instance data from ec2 metadata or kubernetes api

goroutine 1 [running]:
main.main()
    /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:155 +0xf27

Hi @brizaldi, the EBS CSI Driver's node service requires some source of instance/node metadata to function. By default, we attempt to use the EC2 Instance Metadata Service (IMDS), but fall back to querying the Kubernetes API. These errors indicate that neither source is reachable.

You will need to give the EBS CSI Node pods access to either IMDS (for example, by raising the hop limit; see our FAQ) or the Kubernetes API server (by finding and fixing whatever is blocking communication between the pod and the API server) for the driver to function.


Please ignore the "The region can be manually supplied via the AWS_REGION environment variable." part of the error message. While the EBS CSI Driver controller pod can function with just the region being passed in, the node pod cannot.
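
For readers following along, the lookup order described above can be sketched as follows. This is a minimal illustration in Python, not the driver's actual Go code: `resolve_instance_metadata`, `from_imds`, and `from_kubernetes_api` are hypothetical names, the returned values are made up, and the sources are passed in as plain callables so the fallback logic can be exercised without a cluster.

```python
from typing import Callable, Dict, Sequence, Tuple

# Each source is a (name, callable) pair; the callable returns instance
# metadata as a dict or raises an exception if the source is unreachable.
Source = Tuple[str, Callable[[], Dict[str, str]]]


def resolve_instance_metadata(sources: Sequence[Source]) -> Dict[str, str]:
    """Try each metadata source in order and return the first success."""
    errors = []
    for name, fetch in sources:
        try:
            return fetch()
        except Exception as exc:  # any failure means "try the next source"
            errors.append(f"{name}: {exc}")
    # Corresponds to the situation in the logs: no source was reachable.
    raise RuntimeError("no metadata source reachable (" + "; ".join(errors) + ")")


if __name__ == "__main__":
    def from_imds() -> Dict[str, str]:
        # Hypothetical stand-in for an IMDS request that times out.
        raise TimeoutError("context deadline exceeded")

    def from_kubernetes_api() -> Dict[str, str]:
        # Hypothetical stand-in for a Kubernetes API lookup; values are made up.
        return {"instance_id": "i-0123456789abcdef0", "region": "us-east-1"}

    sources = [("imds", from_imds), ("kubernetes", from_kubernetes_api)]
    print(resolve_instance_metadata(sources))
```

When every source in the list fails, the function raises instead of returning, which is the case the node pod hits in the logs above.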

I've already tried setting the hop limit to both 2 and 3, but I still get the same error.

Here's what I've set up in Terraform:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  eks_managed_node_groups = {
    default = {
      metadata_options = {
        http_endpoint               = "enabled"
        http_put_response_hop_limit = 2
        http_tokens                 = "required"
      }
      ...
    }
  }
}

By the way, do you know what port is used for communication between the pod and the Kubernetes API? I suspect the cause might be the CIS benchmark AMI I'm using, which may block the ports, since the error does not occur when I use the standard Amazon Linux AMI.

do you know what port is used for communication between the pod and the Kubernetes API?

IMDS is reached by contacting the special IP 169.254.169.254 (where AWS makes IMDS available to EC2 instances) on TCP port 80 (the standard HTTP port).
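
To narrow down which path is blocked, a plain TCP connect test run from inside a pod can help. The following is a minimal sketch, assuming Python is available in a debug container on the affected node; `can_connect` is a hypothetical helper name. The IMDS address and port are the ones stated above, and `KUBERNETES_SERVICE_HOST` / `KUBERNETES_SERVICE_PORT` are the environment variables Kubernetes injects into pods for the API server's in-cluster Service.

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


if __name__ == "__main__":
    # IMDS: link-local address, plain HTTP.
    print("IMDS 169.254.169.254:80 ->", can_connect("169.254.169.254", 80))

    # Kubernetes API: in-cluster Service address injected into pods.
    api_host = os.environ.get("KUBERNETES_SERVICE_HOST")
    api_port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    if api_host:
        print(f"API {api_host}:{api_port} ->", can_connect(api_host, int(api_port)))
    else:
        print("KUBERNETES_SERVICE_HOST is not set; probably not running in a pod")
```

A successful connect only shows that the port is reachable. With IMDSv2, a hop limit that is too low may still cause the token request itself to time out, so treat this as a first check rather than proof that IMDS is usable.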

I imagine the right outcome here is for the aws-ebs-csi-driver to add support for IMDSv2, especially now that AWS is pushing so hard for it and defaulting to disabling IMDSv1.

@dghubble The EBS CSI driver does support IMDSv2 and will use it if available. However, the default IMDSv2 configuration prevents containers from accessing it.

You can give the EBS CSI Driver access by running it in host networking mode, or you can give all containers access (note: this is generally considered a bad security practice) by increasing IMDSv2's hop limit.
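
To check whether a given pod can actually complete an IMDSv2 request, the two-step token flow can be reproduced by hand. This is a minimal sketch using only the Python standard library, assuming Python is available in the container under test; `fetch_identity_document` is a hypothetical helper name and the 60-second token TTL is an arbitrary choice. The paths and header names are the ones documented for IMDSv2.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254"


def fetch_identity_document(base_url: str = IMDS_BASE, timeout: float = 3.0) -> str:
    """Fetch the instance identity document using the IMDSv2 token flow."""
    # Step 1: request a session token with a PUT. When the hop limit is too
    # low for the container network, this is the call expected to time out.
    token_req = urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    # Step 2: present the token on the actual metadata request.
    doc_req = urllib.request.Request(
        base_url + "/latest/dynamic/instance-identity/document",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(doc_req, timeout=timeout) as resp:
        return resp.read().decode()


if __name__ == "__main__":
    try:
        print(fetch_identity_document())
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print("IMDS not reachable from here:", exc)
```

Running it once in a pod with host networking and once in a regular pod on the same node should show whether the hop limit change has taken effect for containers.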

Facing the same issue. Is there any update on a fix?

The fix is to configure your cluster so that the EBS CSI Driver node pods have access to either IMDS or the Kubernetes API. Access to one of the two is a hard requirement for use of the EBS CSI Driver.

@ConnorJC3 These are AWS-managed add-ons. How do we

  • configure individual pods to have access to either IMDS or the Kubernetes API?

From the node level, reaching the Kubernetes API (the address shown in the logs below) works fine.

We hit this issue after moving from a RHEL 7 AMI to a RHEL 9 AMI, and only for the EBS add-on and CoreDNS. If we revert to the older RHEL 7 AMI, everything works again.

  • Are any extra iptables rules specifically needed here?

EBS addon logs:

I0618 15:14:03.963302 1 main.go:135] Version: v2.10.1
I0618 15:14:03.963388 1 main.go:136] Running node-driver-registrar in mode=
I0618 15:14:03.963401 1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0618 15:14:03.967563 1 main.go:164] Calling CSI driver to discover driver name
I0618 15:14:03.980756 1 main.go:173] CSI driver name: "ebs.csi.aws.com"
I0618 15:14:03.980829 1 node_register.go:55] Starting Registration Server at: /registration/ebs.csi.aws.com-reg.sock
I0618 15:14:03.982694 1 node_register.go:64] Registration Server started at: /registration/ebs.csi.aws.com-reg.sock
I0618 15:14:03.982909 1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0618 15:14:04.539319 1 main.go:90] Received GetInfo call: &InfoRequest{}
I0618 15:14:04.572753 1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I0618 15:14:06.731893 1 main.go:133] "Calling CSI driver to discover driver name"
I0618 15:14:06.737665 1 main.go:141] "CSI driver name" driver="ebs.csi.aws.com"
I0618 15:14:06.737702 1 main.go:170] "ServeMux listening" address="0.0.0.0:9808"
I0618 15:20:36.766742 1 ec2.go:40] "Retrieving EC2 instance identity metadata" regionFromSession=""
I0618 15:20:36.766905 1 metadata.go:52] "failed to retrieve instance data from ec2 metadata; retrieving instance data from kubernetes api" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, request canceled, context deadline exceeded"
E0618 15:21:06.770791 1 main.go:154] "Could not determine region from any metadata service. The region can be manually supplied via the AWS_REGION environment variable." err="error getting instance data from ec2 metadata or kubernetes api"
panic: error getting instance data from ec2 metadata or kubernetes api

goroutine 1 [running]:
main.main()
/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:155 +0xfb9


coredns logs:

[INFO] plugin/reload: Running configuration SHA512 = 8a7d59126e7f114ab49c6d2613be93d8ef7d408af8ee61a710210843dc409f03133727e38f64469d9bb180f396c84ebf48a42bde3b3769730865ca9df5eb281c
CoreDNS-1.9.3
linux/amd64, go1.20.4, c9dedfbf
[ERROR] plugin/errors: 2 3893335740271709553.8833067999452524873. HINFO: read udp 10.162.210.216:35664->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3893335740271709553.8833067999452524873. HINFO: read udp 10.162.210.216:60397->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3893335740271709553.8833067999452524873. HINFO: read udp 10.162.210.216:54032->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3893335740271709553.8833067999452524873. HINFO: read udp 10.162.210.216:42794->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3893335740271709553.8833067999452524873. HINFO: read udp 10.162.210.216:41396->10.162.128.2:53: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = 8a7d59126e7f114ab49c6d2613be93d8ef7d408af8ee61a710210843dc409f03133727e38f64469d9bb180f396c84ebf48a42bde3b3769730865ca9df5eb281c
CoreDNS-1.9.3
linux/amd64, go1.20.4, c9dedfbf
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:38294->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:59287->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:46973->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:34990->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:49502->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:54621->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:54628->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:50492->10.162.128.2:53: i/o timeout
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.20.0.1:443/version": dial tcp 172.20.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:59475->10.162.128.2:53: i/o timeout
[ERROR] plugin/errors: 2 3368869207192152190.3618717134880642323. HINFO: read udp 10.162.210.216:55621->10.162.128.2:53: i/o timeout
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.20.0.1:443/version": dial tcp 172.20.0.1:443: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s

Your logs likely indicate a networking issue. I would check whether your pod networking (CNI plugin) is working.
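
One quick way to see this in the CoreDNS logs above is to count the `i/o timeout` errors per destination. The following is a minimal sketch; `timeout_destinations` is a hypothetical helper, and the regular expression only covers the two line shapes that appear in the logs posted in this thread.

```python
import re
from collections import Counter
from typing import Iterable

# Matches "->10.162.128.2:53: i/o timeout" and
# "dial tcp 172.20.0.1:443: i/o timeout".
_TIMEOUT_RE = re.compile(r"(?:->|dial tcp )(\d+\.\d+\.\d+\.\d+:\d+): i/o timeout")


def timeout_destinations(lines: Iterable[str]) -> Counter:
    """Count i/o timeouts per destination address found in CoreDNS log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = _TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the CoreDNS logs above.
    sample = [
        "[ERROR] plugin/errors: 2 3893335740271709553.8833067999452524873. "
        "HINFO: read udp 10.162.210.216:35664->10.162.128.2:53: i/o timeout",
        '[WARNING] plugin/kubernetes: Kubernetes API connection failure: '
        'Get "https://172.20.0.1:443/version": dial tcp 172.20.0.1:443: i/o timeout',
    ]
    for destination, count in timeout_destinations(sample).items():
        print(destination, count)
```

In the posted logs, both the upstream DNS server (10.162.128.2:53) and the Kubernetes API Service address (172.20.0.1:443) time out from the pod, which suggests pod traffic in general is being dropped rather than anything specific to the EBS CSI Driver.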

@here

Since I need the cluster to be ready soon, I switched to using the Bottlerocket image, which also has the CIS Bottlerocket Benchmark Level 1 out of the box.

I will let you guys decide whether to close this issue or keep it open for discussion. Thanks.

/close

Because this does not appear to be a bug in the driver itself, and is rather an issue with the CIS image, I'm going to close this issue out. Please reopen this issue or create a new issue if further support is needed.

@ConnorJC3: Closing this issue.
