amazon-cloudwatch-observability fails with open /root/.aws/credentials ignoring the IRSA credentials

Question

amazon-cloudwatch-observability fails with open /root/.aws/credentials ignoring the IRSA credentials

ecerulm opened this issue 7 months ago · 13 comments

Describe the bug
When IDMSv2 is enabled in the worker nodes with hop limit 1 , the IMDSv2 is not accessible from the pods. In general, I don't want pods to access IMDS since they can get credentials for the node IAM role.

When IDMSv2 is no accessible it seems that the cloudagent (i'm using amazon-cloudwatch-observability eks addon) , tries to use credentials from the non existing file /root/.aws/credentials instead of using the credentials from IRSA. The pod uses a service account with IRSA annotation and it was the environment variables AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE (injected by IRSA).

But I believe the amazon-cloudwatch-agent is ignoring the IRSA credentials (I suspect it it's because IMDSv2 is not available , and the it decides it "onprem")

I see on startup of the pod

D! [EC2] Found active network interface
I! imds retry client will retry 1 timesD! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)D! could not get hostname without imds v1 fallback enable thus enable fallback
E! [EC2] Fetch hostname from EC2 metadata fail: EC2MetadataError: failed to make EC2Metadata request
 status code: 401, request id: 
D! should retry true for imds error : RequestError: send request failed

...

I! Detected the instance is OnPremise

Steps to reproduce
If possible, provide a recipe for reproducing the error.

EKS 1.29
EKS nodes 1.29 bottlerocket
with IMDSv2 (http tokens required, hop limit 1)
amazon-cloudwatch-observability eks addon v1.4.0-eksbuild.1 (default config)

What did you expect to see?

I expect to allow me to override "OnPrem" / "EC2" from the eks addon configuration , I don't see that as possibility in the amazon-cloudwatch-observability addon,

 aws eks describe-addon-configuration --addon-name amazon-cloudwatch-observability --addon-version "v1.4.0-eksbuild.1" --query configurationSchema | jq '.|fromjson'

What did you see instead?

I see that it detects "OnPremise" and I believe that in turn forces it to use /root/.aws/credentials when in fact it should be using the credentials from IRSA via the existing environment variables $AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.

What version did you use?
Version: (e.g., v1.247350.0, etc)

I'm using amazon-cloudwatch-observability eks addon version v1.4.0-eksbuild.1, I don't know which version of the amazon-cloudwatch-agent is indluded with that

What config did you use?
Config: (e.g. the agent json config file)

Environment
EKS 1.29
EKS nodes 1.29 bottlerocket
with IMDSv2 (http tokens required, hop limit 1)
amazon-cloudwatch-observability eks addon v1.4.0-eksbuild.1, (default config)

Additional context
Add any other context about the problem here.

Answer 1 · 2024-03-25T11:42:26.000Z

facing the same while runing cloudwatch agent as a daemonset.

D! [EC2] Found active network interface
I! imds retry client will retry 1 timesD! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)D! could not get hostname without imds v1 fallback enable thus enable fallback
E! [EC2] Fetch hostname from EC2 metadata fail: EC2MetadataError: failed to make EC2Metadata request

	status code: 401, request id: 
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)D! could not get instance document without imds v1 fallback enable thus enable fallback

Answer 2 · 2024-03-25T13:55:33.000Z

I uninstalled the amazon-cloudwatch-observability eks add-on and installed using the the instructions at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-metrics.html
and I'm getting the same result.

But I can set RUN_WITH_IRSA="True" (I actually have a service account with IRSA annotation) in the DaemonSet and that makes it detect

I! Detected from ENV RUN_WITH_IRSA is True

The RUN_WITH_IRSA environment variable does not seem to be documented but it's in the source code and it works the values needs to be True (not true or 1).

Answer 3 · 2024-03-25T15:53:46.000Z

Although, using RUN_WITH_IRSA allows the DaemonSet to run and some metrics are sent to CloudWatch . I can see that it still tries to use the IMDS to get the EC2 metadata (instance id, image id, and instance type) I guess it's not possible to get those in any other way currently

The CloudWatch Container Insights still lacks the "Top 10 Nodes by CPU Utilization" , etc. I guess all metrics that have NodeName as dimension are missing.

I guess this means that amazon-cloudwatch-agent really needs IMDS , and maybe it should documented so. You can't have it without it, can you?


2024-03-25T14:14:10Z I! {"caller":"host/ec2metadata.go:78","msg":"Fetch instance id and type from ec2 metadata","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics"}
2024-03-25T14:14:11.425Z	DEBUG	aws@v0.0.0-20231208183748-c00ca1f62c3e/imdsretryer.go:45	imds error : 	{"shouldRetry": true, "error": "RequestError: send request failed\ncaused by: Put \"http://169.254.169.254/latest/api/token\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}

Answer 4 · 2024-03-26T09:07:52.000Z

Restrict the use of host networking and block access to instance metadata service
sort of recommends blocking IMDS from the pods:

While these privileges are required for the node to operate effectively, it is not usually desirable that the pods running on the node inherit these privileges.

But IMDS access seems like hard requirement for using cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

The alternatives are

Increase IMDSv2 hop limit to 2 (that allows all pods to get access to IMDS and use the node IAM role)
Increase IMDSv2 hop limit and add k8s NetworkPolicy to each Namespace to block access to 169.254.0.0/16, this is cumbersome since you need to add it to all namespaces (there is no GlobalNetworkPolicy in vanilla Kubernetes) and there is no explicity deny either.
Keep IMDSV2 hop limit 1 and run cloudwatch agent daemonset with pod.spec.hostNetwork = true
- you need to enforce that untrusted pods do not use host networking, with admission controllers

Answer 5 · 2024-03-29T16:55:13.000Z

Hi @ecerulm, we are aware of the current IMDS requirement and are tracking an alternative for it when IMDS is unavailable internally.

Answer 6 · 2024-04-10T18:44:00.000Z

But IMDS access seems like hard requirement for using cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA:

$ kubectl -n amazon-cloudwatch edit amazoncloudwatchagents.cloudwatch.aws.amazon.com

Apply this change:

 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+  - name: RUN_WITH_IRSA
+    value: true  
   - name: K8S_NODE_NAME
     valueFrom:
       fieldRef:
         fieldPath: spec.nodeName

Answer 7 · 2024-04-10T19:28:30.000Z

@AaronFriel

Like I commented at #1101 (comment) even with RUN_WITH_IRSA it still goes to IMDS to obtain the instance id, etc

Although, using RUN_WITH_IRSA allows the DaemonSet to run and some metrics are sent to CloudWatch . I can see that it still tries to use the IMDS to get the EC2 metadata (instance id, image id, and instance type)

The instance id, etc are needed metrics for “ kubernetes container insights with enhanced observability (enhanced_container_insights)” and since they can’t be obtained those metric are not sent.

I don’t think there is anyway to pass the instance id, etc by any other means today but @jefchien seems to be indicating that there may be working in some alternative.

Answer 8 · 2024-06-11T14:58:37.000Z

Any ETA on this fix?

Answer 9 · 2024-07-11T09:52:42.000Z

I don’t think there is anyway to pass the instance id, etc by any other means today but @jefchien seems to be indicating that there may be working in some alternative.

Maybe the agent can grab the instance ID from the spec.providerID field on the Node object which it can be fetched from the kube-api?

Answer 10 · 2024-07-23T11:19:48.000Z

But IMDS access seems like hard requirement for using cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA:

$ kubectl -n amazon-cloudwatch edit amazoncloudwatchagents.cloudwatch.aws.amazon.com

Apply this change:

 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+  - name: RUN_WITH_IRSA
+    value: true  
   - name: K8S_NODE_NAME
     valueFrom:
       fieldRef:
         fieldPath: spec.nodeName

With this change applied to the agent, the issue still persists.

As mentioned in another comment, I created a custom launch template with an increased number of max hops, which solved the issue. I do understand, however, that this may be a security concern and should be avoided, but, as a temporary measure until the addon is fixed, it is acceptable for our use case.

Answer 11 · 2024-07-30T14:28:36.000Z

But IMDS access seems like hard requirement for using cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA:
$ kubectl -n amazon-cloudwatch edit amazoncloudwatchagents.cloudwatch.aws.amazon.com 
Apply this change:
 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+  - name: RUN_WITH_IRSA
+    value: true  
   - name: K8S_NODE_NAME
     valueFrom:
       fieldRef:
         fieldPath: spec.nodeName
With this change applied to the agent, the issue still persists.

As mentioned in another comment, I created a custom launch template with an increased number of max hops, which solved the issue. I do understand, however, that this may be a security concern and should be avoided, but, as a temporary measure until the addon is fixed, it is acceptable for our use case.

The workroud is valid. Need to write the "True" value starting with uppercase.

Answer 12 · 2024-07-31T06:56:24.000Z

I used this helm chart to deploy the add-on:
https://github.com/aws-observability/helm-charts

Modifying amazon-cloudwatch-observability/templates/linux/cloudwatch-agent-daemonset.yaml like below solved the issue.

apiVersion: cloudwatch.aws.amazon.com/v1alpha1
kind: AmazonCloudWatchAgent
metadata:
  name: {{ template "cloudwatch-agent.name" . }}
  namespace: {{ .Release.Namespace }}
spec:
+ hostNetwork: true
  image: {{ template "cloudwatch-agent.image" . }}
  mode: daemonset
  ...
  env:
+ - name: RUN_WITH_IRSA
+   value: "True"
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  ...

Although this is not required, I configured Gatekeeper to restrict host network access exclusive to CloudWatch Agent pods for enhanced security.

contraintTemplate.yaml:

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sallowedhostnetworking
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedHostNetworking
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedhostnetworking

        default allow = false

        allow {
          input.review.object.metadata.labels["app.kubernetes.io/name"] == "cloudwatch-agent"
        }

        violation[{"msg": msg}] {
          not allow
          input.review.object.spec.hostNetwork == true
          msg := "Host network is not allowed for this pod"
        }

constraint.yaml:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedHostNetworking
metadata:
  name: allowed-host-networking
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

Answer 13 · 2024-10-17T13:33:44.000Z

Thank you @kwangjong

The official EKS addon with a 1.30 cluster does not work (it produces the local credentials file problem). Your solution based upon https://github.com/aws-observability/helm-charts has worked for me.