aws/amazon-cloudwatch-agent

IRSA forwarding of metrics to other AWS accounts does not work in the latest version of the CloudWatch agent

ellen-lau opened this issue · 1 comment

Describe the bug

A couple of months ago I was successfully using IRSA on ROSA to forward metrics from one AWS account's CloudWatch (the account associated with the ROSA cluster running my application pods) to a secondary AWS account's CloudWatch. However, after restarting the agent pod around a month ago (which pulls amazon/cloudwatch-agent:latest), the metrics from my application pods are no longer forwarded to the secondary account's CloudWatch, even though the IAM roles for cross-account forwarding are still set up correctly -- they are only sent to CloudWatch in the AWS account associated with the ROSA cluster.
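
For reference, cross-account delivery with the agent's credentials role_arn generally relies on the secondary account's role trusting the IRSA role in the cluster's account. A minimal sketch of that trust policy on the secondary account's role, using the same placeholders as the config below:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<primary_aws_account_id>:role/<role_name>"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }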

Reverting to version v1.247360.0 or image amazon/cloudwatch-agent:1.247360.0b252689 resolved the issue.
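
As a temporary workaround, pinning the deployment to that tag avoids picking up newer builds on restart; a minimal sketch of the container spec change (same container name as in the deployment below):

      containers:
        - name: cloudwatch-agent
          # Pin to the last known-good tag instead of :latest so restarts
          # don't silently pull a newer build; pull policy is optional here.
          image: amazon/cloudwatch-agent:1.247360.0b252689
          imagePullPolicy: IfNotPresent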

What did you expect to see?

I expected to see the metrics from my application pods forwarded to my secondary AWS account's CloudWatch.

What did you see instead?

I did not see any cross-account forwarding; the metrics are only sent to CloudWatch in the AWS account associated with the ROSA cluster.

What version did you use?

I see the issue with amazon/cloudwatch-agent:latest, but not with the image amazon/cloudwatch-agent:1.247360.0b252689.

What config did you use?

# create configmap for prometheus cwagent config
kind: ConfigMap
metadata:
  name: prometheus-cwagentconfig
  namespace: <namespace>
apiVersion: v1
data:
  # cwagent json config
  cwagentconfig.json: |
    {
      "agent": {
        "region": "us-east-1",
        "debug": true,
        "credentials": {
          "role_arn": "arn:aws:iam::<secondary_aws_account_id>:role/<role_name>"
        } 
      },
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "cluster_name": "<namespace>",
            "log_group_name": "/aws/containerinsights/<namespace>/prometheus",
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration": [
                {"source_labels": ["job"],
                  "label_matcher": "^<namespace>-scrape-job$",
                  "dimensions": [["Namespace","job","pod_name"]],
                    "metric_selectors": [
                      <metric_selectors>
                    ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }


---
# create configmap for prometheus scrape config
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: <namespace>
apiVersion: v1
data:
  # prometheus config
  prometheus.yaml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 10s
    scrape_configs:
      - job_name: '<namespace>-scrape-job'
        metrics_path: /metrics
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
              - <namespace>
        tls_config:
          insecure_skip_verify: true
        relabel_configs:
        - source_labels: [__address__]
          action: replace
          target_label: address
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - action: replace
          source_labels:
          - __meta_kubernetes_namespace
          target_label: Namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: pod_name
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_container_name
          target_label: container_name
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_controller_name
          target_label: pod_controller_name
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_container_port_name
          target_label: port_name
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_container_port_number
          target_label: port_number

        
---
# create cwagent service account and role binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cwagent-prometheus
  namespace: <namespace>
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::<primary_aws_account_id>:role/<role_name>"

---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cwagent-prometheus-role-binding
subjects:
  - kind: ServiceAccount
    name: cwagent-prometheus
    namespace: <namespace>
roleRef:
  kind: ClusterRole
  name: cwagent-prometheus-role
  apiGroup: rbac.authorization.k8s.io

---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cwagent-prometheus
  namespace: <namespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cwagent-prometheus
  template:
    metadata:
      labels:
        app: cwagent-prometheus
    spec:
      containers:
        - name: cloudwatch-agent
          image: amazon/cloudwatch-agent:latest
          imagePullPolicy: Always
          resources:
            limits:
              cpu:  1000m
              memory: 1000Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change the envs below
          env:
            - name: CI_VERSION
              value: "k8s/1.3.8"
            - name: RUN_WITH_IRSA
              value: "True"
          # Please don't change the mountPath
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheusconfig
            - name: prometheus-cwagentconfig
              mountPath: /etc/cwagentconfig
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-cwagentconfig
          configMap:
            name: prometheus-cwagentconfig
      terminationGracePeriodSeconds: 60
      serviceAccountName: cwagent-prometheus
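
For completeness, the IRSA role referenced by the ServiceAccount annotation above is also expected to have permission to assume the secondary account's role, in addition to the trust policy on that role. A minimal sketch of that policy statement, using the same placeholders as the config:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "sts:AssumeRole",
          "Resource": "arn:aws:iam::<secondary_aws_account_id>:role/<role_name>"
        }
      ]
    }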

Thank you for bringing this issue to our attention.

We found that the root cause is that the EMF exporter translator was missing a statement to pass the RoleARN setting from the agent configuration through to the exporter. @SaxyPandaBear has linked the PR that addresses this issue, and you can track that PR for progress.