terraform-ibm-modules/terraform-ibm-base-ocp-vpc

Error: cannot list resource "pods" in API group

vburckhardt opened this issue · 3 comments

Affected modules

A consumer is reporting the following:

2024/02/23 16:05:59 Terraform apply | module.roks-cluster.null_resource.confirm_network_healthy[0]: Creating...
2024/02/23 16:05:59 Terraform apply | 2024-02-23T16:05:59.963Z [INFO] Starting apply for module.roks-cluster.null_resource.confirm_network_healthy[0]
2024/02/23 16:05:59 Terraform apply | 2024-02-23T16:05:59.963Z [DEBUG] module.roks-cluster.null_resource.confirm_network_healthy[0]: applying the planned Create change
2024/02/23 16:05:59 Terraform apply | module.roks-cluster.null_resource.confirm_network_healthy[0]: Provisioning with 'local-exec'...
2024/02/23 16:05:59 Terraform apply | module.roks-cluster.null_resource.confirm_network_healthy[0] (local-exec): Executing: ["/bin/bash" "-c" ".terraform/modules/roks-cluster/scripts/confirm_network_healthy.sh"]
2024/02/23 16:05:59 Terraform apply | module.roks-cluster.null_resource.confirm_network_healthy[0] (local-exec): Running script to ensure kube master can communicate with all worker nodes..
2024/02/23 16:06:00 Terraform apply | module.roks-cluster.null_resource.confirm_network_healthy[0] (local-exec): Error from server (Forbidden): pods is forbidden: User "IAM#serviceid-XYZ" cannot list resource "pods" in API group "" in the namespace "calico-system"
2024/02/23 16:06:00 Terraform apply | module.roks-cluster.null_resource.confirm_network_healthy[0] (local-exec): Success! Master can communicate with all worker nodes.
2024/02/23 16:06:00 Terraform apply | module.roks-cluster.null_resource.confirm_network_healthy[0]: Creation complete after 0s [id=]

I suspect this is happening at

while IFS='' read -r line; do PODS+=("$line"); done < <(kubectl get pods -n "${namespace}" | grep calico-node | cut -f1 -d ' ')

The consumer states that the service ID has the necessary permissions. Assuming this is correct, the issue could potentially be caused by a delay in RBAC sync - it would be good to double-check whether this line appears in the CI logs.

The other issue is that the check is effectively skipped when kubectl get pods returns an error, yet the script still prints the success message.
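
For illustration, a minimal sketch of how the script could fail instead of falling through to the success message - this is not the module's actual code, and it reuses the namespace variable and PODS array from the snippet above:

if ! kubectl get pods -n "${namespace}" > /dev/null 2>&1; then
  # kubectl itself failed (e.g. Forbidden), so the check cannot be performed - do not report success.
  echo "Error: unable to list pods in namespace ${namespace}" >&2
  exit 1
fi
if [ "${#PODS[@]}" -eq 0 ]; then
  # Nothing was collected into PODS, so there is nothing to verify.
  echo "Error: no calico-node pods found in namespace ${namespace}" >&2
  exit 1
fi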

Terraform CLI and Terraform provider versions

  • Terraform version:
  • Provider version:

Terraform output

Debug output

Expected behavior

Actual behavior

Steps to reproduce (including links and screen captures)

  1. Run terraform apply

Anything else



Setting admin = true on the

data "ibm_container_cluster_config" "cluster_config" {

block works around the issue. There is no RBAC sync involved in this scenario, as it impersonates the default OpenShift system:admin user. This workaround is relatively safe given that the data block is typically invoked by the same identity that created the cluster. The only edge case may be when different identities are used with -target across applies.
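
For reference, a minimal sketch of the workaround; the cluster_name_id and resource_group_id references below are placeholders rather than the module's actual wiring:

# Pull the admin kubeconfig so the kubectl checks run as the default
# OpenShift system:admin user instead of the IAM service ID.
# The cluster_name_id and resource_group_id values are placeholders.
data "ibm_container_cluster_config" "cluster_config" {
  cluster_name_id   = ibm_container_vpc_cluster.cluster.id
  resource_group_id = var.resource_group_id
  admin             = true
}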

I don't think we should use the admin flag - I think there may be audit concerns with that. Instead, I think the fix is to add a retry with a sleep around the first kubectl command, as RBAC sync is usually ready within a matter of seconds.
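
A minimal sketch of that approach, assuming the namespace variable from confirm_network_healthy.sh; the retry count and sleep interval are arbitrary:

# Retry the pod listing for a short period to allow RBAC sync to complete,
# and fail the script if kubectl still cannot list pods afterwards.
attempts=0
until kubectl get pods -n "${namespace}" > /dev/null 2>&1; do
  attempts=$((attempts + 1))
  if [ "${attempts}" -ge 10 ]; then
    echo "Error: still unable to list pods in namespace ${namespace} after ${attempts} attempts" >&2
    exit 1
  fi
  echo "kubectl get pods failed (possibly RBAC sync delay), retrying in 5 seconds..."
  sleep 5
done
while IFS='' read -r line; do PODS+=("$line"); done < <(kubectl get pods -n "${namespace}" | grep calico-node | cut -f1 -d ' ')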

The consumer is already using the admin flag - I think we should do the same. The admin config is only pulled after authenticating to the ibmcloud CLI with an identity, so it should be possible to correlate actions back to that identity if needed. If this becomes an issue from an audit perspective, some coordination would be needed around disabling that admin identity through the ROKS stack, tracking it via SCC scans, etc.