kubernetes/cloud-provider-aws

Talos OS v1.5.5: AWS CCM can't find instances via the API, so it can't configure nodes in a peered region

Rammurthy5 opened this issue · 5 comments

I have a Talos OS v1.5.5 cluster with KubeSpan enabled and the AWS CCM installed.

What happened:
CCM should configure all the worker nodes in the cluster, but it does not when KubeSpan is enabled and nodes from a peered region are present.

What you expected to happen:
CCM should configure all the worker nodes as long as they are part of a single cluster and reachable.

How to reproduce it (as minimally and precisely as possible):
Launch a Talos OS cluster following the official documentation, and add the options to enable externalLoadBalancer.

Anything else we need to know?:

```
E0122 13:23:49.763321 1 node_controller.go:236] error syncing 'ip-xxxx.region.compute.internal': failed to get provider ID for node ip-xxxx.region.compute.internal at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0122 13:23:49.780400 1 node_controller.go:427] Initializing node ip-xxxx.eu-west-1.compute.internal with cloud provider
```

**Environment**:
- Kubernetes version (use `kubectl version`): 1.29
- Cloud provider or hardware configuration: aws
- OS (e.g. from /etc/os-release): talos 1.5.5
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:

/kind bug

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

> peer regional nodes are present.

Do you mean the nodes and CCM are in different regions?

Control plane and worker nodes are in two different regions, which are VPC-peered and connected via KubeSpan.

So, this particular failure is caused by CCM calling ec2:DescribeInstances in the region it's running in, for an instance that lives in another region. I'd expect you to see more papercuts with this setup, because CCM assumes in many places that the AWS resources are in a single region. I don't think removing that assumption would be simple.
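The core of the failure can be sketched without AWS at all: EC2 API calls are region-scoped, so a lookup against the wrong regional endpoint simply reports the instance as not found. A minimal illustration (hypothetical instance IDs and hostnames, dicts standing in for the regional APIs):

```python
# Hypothetical data: each EC2 regional endpoint only knows its own instances.
REGIONS = {
    "us-east-1": {"i-aaa": "ip-10-0-0-1.ec2.internal"},
    "eu-west-1": {"i-bbb": "ip-10-1-0-1.eu-west-1.compute.internal"},
}

def describe_instance(region: str, instance_id: str):
    """Stand-in for a region-scoped DescribeInstances call."""
    return REGIONS.get(region, {}).get(instance_id)

# A CCM running in us-east-1 resolves its local instance fine...
assert describe_instance("us-east-1", "i-aaa") is not None
# ...but the eu-west-1 node looks like "instance not found" to it,
# which is exactly the error in the node_controller log above.
assert describe_instance("us-east-1", "i-bbb") is None
```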

You could potentially run an instance of CCM in each region as a workaround.
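A per-region setup could look roughly like the sketch below: one CCM Deployment per region, each scheduled onto nodes in that region and scoped to it via `AWS_REGION`. This is an untested illustration, not a supported manifest; the name, labels, and image tag are assumptions to adapt to your actual install.

```yaml
# Hypothetical sketch: one aws-cloud-controller-manager per region,
# pinned to that region's nodes and scoped via AWS_REGION.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-cloud-controller-manager-eu-west-1   # assumed name
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: aws-cloud-controller-manager
      region: eu-west-1
  template:
    metadata:
      labels:
        k8s-app: aws-cloud-controller-manager
        region: eu-west-1
    spec:
      nodeSelector:
        topology.kubernetes.io/region: eu-west-1  # run in-region
      containers:
        - name: aws-cloud-controller-manager
          image: registry.k8s.io/provider-aws/cloud-controller-manager:v1.29.0  # assumed tag
          env:
            - name: AWS_REGION
              value: eu-west-1
```

Note that each CCM instance would still only initialize the nodes it can resolve in its own region, so you would likely also need to scope which nodes each instance manages.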

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale