aws cloud controller manager is unable to manage the nodes in cluster
karty-s opened this issue · 9 comments
What happened: We are running a Kubernetes 1.26 cluster built with kubeadm on AWS resources. We want to upgrade our clusters to 1.28 (1.26->1.27->1.28), and as per the upgrade notes we tried to move from the in-tree AWS cloud provider to the external AWS cloud provider.
As per the upgrade process, we deployed the new 1.27 nodes along with the AWS cloud controller manager in the cluster, after which we scaled down the 1.26 nodes.
What you expected to happen: The issue we face is that the scaled-down 1.26 etcd and worker nodes get removed from the cluster, but the 1.26 control plane nodes still show up in the cluster even after their EC2 instances are removed. For example:
NAME               STATUS                     ROLES                  AGE     VERSION
ip-.ec2.internal   Ready,SchedulingDisabled   control-plane,master   96m     v1.26.7
ip-.ec2.internal   Ready                      etcd                   11m     v1.27.13
ip-.ec2.internal   Ready                      etcd                   9m10s   v1.27.13
ip-.ec2.internal   Ready                      control-plane,master   5m59s   v1.27.13
ip-.ec2.internal   Ready,SchedulingDisabled   control-plane,master   95m     v1.26.7
ip-.ec2.internal   Ready                      node                   6m12s   v1.27.13
ip-.ec2.internal   Ready                      etcd                   14m     v1.27.13
ip-.ec2.internal   Ready                      control-plane,master   6m1s    v1.27.13
ip-.ec2.internal   Ready                      node                   6m9s    v1.27.13
ip-.ec2.internal   Ready                      node                   6m14s   v1.27.13
ip-.ec2.internal   Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal   Ready,SchedulingDisabled   control-plane,master   96m     v1.26.7
ip-.ec2.internal   Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal   Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal   Ready                      control-plane,master   5m43s   v1.27.13
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
We are seeing this error in the cloud controller manager pod logs:
I0516 08:13:24.811572 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-10-230-13-35.ec2.internal
I0516 08:13:24.812083 1 event.go:307] "Event occurred" object="ip-10-230-13-35.ec2.internal" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node ip-10-230-13-35.ec2.internal because it does not exist in the cloud provider"
We have set the hostname according to the prerequisites, but we still get this.
Environment: kubeadm
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7", GitCommit:"84e1fc493a47446df2e155e70fca768d2653a398", GitTreeState:"clean", BuildDate:"2023-07-19T12:23:27Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
- Cloud provider or hardware configuration: aws
- OS (e.g. from /etc/os-release):
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3374.2.4
VERSION_ID=3374.2.4
BUILD_ID=2023-02-15-1824
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)"
- Kernel (e.g. uname -a):
- Install tools:
- Others:
/kind bug
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
This:
I0516 08:13:24.811572 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-10-230-13-35.ec2.internal
isn't an error; it's expected behavior when a Node becomes NotReady and the corresponding EC2 instance is terminated (or doesn't exist). Are you sure the EC2 instances for your old 1.26 control plane nodes have been terminated? They wouldn't have a Ready status if the kubelet stopped heartbeating.
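As a rough illustration of the behavior described in this comment, the decision can be sketched as follows. This is a simplified model, not the actual cloud-provider code: the `Node` class, its fields, and `should_delete` are hypothetical names, and `instance_exists` stands in for a lookup against the EC2 API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ready: bool            # True while the kubelet keeps reporting Ready
    instance_exists: bool  # hypothetical stand-in for an EC2 lookup


def should_delete(node: Node) -> bool:
    """Simplified model: only a node that is not Ready is checked
    against the cloud, and it is deleted only if its instance is gone."""
    if node.ready:
        # A Ready node is left alone, whatever the cloud lookup would say.
        return False
    return not node.instance_exists


if __name__ == "__main__":
    nodes = [
        Node("old-control-plane", ready=True, instance_exists=False),
        Node("old-worker", ready=False, instance_exists=False),
        Node("rebooting-worker", ready=False, instance_exists=True),
    ]
    for n in nodes:
        print(n.name, "delete" if should_delete(n) else "keep")
```

Under this model, a node that still shows Ready is never removed, which is why the Ready status of the old control plane nodes is worth double-checking.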
@cartermckinnon We followed the steps below on the existing 1.26 cluster to prepare it for the 1.27 upgrade.
On the existing version 1.26:
- Added the tag kubernetes.io/cluster/cluster-name: owned to each node.
- Ran kubectl edit cm kubeadm-config -n kube-system to set cloud-provider=external.
- Updated the existing master kube-controller-manager and kube-apiserver manifests to use cloud-provider=external.
- Got the aws-cloud-controller-manager running.
Now, when upgrading the cluster to 1.27, we are facing the following issues:
- The ProviderID is not displayed when a new node joins the cluster.
- When the ASG deletes old nodes, especially old control-plane nodes, the node controller does not delete the terminated nodes.
Please let us know which step we are missing and what the correct method is to move to the out-of-tree AWS CCM for the Kubernetes upgrade to 1.27.
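For the tagging step, a minimal sketch of a check that an instance carries the cluster tag might look like this. The function names, the `my-cluster` name, and the sample tag dictionary are hypothetical; accepting `shared` alongside `owned` is an assumption based on the usual AWS cloud provider tagging convention and is not something stated in this thread.

```python
def cluster_tag_key(cluster_name: str) -> str:
    # Tag key in the form used in the steps above.
    return f"kubernetes.io/cluster/{cluster_name}"


def has_cluster_tag(tags: dict, cluster_name: str) -> bool:
    """Return True if the instance tags mark it as belonging to the cluster.

    "owned" is the value used in this thread; "shared" is assumed to be
    accepted as well.
    """
    return tags.get(cluster_tag_key(cluster_name)) in ("owned", "shared")


if __name__ == "__main__":
    # Made-up tags for illustration only.
    tags = {"Name": "control-plane-1", "kubernetes.io/cluster/my-cluster": "owned"}
    print(has_cluster_tag(tags, "my-cluster"))
    print(has_cluster_tag(tags, "other-cluster"))
```

In practice the tags would come from the EC2 API or the console rather than a hard-coded dictionary.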
Are you passing --cloud-provider=external to kubelet as well?
CCM should fill in the provider ID if it's missing, but it's generally preferable to just pass it to kubelet to avoid extra API calls in CCM. The EKS AMI uses this helper script to set it: https://github.com/awslabs/amazon-eks-ami/blob/f5111dd100ebd94d9fbfbb1fe2f43b75fd1a6703/templates/al2/runtime/bin/provider-id
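To make the suggestion concrete, here is a minimal sketch of the kubelet arguments it implies, assuming the AWS provider ID format `aws:///<availability-zone>/<instance-id>`. The function names and the sample zone and instance ID are hypothetical; on a real node both values would be read from instance metadata, as the linked helper script does.

```python
def aws_provider_id(availability_zone: str, instance_id: str) -> str:
    # Assumed format: aws:///<availability-zone>/<instance-id>
    return f"aws:///{availability_zone}/{instance_id}"


def kubelet_provider_args(availability_zone: str, instance_id: str) -> list:
    """Build the two kubelet flags discussed above for an external provider."""
    return [
        "--cloud-provider=external",
        f"--provider-id={aws_provider_id(availability_zone, instance_id)}",
    ]


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(" ".join(kubelet_provider_args("us-east-1a", "i-0123456789abcdef0")))
```

The resulting flags could be placed wherever the kubelet's extra arguments are configured, for example in a systemd drop-in.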
@cartermckinnon Let me share the 10-kubeadm.conf and kubeadm-config that we currently use on 1.26, where the in-tree provider is still in use:
10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
Environment="KUBELET_EXTRA_ARGS=--cloud-provider=aws --node-labels=node.kubernetes.io/role=${kind},instance-group=${group_name},${extra_labels} --register-with-taints=${taints} --cert-dir=/etc/kubernetes/pki --cgroup-driver=systemd"
# Environment="KUBELET_KUBEADM_ARGS=--feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true --rotate-certificates"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/opt/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
kubeadm-config
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  certSANs:
    - "api-int.${cluster_fqdn}"
    - "api.${cluster_fqdn}"
  extraArgs:
    anonymous-auth: "true"
    audit-log-maxage: "7"
    audit-log-maxbackup: "50"
    audit-log-maxsize: "100"
    audit-log-path: /var/log/kube-apiserver-audit.log
    audit-policy-file: /etc/kubernetes/files/audit-log-policy.yaml
    authorization-mode: Node,RBAC
    cloud-provider: aws
    max-mutating-requests-inflight: "400"
    max-requests-inflight: "800"
    oidc-client-id: "${dex_oidc_client_id}"
    oidc-groups-claim: "${dex_oidc_groups_claim}"
    oidc-issuer-url: "${dex_oidc_issuer_url}"
    oidc-username-claim: "${dex_oidc_username_claim}"
    profiling: "false"
    request-timeout: 30m0s
    service-account-lookup: "true"
    tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
  extraVolumes:
    - hostPath: /etc/kubernetes/files
      mountPath: /etc/kubernetes/files
      name: cloud-config
      readOnly: true
    - hostPath: /var/log
      mountPath: /var/log
      name: var-log
      readOnly: false
  timeoutForControlPlane: 10m0s
certificatesDir: /etc/kubernetes/pki
clusterName: "${cluster_fqdn}"
controlPlaneEndpoint: "${api_endpoint}:${api_port}"
controllerManager:
  extraArgs:
    cluster-signing-cert-file: /etc/kubernetes/pki/ca.crt
    cluster-signing-key-file: /etc/kubernetes/pki/ca.key
    feature-gates: RotateKubeletServerCertificate=true
    profiling: "false"
    terminated-pod-gc-threshold: "12500"
    configure-cloud-routes: "false"
    cluster-name: "${cluster_fqdn}"
    attach-detach-reconcile-sync-period: "1m0s"
    cloud-provider: "aws"
    {{- if contains "1.15" .Kubernetes.Version | not }}
    flex-volume-plugin-dir: "/var/lib/kubelet/volumeplugins/"
    {{- end }}
dns:
  type: CoreDNS
etcd:
  ${etcd_type}:
    endpoints:
      ${endpoints}
    caFile: ${etcd_cafile}
    certFile: "/etc/kubernetes/pki/apiserver-etcd-client.crt"
    keyFile: "/etc/kubernetes/pki/apiserver-etcd-client.key"
imageRepository: registry.k8s.io
kubernetesVersion: "${k8s_version}"
networking:
  dnsDomain: cluster.local
  podSubnet: "${pod_subnet}"
  serviceSubnet: "100.64.0.0/13"
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
bootstrapTokens:
  - token: "${kubeadm_token}"
    description: "kubeadm bootstrap token"
    ttl: "43800h"
nodeRegistration:
  criSocket: "unix:///var/run/containerd/containerd.sock"
  kubeletExtraArgs:
    container-runtime: remote
    container-runtime-endpoint: unix:///run/containerd/containerd.sock
  ignorePreflightErrors:
    - IsPrivilegedUser
localAPIEndpoint:
  bindPort: 443
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
Now we are planning to move to the out-of-tree AWS cloud controller manager. Could you please guide us on what changes we need to make to migrate from in-tree to out-of-tree? Currently we have deployed the aws-cloud-controller-manager DaemonSet and its pods are running, but kube-controller-manager is also still running with the above configuration.
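As a starting point for such a migration, a small checker can list every place in the files above that still selects the in-tree provider. This is only a sketch: the function name and the regular expression are hypothetical, the sample text is a shortened version of the settings shown in this comment, and it covers only the `cloud-provider` settings discussed in this thread rather than a complete migration.

```python
import re

# Matches "--cloud-provider=aws", "cloud-provider: aws" and
# 'cloud-provider: "aws"', but not "cloud-provider: external".
IN_TREE_AWS = re.compile(r'cloud-provider\s*[:=]\s*"?aws\b')


def find_in_tree_settings(text: str) -> list:
    """Return (line number, stripped line) pairs that still select the
    in-tree AWS provider, skipping lines that are commented out."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue
        if IN_TREE_AWS.search(line):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Shortened sample based on the drop-in and kubeadm-config above.
    sample = (
        'Environment="KUBELET_EXTRA_ARGS=--cloud-provider=aws --cgroup-driver=systemd"\n'
        "    cloud-provider: aws\n"
        '    cloud-provider: "aws"\n'
        "    cloud-provider: external\n"
    )
    for lineno, line in find_in_tree_settings(sample):
        print(lineno, line)
```

Run against the configuration shared here, it would flag the kubelet drop-in plus the apiServer and controllerManager extraArgs, which are the places the earlier replies suggest switching to external.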