antonputra/tutorials

Lesson 102: Autoscaler in 'Not Ready' State for AWS EKS 1.25


Hi Anton,

I have created the AWS EKS resources based on this, and the cluster that gets created is on version 1.25.

I'm also able to apply all the files from this directory after updating the role ARN, the container image (to registry.k8s.io/autoscaling/cluster-autoscaler:v1.25.1), and the cluster name.
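
To sanity-check that those three edits actually landed in the cluster, I read them back with jsonpath queries roughly like this (a sketch; the eks.amazonaws.com/role-arn annotation key assumes the standard IRSA setup from the lesson's manifests):

# Image the Deployment is actually running
kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# IRSA role ARN annotation on the autoscaler's service account
kubectl -n kube-system get serviceaccount cluster-autoscaler \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}'

# Cluster name in the container's flags (the flag list may live under command or args)
kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].command}{"\n"}'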

But when I run kubectl get all -A -n kube-system, I get:

NAMESPACE     NAME                           READY   STATUS    RESTARTS   AGE
default       pod/nginx-7f85bb5c99-77kdl     1/1     Running   0          153m
kube-system   pod/aws-node-drj5z             1/1     Running   0          21h
kube-system   pod/coredns-7975d6fb9b-b9nrw   1/1     Running   0          21h
kube-system   pod/coredns-7975d6fb9b-qdk4r   1/1     Running   0          21h
kube-system   pod/kube-proxy-86tw4           1/1     Running   0          21h

NAMESPACE     NAME                 TYPE           CLUSTER-IP       EXTERNAL-IP                             PORT(S)         AGE
default       service/kubernetes   ClusterIP      172.20.0.1       <none>                                  443/TCP         21h
default       service/private-lb   LoadBalancer   1.2.3.4          long-text.elb.us-east-1.amazonaws.com   80:30528/TCP    137m
default       service/public-lb    LoadBalancer   1.2.3.4          long-text.elb.us-east-1.amazonaws.com   80:32750/TCP    137m
kube-system   service/kube-dns     ClusterIP      172.20.0.10      <none>                                  53/UDP,53/TCP   21h

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/aws-node     1         1         1       1            1           <none>          21h
kube-system   daemonset.apps/kube-proxy   1         1         1       1            1           <none>          21h

NAMESPACE     NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
default       deployment.apps/nginx                1/1     1            1           153m
kube-system   deployment.apps/cluster-autoscaler   0/1     0            0           8m58s
kube-system   deployment.apps/coredns              2/2     2            2           21h

NAMESPACE     NAME                                            DESIRED   CURRENT   READY   AGE
default       replicaset.apps/nginx-7f85bb5c99                1         1         1       153m
kube-system   replicaset.apps/cluster-autoscaler-6cf6d855c5   1         0         0       8m58s
kube-system   replicaset.apps/coredns-7975d6fb9b              2         2         2       21h

And when I try to tail the logs with kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler, it times out with error: timed out waiting for the condition.

How can I fix this error?
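
For reference, since the ReplicaSet shows CURRENT 0, there is presumably no pod yet for kubectl logs to attach to, so I've been digging with the following instead (resource names taken from the output above):

# Why isn't the ReplicaSet creating a pod?
kubectl -n kube-system describe deployment cluster-autoscaler
kubectl -n kube-system describe replicaset cluster-autoscaler-6cf6d855c5

# Recent events in the namespace usually show the concrete failure reason
kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -n 20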

Additionally, based on the AWS autoscaling webpage, I modified this file to:

data "aws_iam_policy_document" "eks_cluster_autoscaler_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    effect  = "Allow"

    condition {
      test     = "StringEquals"
      variable = "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub"
      values   = ["system:serviceaccount:kube-system:cluster-autoscaler"]
    }

    principals {
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
      type        = "Federated"
    }
  }
}


resource "aws_iam_role" "eks_cluster_autoscaler" {
  assume_role_policy = data.aws_iam_policy_document.eks_cluster_autoscaler_assume_role_policy.json
  name               = "eks-cluster-autoscaler"
}

resource "aws_iam_policy" "eks_cluster_autoscaler" {
  name = "eks-cluster-autoscaler"

  policy = jsonencode({
    Statement = [
      {
        Action = [
          "autoscaling:DescribeAutoScalingInstances",
          "autoscaling:DescribeAutoScalingGroups",
          "ec2:DescribeLaunchTemplateVersions",
          "autoscaling:DescribeTags",
          "autoscaling:DescribeLaunchConfigurations",
          "ec2:DescribeInstanceTypes",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "autoscaling:SetDesiredCapacity",
          "autoscaling:TerminateInstanceInAutoScalingGroup",
        ]
        Effect = "Allow"
        Resource = "*"
        Condition = {
          "StringEquals" = {
            "aws:ResourceTag/k8s.io/cluster-autoscaler/${var.cluster_name}": "owned"
          }
        }
      }
    ]
    Version = "2012-10-17"
  })
}

resource "aws_iam_role_policy_attachment" "eks_cluster_autoscaler_attach" {
  role       = aws_iam_role.eks_cluster_autoscaler.name
  policy_arn = aws_iam_policy.eks_cluster_autoscaler.arn
}

output "eks_cluster_autoscaler_arn" {
  value = aws_iam_role.eks_cluster_autoscaler.arn
}
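
On the AWS side, I've also been reading back what Terraform created to make sure the trust policy and the policy attachment look right (a sketch; the role name matches the aws_iam_role resource above):

# Trust policy on the role: should show the Federated OIDC principal and the :sub condition
aws iam get-role --role-name eks-cluster-autoscaler \
  --query 'Role.AssumeRolePolicyDocument' --output json

# Confirm the autoscaler policy is attached to the role
aws iam list-attached-role-policies --role-name eks-cluster-autoscaler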

But I'm still seeing the same error.

Figured out the fix!
I performed a clean run, and the platform and deployments are working correctly.

This is my pull request: #155

Thanks for the PR, it's my code that's slightly outdated.

Merged

Thank you!