kubernetes/kops

Kops on a disconnected environment

Opened this issue · 5 comments

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.26.3

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.26.4

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Manage your own security group and allow egress traffic only for internal communication (block 0.0.0.0/0 and allow only the VPC CIDR), then run:

 kops update cluster **** --yes --lifecycle-overrides SecurityGroup=Ignore,SecurityGroupRule=Ignore
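The restricted egress setup can be sketched with the AWS CLI (the group ID is a placeholder; the CIDR matches the cluster's networkCIDR of 172.20.0.0/16):

```shell
# Illustrative sketch: drop the default allow-all egress rule and
# permit egress only inside the VPC CIDR. The group ID is hypothetical.
aws ec2 revoke-security-group-egress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
aws ec2 authorize-security-group-egress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=172.20.0.0/16}]'
```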

5. What happened after the commands executed?
kops update exceeded its timeout.

6. What did you expect to happen?
When SSHing into the master node, the nodeup process exits with the following error:

Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.305209    1035 s3context.go:192] unable to get bucket location from region "us-east-1"; scanning all regions: RequestError: send request failed
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: caused by: Get "https://s3.dualstack.us-east-1.amazonaws.com/r*****?location=": dial tcp 52.217.230.168:443: i/o timeout
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.374846    1035 s3context.go:298] Querying S3 for bucket location for ****
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.374904    1035 s3context.go:303] Doing GetBucketLocation in "eu-west-3"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.374911    1035 s3context.go:303] Doing GetBucketLocation in "us-west-2"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.374930    1035 s3context.go:303] Doing GetBucketLocation in "eu-west-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.375066    1035 s3context.go:303] Doing GetBucketLocation in "ca-central-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378346    1035 s3context.go:303] Doing GetBucketLocation in "ap-northeast-3"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378520    1035 s3context.go:303] Doing GetBucketLocation in "us-east-2"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378718    1035 s3context.go:303] Doing GetBucketLocation in "eu-south-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378767    1035 s3context.go:303] Doing GetBucketLocation in "us-west-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378885    1035 s3context.go:303] Doing GetBucketLocation in "eu-central-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378406    1035 s3context.go:303] Doing GetBucketLocation in "ap-south-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378418    1035 s3context.go:303] Doing GetBucketLocation in "eu-north-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378439    1035 s3context.go:303] Doing GetBucketLocation in "ap-northeast-2"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378454    1035 s3context.go:303] Doing GetBucketLocation in "ap-northeast-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378472    1035 s3context.go:303] Doing GetBucketLocation in "us-east-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378481    1035 s3context.go:303] Doing GetBucketLocation in "sa-east-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378490    1035 s3context.go:303] Doing GetBucketLocation in "ap-southeast-1"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.378498    1035 s3context.go:303] Doing GetBucketLocation in "ap-southeast-2"
Apr  5 08:03:24 ip-172-20-10-182 nodeup[1035]: I0405 08:03:24.379255    1035 s3context.go:303] Doing GetBucketLocation in "eu-west-2"
Apr  5 08:03:29 ip-172-20-10-182 nodeup[1035]: W0405 08:03:29.375004    1035 main.go:133] got error running nodeup (will retry in 30s): error loading Cluster "s3://****/******/cluster-completed.spec": Could not retrieve location for AWS bucket *****

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2024-04-05T07:23:05Z"
  name: ********
spec:
  additionalPolicies: {}
  api:
    loadBalancer:
      class: Classic
      securityGroupOverride: sg-*****
      type: Public
  assets:
    containerRegistry: *******.dkr.ecr.us-east-1.amazonaws.com/kops
    fileRepository: https://s3.us-east-1.amazonaws.com/******
  authorization:
    rbac: {}
  cloudProvider: aws
  configBase: s3://*****/******
  containerd:
    configOverride: |2
            version = 2
            [plugins]
              [plugins."io.containerd.grpc.v1.cri"]
                sandbox_image = "*****.dkr.ecr.us-east-1.amazonaws.com/kops/pause:3.9@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097"
              [plugins."io.containerd.grpc.v1.cri".registry.mirrors."*******.dkr.ecr.us-east-1.amazonaws.com"]
                endpoint = ["https://******.dkr.ecr.us-east-1.amazonaws.com"]
                [plugins."io.containerd.grpc.v1.cri".registry.configs."******.dkr.ecr.us-east-1.amazonaws.com".auth]
                  username = "AWS"
                  password = "******"
                [plugins."io.containerd.grpc.v1.cri".containerd]
                  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
                    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                      runtime_type = "io.containerd.runc.v2"
                      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                        SystemdCgroup = true
  dnsZone: *****
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-1
      name: master-1
    name: main
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeProxy:
    enabled: true
  kubelet:
    anonymousAuth: false
  kubernetesVersion: 1.26.4
  masterPublicName: api.*****
  networkCIDR: 172.20.0.0/16
  networkID: vpc-*****
  networking:
    calico: {}
  nodeTerminationHandler:
    enableSpotInterruptionDraining: false
    enabled: false
  nonMasqueradeCIDR: 100.64.0.0/10
  sshKeyName: *****
  subnets:
  - cidr: 172.20.10.0/24
    id: subnet-*****
    name: us-east-1b
    type: Public
    zone: us-east-1b
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-05T07:23:08Z"
  labels:
    kops.k8s.io/cluster: *****
  name: master-1
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240126
  kubelet:
    anonymousAuth: false
    nodeLabels:
      kops.k8s.io/kops-controller-pki: ""
      node-role.kubernetes.io/control-plane: ""
      node.kubernetes.io/exclude-from-external-load-balancers: ""
    taints:
    - node-role.kubernetes.io/control-plane=:NoSchedule
  machineType: m5.xlarge
  manager: CloudGroup
  maxSize: 1
  minSize: 1
  role: Master
  securityGroupOverride: ******
  subnets:
  - us-east-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-05T07:23:08Z"
  labels:
    kops.k8s.io/cluster: *****
  name: node
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240126
  kubelet:
    anonymousAuth: false
    nodeLabels:
      node-role.kubernetes.io/node: ""
  machineType: c6i.2xlarge
  manager: CloudGroup
  maxSize: 2
  minSize: 2
  nodeLabels:
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
    nvidia.com/gpu.deploy.device-plugin: "true"
  packages:
  - nfs-common
  role: Node
  securityGroupOverride: sg-*****
  subnets:
  - us-east-1b

I have created an Interface-type VPC endpoint for S3, but none of its DNS records cover the dualstack hostname:

*.vpce-*****.s3.us-east-1.vpce.amazonaws.com
*.vpce-*****-us-east-1b.s3.us-east-1.vpce.amazonaws.com
s3.us-east-1.amazonaws.com
*.s3.us-east-1.amazonaws.com
*.s3-accesspoint.us-east-1.amazonaws.com
*.s3-control.us-east-1.amazonaws.com
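To confirm the mismatch from inside the VPC, one can compare how the two hostnames resolve (a diagnostic sketch; the actual output depends on the endpoint's private DNS configuration):

```shell
# The interface endpoint's private DNS answers for the Regional
# hostname, but the dualstack hostname that nodeup tries first is not
# in the list above, so it resolves to public S3 addresses that the
# restricted security group blocks.
dig +short s3.us-east-1.amazonaws.com
dig +short s3.dualstack.us-east-1.amazonaws.com
```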

It's not clear to me how this is a kops bug.

There is no way to set up kops for a disconnected environment... I can open a feature request if you want.

There is a way to install kops in a disconnected environment, but you must copy all assets first. It can be installed without any internet connectivity; you only need connectivity to a single object storage endpoint.

https://kops.sigs.k8s.io/operations/asset-repository/

You also need to use kops channel: none (I cannot see this in your spec at all, so it is not none in your case; the default value is stable).
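For reference, an air-gapped setup corresponds to roughly these cluster spec fields (a sketch based on the asset repository documentation; the registry and bucket names are placeholders):

```yaml
# channel: none stops kops from fetching the channel over the internet;
# the assets stanza points nodeup at the mirrored files and images.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
spec:
  channel: "none"
  assets:
    containerRegistry: <account>.dkr.ecr.us-east-1.amazonaws.com/kops
    fileRepository: https://s3.us-east-1.amazonaws.com/<asset-bucket>
```

The asset repository docs linked above also describe copying the required files and images into these locations (e.g. with kops get assets --copy).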

@zetaab Although I have added all asset files and container images to S3 and ECR and configured kops to use them, the nodeup logs show an error when retrieving cluster-completed.spec from S3, even though I configured an S3 VPC endpoint.

That's because kops uses the s3://bucket-name scheme, while the S3 VPC endpoint uses the full S3 DNS name (bucket-name.s3.us-east-1.amazonaws.com).

As a result, kops cannot be used in a disconnected environment on AWS.

W0412 06:49:07.558115    1040 main.go:133] got error running nodeup (will retry in 30s): error loading Cluster "s3://kops-state-****/*****/cluster-completed.spec": file does not exist