awslabs/aws-orbit-workbench

[BUG] - Pod imagereplication-operator stuck in Init:0/1

bozethe opened this issue · 8 comments

Describe the bug
I deployed the orbit environment stack using orbit deploy env -f default-manifest.yaml; however, the pipeline fails while waiting for the EKS pods to become ready. All the pods become ready (please see the attached screenshots) except one (imagereplication-operator), which never finishes initializing and stays on "Init:0/1". I have included more logs at the end of this report.

To Reproduce
https://awslabs.github.io/aws-orbit-workbench/deploy-steps

  1. Clone AWS Labs github repository
  2. Install the CLI
    only once
  3. Install AWS CodeSeeder
    only once
  4. Generate a new manifest
    once created, you will add / remove from this manifest as your platform changes
  5. Deploy a new foundation
    you may have an existing foundation (VPC, Subnets, EFS, and Cognito) that can be leveraged
    this is OPTIONAL if you have the necessary components
  6. Deploy a new toolkit
    only once
  7. Deploy credentials
    only once
    this is OPTIONAL
  8. Deploy docker images
    once deployed, you may deploy one or all the base workbench images as needed
  9. Deploy environment

Expected behavior
I expect to see the pod report as ready:


orbit-system imagereplication-operator-658d9cdf94-jwfx5 1/1 Init:1/1


Screenshots
kubectl get pods -A
(screenshot)
kubectl logs
(screenshot)
kubectl describe pod
(screenshot)

Additional context
{"type":"Recreate"},"revisionHistoryLimit":10,"progressDeadlineSeconds":600},"status":{"observedGeneration":1,"replicas":1,"updatedReplicas":1,"unavailableReplicas":1,"conditions":[{"type":"Available","status":"False","lastUpdateTime":"2022-07-21T13:29:18Z","lastTransitionTime":"2022-07-21T13:29:18Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."},{"type":"Progressing","status":"True","lastUpdateTime":"2022-07-21T13:47:21Z","lastTransitionTime":"2022-07-21T13:47:21Z","reason":"ReplicaSetUpdated","message":"ReplicaSet \"imagereplication-operator-658d9cdf94\" is progressing."}]}}

[2022-07-21 13:50:21,122][kubectl.py :592] orbit-system/imagereplication-operator not yet ready, sleeping for 1 minute
Traceback (most recent call last):
File "/root/.venv/bin/codeseeder", line 8, in
sys.exit(main())
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
cli()
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
func(*fn_args["args"], **fn_args["kwargs"])
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
deploy_env(env_name=env_name, manifest_dir=manifest_dir)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
return func(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
kubectl.deploy_env(context=context)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 645, in deploy_env
name="imagereplication-operator", namespace="orbit-system", type="deployment", k8s_context=k8s_context
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 595, in _confirm_readiness
raise Exception("Timeout wating for Image Replicator to become ready")
Exception: Timeout wating for Image Replicator to become ready

[Container] 2022/07/21 13:51:21 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 13:51:21 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 13:51:21 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 13:51:21 Entering phase POST_BUILD
[Container] 2022/07/21 13:51:21 Running command . ~/.venv/bin/activate

[Container] 2022/07/21 13:51:21 Running command cd ${CODEBUILD_SRC_DIR}/bundle

[Container] 2022/07/21 13:51:21 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 13:51:21 Phase context status code: Message:

Getting the full YAML or JSON definition of the Pod will also return the Status, which may have additional info. Can you run:

kubectl get pods -n orbit-system -o json imagereplication-operator-658d9cdf94-iwfx5

or

kubectl get pods -n orbit-system -o yaml imagereplication-operator-658d9cdf94-iwfx5

Also, it might be worth deleting the pod and letting the Deployment/ReplicaSet recreate it:

kubectl delete pods -n orbit-system imagereplication-operator-658d9cdf94-iwfx5
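For an Init:0/1 pod, the interesting part of the JSON from the get command above is usually status.initContainerStatuses, where a stuck init container carries a waiting reason (e.g. ImagePullBackOff when the node can't reach a registry). A small helper to pull that out (hypothetical name; the sample data is illustrative, not from this cluster):

```python
def init_container_waiting_reasons(pod):
    """Return {container_name: reason} for init containers stuck in a waiting state."""
    reasons = {}
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            reasons[status["name"]] = waiting.get("reason")
    return reasons

# Illustrative pod fragment, not taken from the cluster in this issue:
sample_pod = {
    "status": {
        "initContainerStatuses": [
            {"name": "init-certs",
             "state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ]
    }
}
print(init_container_waiting_reasons(sample_pod))  # {'init-certs': 'ImagePullBackOff'}
```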

Are you deploying orbit into isolated subnets? The image-replication operator and webhook should only be deployed if orbit is deployed in an isolated environment.

Below is our manifest file. Could you please confirm that it's fine? I believe our private subnets are more like isolated subnets, since they don't have a route to 0.0.0.0/0 (see below).
(screenshot)

Name: orbit
ScratchBucketArn: arn:aws:s3:::orbit-poc0000
UserPoolId: eu-central-000000000
SharedEfsFsId: fs-00000000
SharedEfsSgId: sg-00000000
Networking:
  VpcId: vpc-00000000000
  PublicSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
  PrivateSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Data:
    InternetAccessible: false
    NodesSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Frontend:
    LoadBalancersSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
    #SslCertArn: !SSM ${/orbit-f/orbit/resources::SslCertArn}
Images:
  JupyterUser:
    Repository: 0000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/jupyter-user
    Version: latest
  OrbitController:
    Repository: 00000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/orbit-controller
    Version: latest
  UtilityData:
    Repository: 0000000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/utility-data
    Version: latest
Teams:
  - Name: sample-admin
    Policies:
      - None
    GrantSudo: true
    Fargate: true
    K8Admin: true
    JupyterhubInboundRanges:
      - 0.0.0.0/0
    EfsLifeCycle: AFTER_7_DAYS
    Plugins: !include common_plugins.yaml
    AuthenticationGroups:
      - sample-admin
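For reference, "isolated" here means the subnet's route table has no 0.0.0.0/0 route at all, whereas a conventional private subnet has one via a NAT gateway. That distinction can be expressed as a check over describe-route-tables-shaped data; a sketch with made-up route entries:

```python
def has_default_route(route_table):
    """True if any route targets 0.0.0.0/0, i.e. the subnet has an internet path."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        for route in route_table.get("Routes", [])
    )

# Illustrative route tables (shaped like EC2 describe-route-tables output):
private_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
]}
isolated_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
]}
print(has_default_route(private_rt))   # True
print(has_default_route(isolated_rt))  # False
```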

I checked the pods again now and all of them seem to be running (see below).
(screenshot)

However, when I run the stack again (orbit deploy env), it now fails with a different error (see below).

[2022-07-21 15:46:53,380][k8s.py : 37] Endpoint Subsets: [{'addresses': [{'hostname': None, 'ip': '10.10.10.112', 'node_name': 'ip-10-19-15-31.eu-central-1.compute.internal', 'target_ref': {'api_version': None, 'field_path': None, 'kind': 'Pod', 'name': 'imagereplication-pod-webhook-58df9d9cb4-xtb9s', 'namespace': 'orbit-system', 'resource_version': '9885', 'uid': '3a4560f2-fe0b-47b3-ab3f-5459a9c5f5b8'}}], 'not_ready_addresses': None, 'ports': [{'name': 'https', 'port': 443, 'protocol': 'TCP'}]}]
[2022-07-21 15:46:53,380][kubectl.py :578] Service: imagereplication-pod-webhook Namespace: orbit-system Hostname: None IP: 10.19.13.112
[2022-07-21 15:46:53,380][sh.py : 28] + kubectl rollout restart daemonsets -n orbit-system-ssm-daemons ssm-agent-installer --context AWSCodeBuild-a372ab74-b1d3-478b-871d-546f9f18a0ec@orbit-orbit.eu-central-1.eksctl.io
[2022-07-21 15:46:53,486][sh.py : 30] Error from server (NotFound): namespaces "orbit-system-ssm-daemons" not found
[2022-07-21 15:46:53,487][sh.py : 30]
Traceback (most recent call last):
File "/root/.venv/bin/codeseeder", line 8, in
sys.exit(main())
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
cli()
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
func(*fn_args["args"], **fn_args["kwargs"])
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
deploy_env(env_name=env_name, manifest_dir=manifest_dir)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
return func(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
kubectl.deploy_env(context=context)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 649, in deploy_env
"kubectl rollout restart daemonsets -n orbit-system-ssm-daemons "
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 29, in run
for line in _run_iterating(cmd=cmd, cwd=cwd):
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 23, in _run_iterating
raise FailedShellCommand(f"Exit code: {p.returncode}")
aws_orbit.exceptions.FailedShellCommand: Exit code: 1

[Container] 2022/07/21 15:46:53 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 15:46:53 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 15:46:53 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 15:46:53 Entering phase POST_BUILD
[Container] 2022/07/21 15:46:53 Running command . ~/.venv/bin/activate

[Container] 2022/07/21 15:46:53 Running command cd ${CODEBUILD_SRC_DIR}/bundle

[Container] 2022/07/21 15:46:53 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 15:46:53 Phase context status code: Message:

OK. InternetAccessible: false in your manifest does flag your subnets as "isolated", meaning there's no route to the internet; as you said, you have no route to 0.0.0.0/0. We don't get a lot of deployments set up like that, and there does appear to be a bug in the deployment that will fail if InternetAccessible: false and the SSM Agent isn't installed on the Nodes. Also, you don't have a compute NodeGroup defined in your manifest, so there won't be any nodes to deploy user notebooks on. I recommend adding the following to your manifest, which will create a NodeGroup (customize as you see fit) and force installation of the SSM Agent on the Nodes (a recommended best practice, as it allows SSM to patch the Nodes).

InstallSsmAgent: true
ManagedNodegroups:
  - Name: primary-compute
    InstanceType: m5.2xlarge
    LocalStorageSize: 128
    NodesNumDesired: 1
    NodesNumMax: 4
    NodesNumMin: 0
    Labels:
      instance-type: m5.2xlarge

Once added, deploy the env again.
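The two conditions called out above (isolated networking plus no compute NodeGroup) can be caught before a deploy with a quick manifest sanity check. A sketch, not part of the orbit CLI; the keys follow the manifest shown earlier in this thread:

```python
def check_isolated_manifest(manifest):
    """Return warnings for isolated (InternetAccessible: false) manifests."""
    warnings = []
    data = manifest.get("Networking", {}).get("Data", {})
    if data.get("InternetAccessible") is False:
        if not manifest.get("InstallSsmAgent"):
            warnings.append(
                "isolated subnets: set InstallSsmAgent: true so SSM can reach the nodes")
        if not manifest.get("ManagedNodegroups"):
            warnings.append(
                "no ManagedNodegroups defined: there will be no nodes for user notebooks")
    return warnings

# An isolated manifest missing both keys triggers both warnings:
manifest = {"Networking": {"Data": {"InternetAccessible": False}}}
for warning in check_isolated_manifest(manifest):
    print(warning)
```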

I agree with you that if our VPC and subnets weren't set up the way they are now, the installation would have gone smoothly.
I had to download the cert-manager, cert-manager-cainjector and cert-manager-webhook images, push them to our private ECR, and update the deployment files to point to our ECR instead of the public ECR, in order to avoid failures creating those pods.

The installation managed to get past the previous error after applying your changes. However, this is the new error we get (error log below the screenshot). I included the screenshot to show what is currently running in the cluster.

(screenshot)

2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] deployment.apps/cluster-autoscaler configured
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] Error from server: error when creating ".orbit.out/orbit/kubectl/kube-system/00-observability.yaml": admission webhook "0500-amazon-eks-fargate-configmaps-admission.amazonaws.com" denied the request: Invalid value at auto_create_group On
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,079][sh.py : 30]
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] Traceback (most recent call last):
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/bin/codeseeder", line 8, in
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] sys.exit(main())
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] cli()
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call

(screenshot)

This is not an error we've encountered before, and I was unable to reproduce it. I suppose you could try commenting out or removing line 21 from cli/aws_orbit/data/kubectl/kube_system/00-observability.yaml.

But I don't know why this would be necessary. Haven't you already done a deployment?

Thanks @chamcca