awslabs/aws-orbit-workbench

[BUG] - Pod imagereplication-operator stuck in Init:0/1

bozethe opened this issue · 8 comments

Describe the bug
I deployed the orbit environment stack using orbit deploy env -f default-manifest.yaml; however, the pipeline fails while waiting for the EKS pods to become ready. All the pods become ready (please see the attached screenshots) except one (imagereplication-operator), which never finishes initializing and stays on "Init:0/1". I have included more logs at the end of this report.

To Reproduce
https://awslabs.github.io/aws-orbit-workbench/deploy-steps

  1. Clone AWS Labs github repository
  2. Install the CLI
    only once
  3. Install AWS CodeSeeder
    only once
  4. Generate a new manifest
    once created, you will add / remove from this manifest as your platform changes
  5. Deploy a new foundation
    you may have an existing foundation (VPC, Subnets, EFS, and Cognito) that can be leveraged
    this is OPTIONAL if you have the necessary components
  6. Deploy a new toolkit
    only once
  7. Deploy credentials
    only once
    this is OPTIONAL
  8. Deploy docker images
    once deployed, you may deploy one or all the base workbench images as needed
  9. Deploy environment

Expected behavior
I expect to see the pod report as ready:


orbit-system imagereplication-operator-658d9cdf94-jwfx5 1/1 Init:1/1


Screenshots
kubectl get pods -A
(screenshot)
kubectl logs
(screenshot)
kubectl describe pod
(screenshot)

Additional context
{"type":"Recreate"},"revisionHistoryLimit":10,"progressDeadlineSeconds":600},"status":{"observedGeneration":1,"replicas":1,"updatedReplicas":1,"unavailableReplicas":1,"conditions":[{"type":"Available","status":"False","lastUpdateTime":"2022-07-21T13:29:18Z","lastTransitionTime":"2022-07-21T13:29:18Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."},{"type":"Progressing","status":"True","lastUpdateTime":"2022-07-21T13:47:21Z","lastTransitionTime":"2022-07-21T13:47:21Z","reason":"ReplicaSetUpdated","message":"ReplicaSet \"imagereplication-operator-658d9cdf94\" is progressing."}]}}

[2022-07-21 13:50:21,122][kubectl.py :592] orbit-system/imagereplication-operator not yet ready, sleeping for 1 minute
Traceback (most recent call last):
File "/root/.venv/bin/codeseeder", line 8, in
sys.exit(main())
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
cli()
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
func(*fn_args["args"], **fn_args["kwargs"])
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
deploy_env(env_name=env_name, manifest_dir=manifest_dir)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
return func(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
kubectl.deploy_env(context=context)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 645, in deploy_env
name="imagereplication-operator", namespace="orbit-system", type="deployment", k8s_context=k8s_context
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 595, in _confirm_readiness
raise Exception("Timeout wating for Image Replicator to become ready")
Exception: Timeout wating for Image Replicator to become ready

[Container] 2022/07/21 13:51:21 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 13:51:21 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 13:51:21 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 13:51:21 Entering phase POST_BUILD
[Container] 2022/07/21 13:51:21 Running command . ~/.venv/bin/activate

[Container] 2022/07/21 13:51:21 Running command cd ${CODEBUILD_SRC_DIR}/bundle

[Container] 2022/07/21 13:51:21 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 13:51:21 Phase context status code: Message:

Getting the full YAML or JSON definition of the Pod will also return the Status, which may have additional info. Can you run:

kubectl get pods -n orbit-system -o json imagereplication-operator-658d9cdf94-iwfx5

or

kubectl get pods -n orbit-system -o yaml imagereplication-operator-658d9cdf94-iwfx5

Also, it might be worth deleting the pod and letting the Deployment/ReplicaSet recreate it:

kubectl delete pods -n orbit-system imagereplication-operator-658d9cdf94-iwfx5
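For an Init:0/1 pod, the interesting part of the JSON from the get command above is usually status.initContainerStatuses, where a stuck init container carries a waiting reason (e.g. ImagePullBackOff when the node can't reach a registry). A small helper to pull that out (hypothetical name; the sample data is illustrative, not from this cluster):

```python
def init_container_waiting_reasons(pod):
    """Return {container_name: reason} for init containers stuck in a waiting state."""
    reasons = {}
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            reasons[status["name"]] = waiting.get("reason")
    return reasons

# Illustrative pod fragment, not taken from the cluster in this issue:
sample_pod = {
    "status": {
        "initContainerStatuses": [
            {"name": "init-certs",
             "state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ]
    }
}
print(init_container_waiting_reasons(sample_pod))  # {'init-certs': 'ImagePullBackOff'}
```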

Are you deploying orbit into isolated subnets? The image-replication operator and webhook should only be deployed if orbit is deployed in an isolated environment.

Below is our manifest file. Could you please confirm that it's fine? I believe our private subnets are more like isolated subnets, since they don't have a route to 0.0.0.0/0 (see below).
(screenshot)

Name: orbit
ScratchBucketArn: arn:aws:s3:::orbit-poc0000
UserPoolId: eu-central-000000000
SharedEfsFsId: fs-00000000
SharedEfsSgId: sg-00000000
Networking:
  VpcId: vpc-00000000000
  PublicSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
  PrivateSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Data:
    InternetAccessible: false
    NodesSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Frontend:
    LoadBalancersSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
    #SslCertArn: !SSM ${/orbit-f/orbit/resources::SslCertArn}
Images:
  JupyterUser:
    Repository: 0000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/jupyter-user
    Version: latest
  OrbitController:
    Repository: 00000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/orbit-controller
    Version: latest
  UtilityData:
    Repository: 0000000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/utility-data
    Version: latest
Teams:
  - Name: sample-admin
    Policies:
      - None
    GrantSudo: true
    Fargate: true
    K8Admin: true
    JupyterhubInboundRanges:
      - 0.0.0.0/0
    EfsLifeCycle: AFTER_7_DAYS
    Plugins: !include common_plugins.yaml
    AuthenticationGroups:
      - sample-admin
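For reference, "isolated" here means the subnet's route table has no 0.0.0.0/0 route at all, whereas a conventional private subnet has one via a NAT gateway. That distinction can be expressed as a check over describe-route-tables-shaped data; a sketch with made-up route entries:

```python
def has_default_route(route_table):
    """True if any route targets 0.0.0.0/0, i.e. the subnet has an internet path."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        for route in route_table.get("Routes", [])
    )

# Illustrative route tables (shaped like EC2 describe-route-tables output):
private_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
]}
isolated_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
]}
print(has_default_route(private_rt))   # True
print(has_default_route(isolated_rt))  # False
```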

I checked the pods again now and all of them seem to be running (see below).
(screenshot)

However, when I run the stack again (orbit deploy env), it now fails with a different error (see below).

[2022-07-21 15:46:53,380][k8s.py : 37] Endpoint Subsets: [{'addresses': [{'hostname': None, 'ip': '10.10.10.112', 'node_name': 'ip-10-19-15-31.eu-central-1.compute.internal', 'target_ref': {'api_version': None, 'field_path': None, 'kind': 'Pod', 'name': 'imagereplication-pod-webhook-58df9d9cb4-xtb9s', 'namespace': 'orbit-system', 'resource_version': '9885', 'uid': '3a4560f2-fe0b-47b3-ab3f-5459a9c5f5b8'}}], 'not_ready_addresses': None, 'ports': [{'name': 'https', 'port': 443, 'protocol': 'TCP'}]}]
[2022-07-21 15:46:53,380][kubectl.py :578] Service: imagereplication-pod-webhook Namespace: orbit-system Hostname: None IP: 10.19.13.112
[2022-07-21 15:46:53,380][sh.py : 28] + kubectl rollout restart daemonsets -n orbit-system-ssm-daemons ssm-agent-installer --context AWSCodeBuild-a372ab74-b1d3-478b-871d-546f9f18a0ec@orbit-orbit.eu-central-1.eksctl.io
[2022-07-21 15:46:53,486][sh.py : 30] Error from server (NotFound): namespaces "orbit-system-ssm-daemons" not found
[2022-07-21 15:46:53,487][sh.py : 30]
Traceback (most recent call last):
File "/root/.venv/bin/codeseeder", line 8, in
sys.exit(main())
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
cli()
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
func(*fn_args["args"], **fn_args["kwargs"])
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
deploy_env(env_name=env_name, manifest_dir=manifest_dir)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
return func(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
kubectl.deploy_env(context=context)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 649, in deploy_env
"kubectl rollout restart daemonsets -n orbit-system-ssm-daemons "
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 29, in run
for line in _run_iterating(cmd=cmd, cwd=cwd):
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 23, in _run_iterating
raise FailedShellCommand(f"Exit code: {p.returncode}")
aws_orbit.exceptions.FailedShellCommand: Exit code: 1

[Container] 2022/07/21 15:46:53 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 15:46:53 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 15:46:53 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 15:46:53 Entering phase POST_BUILD
[Container] 2022/07/21 15:46:53 Running command . ~/.venv/bin/activate

[Container] 2022/07/21 15:46:53 Running command cd ${CODEBUILD_SRC_DIR}/bundle

[Container] 2022/07/21 15:46:53 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 15:46:53 Phase context status code: Message:

OK. InternetAccessible: false in your manifest does flag your subnets as "isolated", meaning there's no route to the internet; as you said, you have no route to 0.0.0.0/0. We don't get a lot of deployments set up like that, and there does appear to be a bug in the deployment that will fail if InternetAccessible: false and the SSM Agent isn't installed on the Nodes. Also, you don't have a compute NodeGroup defined in your manifest, so there won't be any nodes to deploy user notebooks on. I recommend adding the following to your manifest, which will create a NodeGroup (customize as you see fit) and force installation of the SSM Agent on the Nodes (a recommended best practice, as it allows SSM to patch the Nodes).

InstallSsmAgent: true
ManagedNodegroups:
  - Name: primary-compute
    InstanceType: m5.2xlarge
    LocalStorageSize: 128
    NodesNumDesired: 1
    NodesNumMax: 4
    NodesNumMin: 0
    Labels:
      instance-type: m5.2xlarge

Once added, deploy the env again.
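The two conditions called out above (isolated networking plus no compute NodeGroup) can be caught before a deploy with a quick manifest sanity check. A sketch, not part of the orbit CLI; the keys follow the manifest shown earlier in this thread:

```python
def check_isolated_manifest(manifest):
    """Return warnings for isolated (InternetAccessible: false) manifests."""
    warnings = []
    data = manifest.get("Networking", {}).get("Data", {})
    if data.get("InternetAccessible") is False:
        if not manifest.get("InstallSsmAgent"):
            warnings.append(
                "isolated subnets: set InstallSsmAgent: true so SSM can reach the nodes")
        if not manifest.get("ManagedNodegroups"):
            warnings.append(
                "no ManagedNodegroups defined: there will be no nodes for user notebooks")
    return warnings

# An isolated manifest missing both keys triggers both warnings:
manifest = {"Networking": {"Data": {"InternetAccessible": False}}}
for warning in check_isolated_manifest(manifest):
    print(warning)
```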

I agree with you that if our VPC and subnets weren't set up the way they are now, the installation would have gone smoothly.
I had to download the cert-manager, cert-manager-cainjector and cert-manager-webhook images, push them to our private ECR, and update the deployment files to point to our ECR instead of the public ECR, in order to avoid failures creating those pods.

The installation managed to get past the previous error after applying your changes. However, this is the new error we get (error log below the screenshot). I included the screenshot to show what is currently running in the cluster.

(screenshot)

2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] deployment.apps/cluster-autoscaler configured
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] Error from server: error when creating ".orbit.out/orbit/kubectl/kube-system/00-observability.yaml": admission webhook "0500-amazon-eks-fargate-configmaps-admission.amazonaws.com" denied the request: Invalid value at auto_create_group On
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,079][sh.py : 30]
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] Traceback (most recent call last):
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/bin/codeseeder", line 8, in
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] sys.exit(main())
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] cli()
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call

(screenshot)

This is not an error we've encountered before, and I was unable to reproduce it. I suppose you could try commenting out or removing line 21 from cli/aws_orbit/data/kubectl/kube_system/00-observability.yaml.

But I don't know why this would be necessary. Haven't you already done a deployment?

Thanks @chamcca