[BUG] Pod imagereplication-operator stuck in Init:0/1
bozethe opened this issue · 8 comments
Describe the bug
I deployed the Orbit environment stack using (orbit deploy env -f default-manifest.yaml); however, the pipeline fails while waiting for the EKS pods to become ready. All the pods become ready (please see the attached screenshots) except one (imagereplication-operator), which never finishes initializing and stays on "Init:0/1". I have included more logs at the end of this report.
To Reproduce
https://awslabs.github.io/aws-orbit-workbench/deploy-steps
- Clone the AWS Labs GitHub repository
- Install the CLI (only once)
- Install AWS CodeSeeder (only once)
- Generate a new manifest (once created, you will add / remove from this manifest as your platform changes)
- Deploy a new foundation (you may have an existing foundation (VPC, Subnets, EFS, and Cognito) that can be leveraged; this is OPTIONAL if you have the necessary components)
- Deploy a new toolkit (only once)
- Deploy credentials (only once; this is OPTIONAL)
- Deploy docker images (once deployed, you may deploy one or all the base workbench images as needed)
- Deploy environment (rough CLI equivalents are sketched below)
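Roughly, the CLI sequence behind these steps is the following sketch; the package names are assumptions and the intermediate sub-commands are paraphrased from the linked guide, so follow the guide for the exact commands and flags:
pip install aws-orbit        # Orbit CLI (PyPI package name assumed)
pip install aws-codeseeder   # AWS CodeSeeder
# ...manifest / foundation / toolkit / credentials / images steps per the linked guide...
orbit deploy env -f default-manifest.yaml   # the step that fails in this report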
Expected behavior
I am expecting to see the pod reported as running:
orbit-system imagereplication-operator-658d9cdf94-jwfx5 1/1 Init:1/1
Screenshots
kubectl get pods -A
kubectl logs
kubectl describe pod
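For reference, the full forms of the commands behind these screenshots would look roughly like this (the pod name is taken from the output above; the init container name is a placeholder):
kubectl get pods -A
kubectl describe pod -n orbit-system imagereplication-operator-658d9cdf94-jwfx5
kubectl logs -n orbit-system imagereplication-operator-658d9cdf94-jwfx5 -c <init-container-name>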
Additional context
{"type":"Recreate"},"revisionHistoryLimit":10,"progressDeadlineSeconds":600},"status":{"observedGeneration":1,"replicas":1,"updatedReplicas":1,"unavailableReplicas":1,"conditions":[{"type":"Available","status":"False","lastUpdateTime":"2022-07-21T13:29:18Z","lastTransitionTime":"2022-07-21T13:29:18Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."},{"type":"Progressing","status":"True","lastUpdateTime":"2022-07-21T13:47:21Z","lastTransitionTime":"2022-07-21T13:47:21Z","reason":"ReplicaSetUpdated","message":"ReplicaSet "imagereplication-operator-658d9cdf94" is progressing."}]}}
[2022-07-21 13:50:21,122][kubectl.py :592] orbit-system/imagereplication-operator not yet ready, sleeping for 1 minute
Traceback (most recent call last):
File "/root/.venv/bin/codeseeder", line 8, in
sys.exit(main())
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
cli()
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
func(*fn_args["args"], **fn_args["kwargs"])
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
deploy_env(env_name=env_name, manifest_dir=manifest_dir)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
return func(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
kubectl.deploy_env(context=context)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 645, in deploy_env
name="imagereplication-operator", namespace="orbit-system", type="deployment", k8s_context=k8s_context
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 595, in _confirm_readiness
raise Exception("Timeout wating for Image Replicator to become ready")
Exception: Timeout wating for Image Replicator to become ready
[Container] 2022/07/21 13:51:21 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 13:51:21 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 13:51:21 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 13:51:21 Entering phase POST_BUILD
[Container] 2022/07/21 13:51:21 Running command . ~/.venv/bin/activate
[Container] 2022/07/21 13:51:21 Running command cd ${CODEBUILD_SRC_DIR}/bundle
[Container] 2022/07/21 13:51:21 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 13:51:21 Phase context status code: Message:
Getting the full YAML or JSON definition of the Pod will also return the Status, which may have additional info. Can you run:
kubectl get pods -n orbit-system -o json imagereplication-operator-658d9cdf94-iwfx5
or
kubectl get pods -n orbit-system -o yaml imagereplication-operator-658d9cdf94-iwfx5
Also, it might be worth trying to delete the pod and letting the Deployment/ReplicaSet recreate it:
kubectl delete pods -n orbit-system imagereplication-operator-658d9cdf94-iwfx5
Are you deploying Orbit into isolated subnets? The image-replication operator and webhook should only be deployed if Orbit is deployed in an isolated environment.
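One quick way to confirm whether a subnet is isolated is to list its routes (assuming the AWS CLI is configured for the right account and region; substitute your subnet ID):
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=<subnet-id> --query "RouteTables[].Routes[]"
An isolated subnet will have no 0.0.0.0/0 route to an internet or NAT gateway. (If the command returns nothing, the subnet is implicitly associated with the VPC's main route table.)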
Below is our manifest file. Could you please confirm if it's fine? I believe our private subnets are more like isolated subnets, since they don't have a route to 0.0.0.0/0 (see below).
Name: orbit
ScratchBucketArn: arn:aws:s3:::orbit-poc0000
UserPoolId: eu-central-000000000
SharedEfsFsId: fs-00000000
SharedEfsSgId: sg-00000000
Networking:
  VpcId: vpc-00000000000
  PublicSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
  PrivateSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Data:
    InternetAccessible: false
    NodesSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Frontend:
    LoadBalancersSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
    #SslCertArn: !SSM ${/orbit-f/orbit/resources::SslCertArn}
Images:
  JupyterUser:
    Repository: 0000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/jupyter-user
    Version: latest
  OrbitController:
    Repository: 00000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/orbit-controller
    Version: latest
  UtilityData:
    Repository: 0000000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/utility-data
    Version: latest
Teams:
  - Name: sample-admin
    Policies:
      - None
    GrantSudo: true
    Fargate: true
    K8Admin: true
    JupyterhubInboundRanges:
      - 0.0.0.0/0
    EfsLifeCycle: AFTER_7_DAYS
    Plugins: !include common_plugins.yaml
    AuthenticationGroups:
      - sample-admin
      - None
I checked the pods again now and all of them seem to be running (see below).
However, when I run the stack again (orbit deploy env) it now fails with a different error (see below).
[2022-07-21 15:46:53,380][k8s.py : 37] Endpoint Subsets: [{'addresses': [{'hostname': None, 'ip': '10.10.10.112', 'node_name': 'ip-10-19-15-31.eu-central-1.compute.internal', 'target_ref': {'api_version': None, 'field_path': None, 'kind': 'Pod', 'name': 'imagereplication-pod-webhook-58df9d9cb4-xtb9s', 'namespace': 'orbit-system', 'resource_version': '9885', 'uid': '3a4560f2-fe0b-47b3-ab3f-5459a9c5f5b8'}}], 'not_ready_addresses': None, 'ports': [{'name': 'https', 'port': 443, 'protocol': 'TCP'}]}]
[2022-07-21 15:46:53,380][kubectl.py :578] Service: imagereplication-pod-webhook Namespace: orbit-system Hostname: None IP: 10.19.13.112
[2022-07-21 15:46:53,380][sh.py : 28] + kubectl rollout restart daemonsets -n orbit-system-ssm-daemons ssm-agent-installer --context AWSCodeBuild-a372ab74-b1d3-478b-871d-546f9f18a0ec@orbit-orbit.eu-central-1.eksctl.io
[2022-07-21 15:46:53,486][sh.py : 30] Error from server (NotFound): namespaces "orbit-system-ssm-daemons" not found
[2022-07-21 15:46:53,487][sh.py : 30]
Traceback (most recent call last):
File "/root/.venv/bin/codeseeder", line 8, in
sys.exit(main())
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
cli()
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
func(*fn_args["args"], **fn_args["kwargs"])
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
deploy_env(env_name=env_name, manifest_dir=manifest_dir)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
return func(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
kubectl.deploy_env(context=context)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 649, in deploy_env
"kubectl rollout restart daemonsets -n orbit-system-ssm-daemons "
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 29, in run
for line in _run_iterating(cmd=cmd, cwd=cwd):
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 23, in _run_iterating
raise FailedShellCommand(f"Exit code: {p.returncode}")
aws_orbit.exceptions.FailedShellCommand: Exit code: 1
[Container] 2022/07/21 15:46:53 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 15:46:53 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 15:46:53 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 15:46:53 Entering phase POST_BUILD
[Container] 2022/07/21 15:46:53 Running command . ~/.venv/bin/activate
[Container] 2022/07/21 15:46:53 Running command cd ${CODEBUILD_SRC_DIR}/bundle
[Container] 2022/07/21 15:46:53 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 15:46:53 Phase context status code: Message:
OK. InternetAccessible: false in your manifest does flag your subnets as "isolated", meaning there's no route to the internet; as you said, you have no route to 0.0.0.0/0. We don't get a lot of deployments set up like that, and there does appear to be a bug in the deployment that will fail if InternetAccessible: false and the SSM Agent isn't installed on the Nodes. Also, you don't have a compute NodeGroup defined in your manifest, so there won't be any nodes to deploy user notebooks on. I recommend adding the following to your manifest, which will create a NodeGroup (customize as you see fit) and force installation of the SSM Agent on the Nodes (a recommended best practice, as it can allow SSM to patch the Nodes).
InstallSsmAgent: true
ManagedNodegroups:
  - Name: primary-compute
    InstanceType: m5.2xlarge
    LocalStorageSize: 128
    NodesNumDesired: 1
    NodesNumMax: 4
    NodesNumMin: 0
    Labels:
      instance-type: m5.2xlarge
Once added, deploy the env again.
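For example, by re-running the same command used in the original report:
orbit deploy env -f default-manifest.yaml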
I agree with you: if our VPC and subnets weren't set up the way they are now, the installation would have gone smoothly.
I had to download the cert-manager, cert-manager-cainjector, and cert-manager-webhook images, push them to our private ECR, and update the deployment files to point to our ECR instead of the public registry, to avoid failures creating those pods; a rough sketch of the mirroring steps is below.
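The mirroring looked roughly like this (the account ID, repository names, and version tag are placeholders, and I'm assuming the upstream images at quay.io/jetstack; the target ECR repositories need to exist first, e.g. via aws ecr create-repository):
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 000000000000.dkr.ecr.eu-central-1.amazonaws.com
docker pull quay.io/jetstack/cert-manager-controller:<version>
docker tag quay.io/jetstack/cert-manager-controller:<version> 000000000000.dkr.ecr.eu-central-1.amazonaws.com/cert-manager-controller:<version>
docker push 000000000000.dkr.ecr.eu-central-1.amazonaws.com/cert-manager-controller:<version>
The same pull/tag/push was repeated for cert-manager-cainjector and cert-manager-webhook, and the image references in the cert-manager deployment manifests were then updated to the private ECR URIs.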
The installation got past the previous error after applying your changes. However, this is the new error we get (error log below the screenshot). I included the screenshot to show what is currently running in the cluster.
2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] deployment.apps/cluster-autoscaler configured
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] Error from server: error when creating ".orbit.out/orbit/kubectl/kube-system/00-observability.yaml": admission webhook "0500-amazon-eks-fargate-configmaps-admission.amazonaws.com" denied the request: Invalid value at auto_create_group On
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,079][sh.py : 30]
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] Traceback (most recent call last):
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/bin/codeseeder", line 8, in
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] sys.exit(main())
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] cli()
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in call
This is not an error we've encountered before, and I was unable to reproduce it. I suppose you could try commenting out or removing line 21 from cli/aws_orbit/data/kubectl/kube_system/00-observability.yaml,
but I don't know why this would be necessary. Haven't you already done a deployment?
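If it helps narrow this down, you could check what is actually on and around that line in the template from your clone of the repo, e.g.:
sed -n '15,30p' cli/aws_orbit/data/kubectl/kube_system/00-observability.yaml
and list the admission webhook configurations on the cluster to confirm which one is doing the rejecting:
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations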