[BUG] - error: a container name must be specified for pod istio-pilot
bozethe opened this issue · 28 comments
Describe the bug
The pods in the istio-system namespace report Running, but there are some errors when I check the logs. I checked the pilot pod logs and got the error "error: a container name must be specified for pod istio-pilot-557c68674-4vqcv, choose one of: [discovery istio-proxy]".
To Reproduce
Steps to reproduce the behavior:
Clone AWS Labs github repository
Install the CLI
Install AWS CodeSeeder
Generate a new manifest
Deploy a new foundation
Deploy a new toolkit
Deploy credentials
Deploy docker images
Deploy environment
Expected behavior
The pilot pod shows as Running, but its logs show the error below.
The local-gateway is waiting on pilot to be available.
kubectl logs cluster-local-gateway-584fd5c966-nlrdc -n istio-system -f
Additional context
orbit-workb.txt
that particular error message isn't in the pod logs, it is an error being returned by the kubectl command attempting to read the logs. the istio-pilot pod has two containers that can contain logs named "discovery" and "istio-proxy". you have to specify the container name when requesting the logs:
kubectl logs istio-pilot-557c68674-4vqcv -n istio-system -c discovery
or
kubectl logs istio-pilot-557c68674-4vqcv -n istio-system -c istio-proxy
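As a rough illustration of the point above, the container names can be read straight out of the kubectl error text and turned into one logs command per container. This is only a sketch: the helper names are made up for this example, and it assumes the error message keeps the "choose one of: [...]" shape quoted in the issue.

```python
import re
import shlex
from typing import List


def containers_from_error(message: str) -> List[str]:
    """Extract the container names kubectl lists after 'choose one of:'."""
    match = re.search(r"choose one of: \[([^\]]*)\]", message)
    return match.group(1).split() if match else []


def logs_commands(pod: str, namespace: str, message: str) -> List[str]:
    """Build one 'kubectl logs' command per container named in the error."""
    return [
        shlex.join(["kubectl", "logs", pod, "-n", namespace, "-c", name])
        for name in containers_from_error(message)
    ]


# Error text and pod name copied from the issue description.
error = (
    "error: a container name must be specified for pod "
    "istio-pilot-557c68674-4vqcv, choose one of: [discovery istio-proxy]"
)
for command in logs_commands("istio-pilot-557c68674-4vqcv", "istio-system", error):
    print(command)
```

The commands are only printed, not executed, so they can be reviewed before being pasted into a shell.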
Thanks, I see it's working fine. I have made lots of changes so far to get orbit to work in our environment. The aws-ingress-controller 1.1.5 that is deployed by orbit wasn't working for us. It failed to discover the subnets in which to launch the ALB even though the tags were correct. I ended up deploying aws-ingress-controller 2.4.2 and it seems to work fine. I also had to recreate the ingress.
Below is a screenshot of our ingress.
Below is what I get when trying to access the orbit login page.
I also checked the orbit landing page logs, and below is the error (KeyError: 'HTTP_X_AMZN_OIDC_DATA').
nding-page-service.orbit-system.svc.cluster.local:80/orbit/*", "X-Istio-Attributes": "CjIKGGRlc3RpbmF0aW9uLnNlcnZpY2UubmFtZRIWEhRsYW5kaW5nLXBhZ2Utc2VydmljZQovCh1kZXN0aW5hdGlvbi5zZXJ2aWNlLm5hbWVzcGFjZRIOEgxvcmJpdC1zeXN0ZW0KTwoKc291cmNlLnVpZBJBEj9rdWJlcm5ldGVzOi8vaXN0aW8taW5ncmVzc2dhdGV3YXktNzc3YjU0ZDk2OC1zanp0ci5pc3Rpby1zeXN0ZW0KTwoXZGVzdGluYXRpb24uc2VydmljZS51aWQSNBIyaXN0aW86Ly9vcmJpdC1zeXN0ZW0vc2VydmljZXMvbGFuZGluZy1wYWdlLXNlcnZpY2UKUQoYZGVzdGluYXRpb24uc2VydmljZS5ob3N0EjUSM2xhbmRpbmctcGFnZS1zZXJ2aWNlLm9yYml0LXN5c3RlbS5zdmMuY2x1c3Rlci5sb2NhbA==", "X-B3-Traceid": "ebc065a24b50007fca4ba1ba30d3ce28", "X-B3-Spanid": "ca4ba1ba30d3ce28", "X-B3-Sampled": "1", "X-Envoy-Original-Path": "/orbit/login", "Content-Length": "0"}
[2022-08-02 06:42:41 +0000] [8] [ERROR] Error handling request /login
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 136, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/var/orbit-controller/orbit_controller/server.py", line 44, in login_request
    return login(logger=app.logger, app=app)
  File "/var/orbit-controller/orbit_controller/home.py", line 47, in login
    email, username, groups = _get_user_info_from_jwt(logger)
  File "/var/orbit-controller/orbit_controller/home.py", line 119, in _get_user_info_from_jwt
    encoded_jwt = request.headers["x-amzn-oidc-data"]
  File "/usr/local/lib/python3.8/site-packages/werkzeug/datastructures.py", line 1463, in __getitem__
    return _unicodify_header_value(self.environ["HTTP_" + key])
KeyError: 'HTTP_X_AMZN_OIDC_DATA'
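The KeyError above means the request reached the landing page without an x-amzn-oidc-data header. One possible reading, not confirmed in this thread, is that the request did not pass through an ALB listener rule that performs authentication, since the ALB only adds that header after authenticating the user. The sketch below shows a defensive way to read the header and decode the JWT payload. The function name and the fake token are hypothetical, this is not the code in home.py, and the signature is deliberately NOT verified here, which real code must do.

```python
import base64
import json
from typing import Mapping, Optional

OIDC_HEADER = "x-amzn-oidc-data"


def decode_oidc_payload(headers: Mapping[str, str]) -> Optional[dict]:
    """Return the JWT payload from the ALB header, or None if it is absent.

    Assumption: no signature verification is done; this is for debugging only.
    """
    encoded_jwt = headers.get(OIDC_HEADER)
    if encoded_jwt is None:
        return None  # header missing, e.g. the request skipped ALB authentication
    parts = encoded_jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT (header.payload.signature)")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore base64 padding if it was stripped
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64(obj: dict) -> str:
    """Encode a dict as base64url JSON, used only to build a fake demo token."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).decode()


# A fake, unsigned token for illustration; the claim names are made up.
fake_jwt = ".".join([_b64({"alg": "ES256"}), _b64({"email": "orbit@example.com"}), "sig"])

print(decode_oidc_payload({}))                       # header absent
print(decode_oidc_payload({OIDC_HEADER: fake_jwt}))  # header present
```

Note that a plain dict is case-sensitive, unlike Flask's request.headers, so the demo uses the lowercase header name throughout.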
If I put the alb url without "/orbit/login", below is what we get
I am attaching our orbit ingress yaml here so you can check whether it is missing anything.
ingress.yaml.txt
what are you using for your Identity Provider? if you haven't integrated with your own IdP and are using Cognito then you need to go into Cognito in the AWS Console, find the UserPool for your deployment (it will be named orbit-[ENV_NAME]-user-pool) and then make sure there is an orbit user. You will also need to create Groups in Cognito that match the Teams in your manifest. Groups should be named [ENV_NAME]-[TEAM_NAME]. Once Groups are created add the orbit User to the Groups
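The naming rules above can be written down as a small check. This is a sketch only: the function names are invented for this example, and the environment and team names are example values, not something read from a real manifest or from Cognito.

```python
from typing import Iterable, List


def user_pool_name(env_name: str) -> str:
    """User pool name pattern described above: orbit-[ENV_NAME]-user-pool."""
    return f"orbit-{env_name}-user-pool"


def group_name(env_name: str, team_name: str) -> str:
    """Cognito group name pattern described above: [ENV_NAME]-[TEAM_NAME]."""
    return f"{env_name}-{team_name}"


def missing_groups(env_name: str, team_names: Iterable[str],
                   existing_groups: Iterable[str]) -> List[str]:
    """List the expected groups that are not present in the user pool."""
    existing = set(existing_groups)
    return [
        group_name(env_name, team)
        for team in team_names
        if group_name(env_name, team) not in existing
    ]


# Example values only: an environment called "orbit" with one team.
print(user_pool_name("orbit"))
print(missing_groups("orbit", ["sample-admin"], existing_groups=[]))
```

Comparing the output with the groups actually listed in the Cognito console is a quick way to spot a naming mismatch.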
also, you will need to logout and log back in to trigger creation of the Team Space. you can force this by visiting the /orbit/logout URL
that's exactly what you should see. from there you can click on "Notebook Servers" on the left to launch or connect to previously launched notebooks.
we've not encountered a 503 on the status check before. we have seen complex network setups cause timeouts, but not a 503 returned by the jupyter server. did the testing-0 notebook/pod ever start? there are some advanced things you can do to enable --debug mode on the jupyter server, which may give additional info in the server logs.
some assumptions about your setup i'm making based on the info in the screenshots: the Team is named tieho and the User is boqo. these are important. if this is correct, then in addition to the User specific namespace called tieho-boqo there should also be a namespace called tieho.
some info, we use a CRD (CustomResourceDefinition) and custom Operator to make Orbit specific changes to the jupyter Pods when they start up. the CRD is called a PodSetting. each Team gets a PodSetting that can be configured to make changes specific to the Team. to enable --debug logging on new Pods, we can modify the PodSetting for the Team and add a --debug command line parameter. all new notebooks will be started with the new parameter and should have additional logging.
you can try to enable debug logging, i'm not sure what you'll find.
kubectl edit podsettings -n tieho orbit-pod-interactive-notebook
this will start an edit session in vi for the PodSetting
- locate the following line in the command section:
  - /usr/local/bin/start.sh jupyter lab --ServerApp.notebook_dir=/home/jovyan --ServerApp.ip=0.0.0.0
- press <esc> then i to enter insert mode
- add the --debug flag to the command:
  - /usr/local/bin/start.sh jupyter lab --debug --ServerApp.notebook_dir=/home/jovyan --ServerApp.ip=0.0.0.0
- press <esc> then :wq<enter> to write (save) and quit the edit session
then try creating a new Notebook through the UI. once it starts, you can get the logs for it with kubectl again.
Any Notebooks you create for users on that Team will now have debug logs. you can undo this by repeating the procedure and removing the --debug flag.
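The edit described above is a one-token change to the start command. A minimal sketch of that change and its undo, treating the command as a plain string; the helper names are hypothetical and this does not talk to the cluster or modify the PodSetting itself.

```python
def add_debug_flag(command: str) -> str:
    """Insert --debug right after 'jupyter lab', unless it is already there."""
    marker = "jupyter lab"
    if "--debug" in command.split():
        return command
    if marker not in command:
        raise ValueError("command does not contain 'jupyter lab'")
    return command.replace(marker, marker + " --debug", 1)


def remove_debug_flag(command: str) -> str:
    """Drop every standalone --debug token from the command."""
    return " ".join(part for part in command.split() if part != "--debug")


# Command line copied from the PodSetting shown in this thread.
original = (
    "/usr/local/bin/start.sh jupyter lab "
    "--ServerApp.notebook_dir=/home/jovyan --ServerApp.ip=0.0.0.0"
)
with_debug = add_debug_flag(original)
print(with_debug)
print(remove_debug_flag(with_debug))
```

Only Notebooks started after the PodSetting is changed pick up the new command, as noted above.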
Hi @chamcca, sorry for late reply. I was on leave.
I checked the namespaces I have and below is what I see.
Below is snippet from our manifest.yaml file. The Team name is sample-admin not tieho
I also checked if there are any resources in sample-admin or sample-admin-orbit namespaces but nothing.
❯ kubectl get all -n sample-admin-orbit
No resources found in sample-admin-orbit namespace.
I checked the podsettings we have in our cluster and below is what I see.
Are we supposed to have pods running in these other namespaces (sample-admin or sample-admin-orbit)?
If possible, could you please send a screenshot of what your EKS cluster looks like?
I will try to enable debugging and report back.
I have enabled debug mode by editing "kubectl edit podsettings -n sample-admin orbit-pod-interactive-notebook" and below is how it looks now.
However the logs are still showing the same.
I deleted all the "Notebook Servers" and created a new one called debug. see below.
I also attached our manifest file. Could you please check if it's fine?
our-manifest.yaml.txt
it appears as though you created another User (tieho-boqo) through the kubeflow onboarding mechanism. this User is not a member of a TeamSpace and is not managed by Orbit. you should be using the sample-admin-orbit User to start your Notebooks. Notebooks under the tieho-boqo User/namespace will not work.
revisit the /orbit/login page. you should be shown the "Welcome orbit!" message for your user (orbit) and the teams that you belong to (sample-admin). click on the Kubeflow icon next to the Team on this page to be taken to the Kubeflow UI for that Team/User. from there, access the Notebook Servers UI and try to start a Notebook. this will start in the sample-admin-orbit namespace which is managed by Orbit and will apply PodSettings to correctly configure the Notebook.
I logged out and logged in again,
clicked on the Kubeflow logo, and am now creating a notebook server.
This is our Cognito user pool.
Still getting
upstream connect error or disconnect/reset before headers. reset reason: connection failure
I believe there is something I am not understanding here.
Would you mind having a 10-minute Zoom/Teams call?
you're still creating Notebooks in the wrong namespace (tieho-boqo). after logging in, before clicking on the Notebook Servers link, try clicking the drop down box in the upper left in the primary kubeflow UI. where it says "tieho-boqo (owner)". if your teams are setup correctly you should be able to switch to the sample-admin-orbit namespace.
Notebooks created in the tieho-boqo namespace will continue to fail. an alternative is to completely delete this namespace and all resources so that you don't get pushed to it by the UI. that can be done from kubectl. kubeflow creates a CRD called a "Profile". if you list the profiles using:
kubectl get profiles
you likely have 3: anonymous, sample-admin-orbit, tieho-boqo. deleting the tieho-boqo can be done with:
kubectl delete profiles tieho-boqo
this should leave your cognito orbit user assigned to a single profile, sample-admin-orbit.
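As a sketch of the cleanup described above, the profile list can be filtered down to the entries that are not managed by Orbit, producing the matching delete commands. The helper names are hypothetical, and keeping the anonymous profile is an assumption made here because the thread only asks for tieho-boqo to be deleted.

```python
from typing import Iterable, List


def stray_profiles(profiles: Iterable[str], managed: Iterable[str]) -> List[str]:
    """Return profiles that are neither Orbit-managed nor 'anonymous'.

    Assumption: 'anonymous' is always kept, since only tieho-boqo is removed above.
    """
    keep = set(managed) | {"anonymous"}
    return [name for name in profiles if name not in keep]


def delete_commands(names: Iterable[str]) -> List[str]:
    """Build one 'kubectl delete profiles' command per stray profile."""
    return [f"kubectl delete profiles {name}" for name in names]


# Profile names as listed in this thread (output of 'kubectl get profiles').
found = ["anonymous", "sample-admin-orbit", "tieho-boqo"]
for command in delete_commands(stray_profiles(found, managed=["sample-admin-orbit"])):
    print(command)
```

The commands are printed rather than run, so nothing is deleted until they are reviewed and executed by hand.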
I tried clicking the drop down arrow under my name "tieho-boqo owner" but there was nothing.
My cognito user group is "orbit-sample-admin". Is that ok or is it supposed to be "sample-admin-orbit"?
I deleted tieho-boqo profile and the pods that were running in the tieho-boqo namespace are gone.
below is what i get when trying to login now.
I checked the namespaces and noticed that the sample-admin-orbit and sample-admin namespaces still existed after deleting the profile. I deleted them, then logged out and logged in on the orbit home page, and the above is what I get. I tried several times and I see it just created the namespace "sample-admin-orbit" again, but it still shows "Timeout while waiting for namespace creation. Something went wrong!! Consult the Orbit namespace watcher logs."
I see where the other namespace came from. accessing the orbit url without /orbit/login.
❯ kubectl get teamspace -n sample-admin -o yaml
apiVersion: v1
items:
- apiVersion: orbit.aws/v1
  kind: TeamSpace
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"orbit.aws/v1","kind":"TeamSpace","metadata":{"annotations":{},"name":"sample-admin","namespace":"sample-admin"},"spec":{"env":"orbit","space":"team","team":"sample-admin"}}
    creationTimestamp: "2022-08-10T17:16:54Z"
    generation: 1
    name: sample-admin
    namespace: sample-admin
    resourceVersion: "12063392"
    uid: 60d7124f-99ab-4d58-848d-573001ac89ac
  spec:
    env: orbit
    space: team
    team: sample-admin
kind: List
metadata:
  resourceVersion: ""
creating namespaces with the kubeflow Namespace UI is not supported, these namespaces will not function. team namespaces are created when orbit deploy teams is run. user namespaces are created when a user first visits /orbit/login immediately after logging in. unfortunately we don't have a way of disabling creation with the Namespace UI.
deleting the sample-admin namespace broke some things. this namespace belongs to the sample-admin Team and is created when the Orbit Teams are deployed. we can clean this up, though.
- visit the /orbit/logout url. browsers sometimes cache this page, so you may need to refresh it to force the logout. if it routes you to /orbit/login and/or presents a login screen, do not login again yet.
- we're going to clean up the user and team namespaces and profiles. some of these commands may fail if the resources don't exist, that's ok. you should also delete any namespaces and profiles that were created as a result of using the kubeflow Namespace UI
  kubectl delete namespace sample-admin-orbit
  kubectl delete namespace sample-admin
  kubectl delete profile sample-admin-orbit
- now we need to repair the Team namespace and teamspace resources:
this will recreate the kubernetes resources required for your sample-admin team
orbit deploy teams -f [path_to_your_manifest]
- now we login to the UI again and force creation of the user namespace and userspace. visit the /orbit/login url, if logout was successful you should be presented with username and password prompt. you may need to force a refresh. we need to ensure that the username/password prompt is presented again and the user is authenticated again. this is what triggers creation of the user's namespace
- once logged in, you should be presented with the sample-admin Team on the /orbit/login page, click on the Kubeflow icon next to this Team
- in the kubeflow UI, verify in the drop down in the upper left that you are in the sample-admin-orbit namespace
- click Notebook Servers and try to create a notebook
I have just broken and then followed this procedure on my development cluster to confirm it.
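The command-line part of the procedure above (cleanup, then redeploying the teams) can be sketched as an ordered list of commands. The function names and the manifest path are placeholders invented for this example. The user namespace is assumed to follow the [TEAM]-[USER] pattern seen in this thread (sample-admin-orbit). The browser steps (logout, login, picking the namespace) still have to be done by hand.

```python
import subprocess
from typing import List


def repair_commands(team: str, user: str, manifest_path: str) -> List[List[str]]:
    """Commands for the cleanup and redeploy steps, in the order given above."""
    user_space = f"{team}-{user}"  # assumed naming, e.g. sample-admin-orbit
    return [
        ["kubectl", "delete", "namespace", user_space],
        ["kubectl", "delete", "namespace", team],
        ["kubectl", "delete", "profile", user_space],
        ["orbit", "deploy", "teams", "-f", manifest_path],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; when dry_run is False, also execute it."""
    for command in commands:
        print(" ".join(command))
        if dry_run:
            continue
        result = subprocess.run(command, check=False)
        # The kubectl deletes may fail if the resource is already gone; that is ok.
        if result.returncode != 0 and command[0] != "kubectl":
            raise RuntimeError(f"command failed: {' '.join(command)}")


# "manifest.yaml" is a placeholder path, not the real manifest location.
run_all(repair_commands("sample-admin", "orbit", "manifest.yaml"))
```

The default is a dry run, so calling it as shown only prints the four commands.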
I have tried the steps you provided above but the problem still exists. I also deleted the groups, created a new Cognito user pool and tested again, but the problem persists.
This is our cognito user pool name
User group. I followed the naming you suggested
My new teams config
Teams:
- Name: oryx
  Policies:
  - None
  GrantSudo: true
  Fargate: true
  K8Admin: true
  JupyterhubInboundRanges:
  - 0.0.0.0/0
  EfsLifeCycle: AFTER_7_DAYS
  Plugins: !include common_plugins.yaml
  AuthenticationGroups:
  - orbit-oryx
  - None
landing page logs show that the groups are not being returned
were you able to resolve the missing teams/groups?
no, we still have the same problem. we are thinking of hard-coding the teams/groups for testing purposes. I am not sure whether changing the code in home.py around line 45 (the login function) will help.
Do you have any other suggestions for us to make this work?
I managed to solve the problem. I deleted the orbit-system namespace and redeployed env and teams stack.
I am now getting error
message: 'Internal error occurred: failed calling webhook "imagereplication-pod-webhook.orbit-system.svc":
Post "https://imagereplication-pod-webhook.orbit-system.svc:443/update-pod-images?timeout=30s":
service "imagereplication-pod-webhook" not found'
when you deleted the orbit-system namespace you removed the webhook declaration and pods. the image-replicator webhook updates Pod images to point to ECR and initiates a replication of public images into ECR for environments with isolated subnets. in theory, redeploying the env should have recreated them, but i've never attempted that.
Thank you for your continued help. I will remove the teams and env stacks, redeploy them, and see if everything is ok.
Is there a way to set up a proxy for image pulling for pods?
this is usually a networking issue between EFS and the Nodes or Pods in the cluster. did you deploy EFS with the orbit deploy foundation or did you deploy your own EFS? can you confirm which Subnets the EFS shares are in and that they are the same as the Nodes/Pods. also check the Security Group attached to the EFS share and the Inbound rules on it. i know that you are operating in isolated subnets that were deployed outside the orbit tooling, there may just be something we need to get "hooked" up with the security groups.
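The Inbound rule check suggested above can be sketched as a function over rule data shaped like the IpPermissions list returned by EC2's describe-security-groups. EFS mounts use NFS on TCP port 2049, so that is the port checked. The function name and the security group IDs are made up for this example, and only rules that reference a source security group are considered, not CIDR-based rules.

```python
from typing import Any, Dict, List

NFS_PORT = 2049  # EFS mount targets accept NFS traffic on TCP 2049


def allows_nfs_from(ip_permissions: List[Dict[str, Any]], source_group_id: str) -> bool:
    """True if any inbound rule lets source_group_id reach TCP 2049.

    Assumption: only UserIdGroupPairs (group-to-group rules) are inspected.
    """
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            port_ok = True
        elif protocol == "tcp":
            port_ok = rule.get("FromPort", 0) <= NFS_PORT <= rule.get("ToPort", 65535)
        else:
            continue
        sources = {pair.get("GroupId") for pair in rule.get("UserIdGroupPairs", [])}
        if port_ok and source_group_id in sources:
            return True
    return False


# Hypothetical rule set for the security group attached to the EFS share.
efs_inbound = [
    {"IpProtocol": "tcp", "FromPort": 2049, "ToPort": 2049,
     "UserIdGroupPairs": [{"GroupId": "sg-nodes-example"}]},
]
print(allows_nfs_from(efs_inbound, "sg-nodes-example"))
print(allows_nfs_from(efs_inbound, "sg-other-example"))
```

A False result for the Nodes' security group would point at the kind of missing "hook up" between security groups mentioned above.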
I managed to fix the problem, I was hitting this bug (kubernetes-sigs/aws-efs-csi-driver#214)