odpi/egeria-ui

Cannot login to UI (helm chart)

planetf1 opened this issue · 36 comments

egeria-ui 4.0.1 was recently announced.
The currently released egeria charts uses 3.2.0 and continues to work.
However for testing I tried running with version 4.0.1 of the UI

For example:

 helm install lab egeria/odpi-egeria-lab --set jupyter.tokenPlain='s3cr3t!'  --set-string image.uistatic.tag=4.0.1

I then ran the configure/start/building a data catalog notebooks

With the ports forwarded, an attempt to go to the nginx port (https://localhost:443) results in a continually 'loading' window
Screenshot 2022-09-23 at 15 30 04

I would presume we need some updates for the new UI to work correctly - configuration?

@sarbull @lpalashevski

The only activity seen in the nginx pod is

127.0.0.1 - - [23/Sep/2022:14:26:48 +0000] "GET /nginx_status/ HTTP/1.1" 200 640 "-" "Sysdig Agent/1.0"
127.0.0.1 - - [23/Sep/2022:14:26:49 +0000] "GET /nginx_status/ HTTP/1.1" 200 640 "-" "Sysdig Agent/1.0"
127.0.0.1 - - [23/Sep/2022:14:26:50 +0000] "GET /nginx_status/ HTTP/1.1" 200 640 "-" "Sysdig Agent/1.0"
172.17.57.135 - - [23/Sep/2022:14:26:50 +0000] "GET / HTTP/1.0" 200 640 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
172.17.57.135 - - [23/Sep/2022:14:26:50 +0000] "GET /manifest.json HTTP/1.0" 200 519 "https://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
172.17.57.135 - - [23/Sep/2022:14:26:50 +0000] "GET /static/css/main.8a4273b5.css HTTP/1.0" 200 190018 "https://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
172.17.57.135 - - [23/Sep/2022:14:26:50 +0000] "GET /static/js/main.d3c2741a.js HTTP/1.0" 200 3873780 "https://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
172.17.57.135 - - [23/Sep/2022:14:26:50 +0000] "GET /favicon.ico HTTP/1.0" 200 15406 "https://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
172.17.57.135 - - [23/Sep/2022:14:26:50 +0000] "GET /egeria-logo.svg HTTP/1.0" 200 6658 "https://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
127.0.0.1 - - [23/Sep/2022:14:26:51 +0000] "GET /nginx_status/ HTTP/1.1" 200 640 "-" "Sysdig Agent/1.0"
127.0.0.1 - - [23/Sep/2022:14:26:52 +0000] "GET /nginx_status/ HTTP/1.1" 200 640 "-" "Sysdig Agent/1.0"

Whilst the most recent log from the backend UI container ends:

2022-09-23 14:14:07.518 - INFO 1 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8443 (https) with context path ''
2022-09-23 14:14:07.579 - INFO 1 --- [           main] o.o.o.u.u.springboot.EgeriaUIPlatform    : Started EgeriaUIPlatform in 30.35 seconds (JVM running for 32.378)

The UI itself may also need a code enhancement so that it does not hang indefinitely if misconfigured - a timeout/error? @sarbull

We will target this to coincide with the 3.13 release (work during Oct, release by ~ 1 Nov)

Plan to revisit this in time for the 3.13 release since the new UI should now be ready.

Likely there are some minor changes needed relating to the environment

I took a look at this again now 3.13 is released.

Installed with the additional config file (add -f ~/thisfile.yaml)

image:
  uistatic:
    tag: "4.1.0"

I get an nginx 'Bad gateway' error on a service pointing to the nginx pod (port 8443)

the deployed configuration on nginx is:

server {

    listen                8443 ssl;
    #listen                 80;
    server_name           lab-nginx;
    ssl_certificate       /etc/nginx/ssl/tls.crt;
    ssl_certificate_key   /etc/nginx/ssl/tls.key;
    ssl_password_file     /etc/nginx/pass/pass.txt;


    #root /var/www/;
    #index index.html;

    # Force all paths to load either itself (js files) or go through index.html.
    location /api {
        proxy_pass https://lab-ui:8443;
        proxy_set_header Host $http_host;
        proxy_ssl_verify       off;
        proxy_ssl_session_reuse on;
        proxy_ssl_server_name on;
    }

    location / {
        proxy_pass http://lab-uistatic:8080;
        proxy_set_header Host $http_host;
        proxy_ssl_verify       off;
        proxy_ssl_session_reuse on;
        proxy_ssl_server_name on;
    }

}

Checking within the nginx container we see we cannot connect to the latter:

1000660000@lab-odpi-egeria-lab-nginx-6fbf4bc8bc-4kgff:/etc/nginx/conf.d$ curl http://lab-uistatic:8080
curl: (7) Failed to connect to lab-uistatic port 8080: Connection refused

Looking at the container definition (in our charts) for egeria-ui, it only declares 8080 as the exposed port.

HOWEVER in its configuration we have:

# SPDX-License-Identifier: Apache-2.0
# Copyright Contributors to the Egeria project

server {
  listen 80;
  server_name _;

  root /var/www/;
  index index.html;

  # Force all paths to load either itself (js files) or go through index.html.
  location / {
    try_files $uri /index.html;
  }
}

Inspecting the container image, in our staticui layer, only port 80 is exposed, and the docs in the embedded metadata refer to port 80. This is also true for the base nginx image used by the UI.

As such it seems the //intent// is now for the staticui container to expose port 80 (not 8080)

If we compare this with the older UI (version 3.2.0) used in the helm charts up until 3.2.0 we see the config file is:

$ cat staticui.conf
# SPDX-License-Identifier: Apache-2.0
# Copyright Contributors to the Egeria project

server {
    listen 8080;
    server_name _;

    root /var/www/;
    index index.html;

    # Force all paths to load either itself (js files) or go through index.html.
    location / {
        try_files $uri /index.html;
    }
}

confirming the 8080 port was correct then, and indeed the UI works correctly.

The fix therefore is to either

  • change the container to work with port 8080 by default
  • change the k8s deployment to use port 80

(Aside: We could also get rid of the additional nginx component, but this can be a separate issue for optimization)

Since remapping is simple, the minimal change is the latter. I will do this. Any subsequent change can be done via a new issue/PR cc: @lpalashevski @sarbull
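Remapping in the k8s deployment is just a matter of the Service's targetPort; a sketch under assumed names (the `lab-uistatic` name follows the lab release, the selector label is an assumption, not the chart's actual label):

```yaml
# Illustrative Service fragment: other pods (nginx) keep connecting on 8080,
# while targetPort points at 80, where the 4.x container actually listens.
apiVersion: v1
kind: Service
metadata:
  name: lab-uistatic
spec:
  selector:
    app.kubernetes.io/component: uistatic   # assumption: label used by the chart
  ports:
    - name: http
      port: 8080       # port clients connect to
      targetPort: 80   # port the container listens on
```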

I've added a conditional check. If the version is >=4 we basically use the port 80 setup; otherwise we use port 8080. This should work for both the old and new UI. Probably unnecessary, but it makes it easier to test the new UI in the charts. May then remove in a future release.
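The conditional could look something like this in the chart template (a sketch, not the chart's actual code; the value path follows the `image.uistatic.tag` setting used earlier, and `semverCompare` is a standard Helm/sprig function):

```yaml
# Illustrative Helm template fragment: pick the container port
# based on the uistatic image tag (>= 4 listens on 80, older on 8080).
ports:
  - name: http
    {{- if semverCompare ">=4.0.0-0" (.Values.image.uistatic.tag | toString) }}
    containerPort: 80
    {{- else }}
    containerPort: 8080
    {{- end }}
```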

Having implemented this check, we still cannot run with the container exposing port 80, since we need (want) to run our charts without root privileges.

i.e. the pods running on k8s will not be running as root, and therefore cannot bind to ports under 1024.

This is in fact why we ran on 8080 in the first place

The docker container should set up the default to run on a port not requiring root, optionally providing the ability to configure this differently if required.

If this is changed to port 8080, no change to the charts is then required.

Therefore reassigning this to the egeria-ui component

@sarbull @lpalashevski We need to change the default port for the docker container produced by egeria UI to > 1024, and ideally 8080 to keep it the same as the previous UI

I am happy to make the change, or you can. This requires a change to the Dockerfile & etc/nginx.conf - I don't think anything else. Happy to make this change and test in k8s if you like.
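The container-side fix is essentially a one-line change to the server block, mirroring the 3.2.0 config shown earlier (a sketch of the requested change, not the final file):

```nginx
# Sketch: the UI container's nginx server block, changed to listen on an
# unprivileged port so the pod can run as non-root (ports < 1024 need root).
server {
    listen 8080;          # was: listen 80;
    server_name _;

    root /var/www/;
    index index.html;

    # Force all paths to load either itself (js files) or go through index.html.
    location / {
        try_files $uri /index.html;
    }
}
```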

Confirmed on openshift dashboard

  • 3.x container is using port 8080, 4.x container is using 80
  • the pod definition in both cases (same chart!) is using 8080

In both cases our nginx related files in the helm definition create config files

/etc/nginx.conf is set to fixed content from within 'egeria-uistatic.yaml'. This is done by creating a configmap, which is mounted as that file in the container filesystem:

    # SPDX-License-Identifier: Apache-2.0
    # Copyright Contributors to the Egeria project.
    worker_processes  auto;
    error_log  /var/log/nginx/error.log notice;
    events {
      worker_connections  1024;
    }
    pid        /tmp/nginx.pid;
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;
      log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for"';
      access_log  /var/log/nginx/access.log  main;
      sendfile        on;
      keepalive_timeout  65;
      client_body_temp_path /tmp/client_temp;
      proxy_temp_path       /tmp/proxy_temp_path;
      fastcgi_temp_path     /tmp/fastcgi_temp;
      uwsgi_temp_path       /tmp/uwsgi_temp;
      scgi_temp_path        /tmp/scgi_temp;
      include /etc/nginx/conf.d/*.conf;
    }

The actual server definition is then built by creating a template for nginx. This is taken from etc/staticui.conf.template, which is placed into a volume mounted at /etc/nginx/templates:

# SPDX-License-Identifier: Apache-2.0
# Copyright Contributors to the Egeria project

server {
    listen 8080;
    server_name _;

    root /var/www/;
    index index.html;

    # Force all paths to load either itself (js files) or go through index.html.
    location / {
        try_files $uri /index.html;
    }
}

This is then copied by the nginx container at startup, with variable replacement, and ends up with a correct port 8080 configuration

However it seems that whilst the old nginx image does this templating, the new version does not... and therein lies the reason that nginx starts on the wrong port.

Need to look into why templating isn't working / what's changed.

Alternatively we could build explicit config files from within the helm template.

In the old image, we left the entrypoint as is:
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh

When we look at the cmd/entrypoint definitions.

Old:

            "Cmd": [
                "nginx",
                "-g",
                "daemon off;"
            ],
            "Entrypoint": [
                "/docker-entrypoint.sh"

New:

            "Entrypoint": [
                "nginx",
                "-g",
                "daemon off;"
            ],

In the new image the entrypoint is overwritten by the Dockerfile:
ENTRYPOINT ["nginx", "-g", "daemon off;"]

Since we are reusing a standard image, which already defines a good default entrypoint, we have no need to override

Two other observations

  • We do write a default.conf file to the image. This will be overwritten in the k8s environment, but will be applied for standalone containers. It is configured for port 80. I will not modify it, but an issue should be raised to document the behaviour - and change the default if needed
  • This container build is slow for multi-arch, as a full npm build & container build is done each time. Since all we are doing is copying built files, we could do either of these to reduce the additional container build from ~10 minutes (on an M1 mac) to seconds:
    a) Copy files from the build directory. This does make the docker build slightly less portable, as the host environment needs the npm tools
    b) Copy files in at runtime. But this means no pre-built image - good for maintenance, but probably less easy to use.
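For illustration, option (a) could look like the following (image tag and paths are assumptions, not the repo's actual layout); the npm build runs once on the host and the container build just copies the output:

```dockerfile
# Illustrative Dockerfile for option (a): copy host-built files.
# Prerequisite on the host: npm ci && npm run build
FROM nginx:stable
COPY build/ /var/www/
COPY etc/staticui.conf /etc/nginx/conf.d/default.conf
EXPOSE 8080
```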

For now I will not change or raise issues on the above.

Note that any change to the container will need a 'release' to push

👀

Thanks, I'm going to update the charts to a prerelease level to include the current UI

@sarbull I'll need a new container image published for egeria-ui with those fixes before it makes sense to merge the chart update.

ok, will do, i'll let you know

Agreed on community call 20230201 that we'd like to add this into v4

cc: @lpalashevski

Now that an image is available I have merged the change I made to the charts

Current status

  • Can connect to ui (static) pod 8443 & get login panel
  • admin/secret does not work

@lpalashevski

We now have additional fixes for the ui-chassis which are being integrated into release 4

This is currently building, will be merged, and then I will rerun the release pipeline. We should then have a complete v4 build later.

There is also a js commons fix made under

We will need a new 'egeria-ui' container image, ideally at a release version (or make it a prerelease, we test, then repeat with final version). Can you arrange this @sarbull @lpalashevski @bogdan-sava

The charts use nginx, and redirect /api to the ui-chassis container. I think we expect this to work ok with the above changes?

@bogdan-sava is everything in place now?

if so next steps for the UI are:

  • release a new version of egeria-js-commons
  • update the reference to egeria-js-commons in egeria-ui
  • release a new egeria-ui version, which triggers the docker image build

If all commits are in I can take over with the versions and build the image.

All good except for CORS. It works only with nginx or a similar solution.

@lpalashevski The release pipeline has been started at https://github.com/odpi/egeria/actions/runs/4545935086

We need an 'egeria-ui' container image (i.e. your first 2 bullets above)

I am looking at fixing the CORS issues also

If we're ok that the UI is good enough for now, it would be good to go with the v4 builds we have as final and fix CORS fully for 4.1?
I think (we think?) it should work with the charts now, since we use nginx there (and redirect /api to the ui chassis).
But we need to test before we know for sure.

Thanks for getting that underway :-)

We'll need to do the merge to get the containers published - the PR build just verifies the build, it does not push - nor can it (permissions). The merge will actually push them to quay.io & docker.io

Ok if we merge?

I have updated the charts to include this latest image, and to facilitate wider testing.

To install use:

helm repo update && helm install lab egeria/odpi-egeria-lab --devel --set jupyter.tokenPlain='s3cr3t!' --set imageDefaults.pullPolicy=Always

These additional parameters:
a) pick the prerelease charts
b) set a specific password for jupyter (easier)
c) ensure the image is always pulled. This is the default for snapshots, but for releases we allow caching; since this image has been built multiple times, so to be sure...

However, the UI is not working for me. I now get a blank screen when going to the static ui service (previously this was a login panel). Let's discuss/debug when free

Update - incognito in chrome/safari shows different behaviour, so the blank screen was probably caching/cookies related.

I can now login - so we know the uichassis is working

Now debugging a hang
Screenshot 2023-03-29 at 08 13 35

This may be config related. Perhaps view service configuration?

No warnings being seen in the ui chassis pod

The following environment is currently set (as from old UI):

OMAS_SERVER_NAME=cocoMDS1
OMAS_SERVER_URL=https://lab-datalake:9443
OPEN_LINEAGE_GRAPH_SOURCE=MAIN
OPEN_LINEAGE_SERVER_NAME=cocoOLS1
OPEN_LINEAGE_SERVER_URL=https://lab-datalake:9443

Similar properties still seem appropriate as per the application.properties for ui chassis

Enabled debug, restarted ui pod, all ok - was able to login as erinoverview & garygeeke & retrieve assets 'Week*'

On first login it looks like a fetch of types is failing.

This is after the pods all report as up - but may be timing related
Screenshot 2023-03-29 at 08 38 22
Screenshot 2023-03-29 at 08 39 09

However when in this state, a logout/login does not fix this problem.

  • the fetch of types worked second time around, but the UI remained in the original 'circling' state (with the logo/greyed out)
  • once in this state, a reload of the UI returns to the same page. Logout is greyed out, so at this point one is 'stuck'

The initial error is startup timing. Debug logs show:

2023-03-29 07:35:57.397 - INFO 1 --- [           main] o.o.o.u.u.springboot.EgeriaUIPlatform    : Started EgeriaUIPlatform in 60.405 seconds (process running for 61.971)
2023-03-29 07:37:56.824 -ERROR 1 --- [nio-8443-exec-2] o.o.o.c.ffdc.RESTExceptionHandler        : Detected Invalid Parameter Exception in REST Response

org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException: OMAG-MULTI-TENANT-404-001 The OMAG Server cocoMDS1 is not available to service a request from user erinoverview
        at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.throwInvalidParameterException(RESTExceptionHandler.java:289) ~[ffdc-services-4.0.jar!/:na]
        at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.detectAndThrowInvalidParameterException(RESTExceptionHandler.java:206) ~[ffdc-services-4.0.jar!/:na]
        at org.odpi.openmetadata.accessservices.assetcatalog.AssetCatalog.detectExceptions(AssetCatalog.java:307) ~[asset-catalog-client-4.0.jar!/:na]
        at org.odpi.openmetadata.accessservices.assetcatalog.AssetCatalog.getSupportedTypes(AssetCatalog.java:276) ~[asset-catalog-client-4.0.jar!/:na]
        at org.odpi.openmetadata.userinterface.uichassis.springboot.service.AssetCatalogOMASService.getSupportedTypes(AssetCatalogOMASService.java:236) ~[classes!/:na]

This is due to the 'cocoMDS1' server 'built in' to the UI

NOTE: Having TWO cocoMDS1 servers in the same deployment -- as is the case for coco pharma - is confusing. I think it should be named something else (but not for this release)

The timing issue can probably be put in the release notes for now, though we should consider how the UI should handle that situation more elegantly - and perhaps hint to the user to retry later. For most people experimenting, they'll likely start the UI after cocoMDS1 has started. But the UI should not hang without any exit route.

For proper deployments of egeria (a notebook-based approach is mostly suited to education and developers) & as we move forward with cloud-native work, we need to clearly define an appropriate health check, so that we can coordinate when pods are ready to do real work and have requests routed to them (a much bigger topic).
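As a sketch of the direction (the health endpoint and port here are assumptions, not the current chart's configuration), a k8s readiness probe would keep requests away from the UI pod until it can actually serve:

```yaml
# Illustrative readinessProbe for the ui-chassis pod. The /actuator/health
# path assumes the Spring Boot actuator is enabled, which may not be the case.
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8443
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 6
```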

Repeated the test, but left about 7 minutes after deployment before running the UI (this was after the configure/data catalog notebooks).

Was able to login to the UI first time.

However also noticed 'About' does not load - again the swirling progress indicator

In summary, I think the UI is now deployed ok, but there are a number of issues we should open, and document in the release notes for 4.0 - but practically, it's probably what we should go with for this release

@bogdan-sava @sarbull @lpalashevski Does this make sense? Can you check what behaviour you see?
I can open up the issues & add to release notes - or please add your text if you wish?

Also

  • clicking on the glossary icon on the left-hand side, when reviewing search results, takes one back to the login page
  • clicking on profile gives a blank page
  • the (i) button, similar to the 'About' menu item, gives a spinning circle

These may still be in development

I have not raised an issue on the blank profile page, as I presume this is not developed yet; at least it does not cause an error, it's just that there are no profile contents.

Given we can now deploy the UI, closing this issue

This should work now - there were some changes to the token implementation and CORS functionality.