Server enters a strange loop after adding a data set to a cockpit (NGINX 502 and 504 errors), SEVERE Java errors

Question

agaldemas opened this issue 3 years ago · 6 comments

Describe the bug
Using Knowage 7.4.5, deployed with the Helm chart.

When we add a data set to a cockpit that already has a selection widget, the cockpit update fails at save time and ends with 502 and 504 errors from NGINX, indicating that the Knowage server is out of order.

To Reproduce

Steps to reproduce the behavior:

Create a cockpit with a (REST) data set, add a table widget and a multi-selection selector.
Select all items with the selector and save the cockpit with the selection.
Back in edit mode, add a data set to the cockpit and save the modification.
The dialog box disappears and the cockpit tries to update, until error messages reporting a failure to get the data set appear in the web application, followed by some 502 Bad Gateway errors. It ends with an NGINX 504 Gateway Timeout error.

Expected behavior

Adding the data set to the cockpit should work and should not bring down the server.

Additional context
The problem shows up as SEVERE errors in the container logs:
knowage-main-0_knowage-main (1).log

Answer 1 · 2021-09-06T07:14:26.000Z

Hi,

Could you send us the knowage.log file from Tomcat's log directory?

You can use something like:
kubectl cp mylabel-knowage-main-0:/home/knowage/apache-tomcat/logs/knowage.log ./knowage.log

Here mylabel must be replaced with the label (the Helm release name) you used when deploying the Helm chart. See:
helm install mylabel .

Answer 2 · 2021-09-07T12:05:05.000Z

Hi Marco @kerny3d,

When the problem arises, nothing is written to knowage.log.

Answer 3 · 2021-09-07T12:14:30.000Z

It seems that the container was killed or crashed. If I remember correctly, the 502 error from NGINX simply means that the service exposing the Knowage main container had no pod to forward the request to, although it had one shortly before. That is different from the 504 error (gateway timeout), because with a 502 the gateway (NGINX) never reached the Knowage main container.

Could you monitor your cluster with kubectl or Lens to see whether the 502 coincides with a crash of the container? You could use something like:
kubectl get pod --watch
Then just wait to see whether the Knowage main pod is reported as 0/n ready.
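
To check for restarts after the fact instead of watching live, the container statuses in the pod's JSON can be summarised. This is only a sketch: summarize_pod is a hypothetical helper name, and it assumes its input is the output of kubectl get pod mylabel-knowage-main-0 -o json (pod name taken from the kubectl cp example earlier in this thread). It reads the standard status.containerStatuses fields (ready, restartCount, lastState.terminated).

```python
import json


def summarize_pod(pod_json):
    """Summarise container health from `kubectl get pod <name> -o json` output.

    Returns one dict per container with its ready flag, restart count and,
    if the container was terminated before, the last exit reason and code.
    """
    pod = json.loads(pod_json)
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is only present after a previous termination
        last = cs.get("lastState", {}).get("terminated") or {}
        summaries.append({
            "container": cs.get("name"),
            "ready": cs.get("ready", False),
            "restarts": cs.get("restartCount", 0),
            "last_exit_reason": last.get("reason"),
            "last_exit_code": last.get("exitCode"),
        })
    return summaries
```

A restart count that increases around the time of the 502 errors, together with a non-empty last exit reason, would suggest the container really was killed or crashed rather than just being slow.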

Answer 4 · 2021-09-09T10:41:44.000Z

Hello @kerny3d,

When the problem occurs, the knowage-main pod stays up (the Java process continues to run).

The NGINX 504 Gateway Timeout error appears when you try to reload the page.

The pod is restarted by Kubernetes after a while, and we then get 503 Service Temporarily Unavailable while the pod is restarting.

By the way, I don't understand how Kubernetes detects the problem, since the request used by the probes, https://<host>/knowage/restful-services/version, still works while the server no longer answers requests from the web application.
Even this morning, despite a restart, the Knowage server was still stale, and I had to restart the knowage-main pod.
So strange!
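
One way to pin down this mismatch would be to request the probe URL and a URL that the web application itself calls side by side, and compare the status codes. The following is a rough sketch only: probe and describe_status are hypothetical helper names, the 10-second timeout is arbitrary, and the second URL to compare against is left to the reader since none is given here. The status descriptions follow the usual HTTP meaning of 502, 503 and 504.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers (including 502/503/504 from NGINX) land here
        return err.code
    except OSError:
        # URLError and socket timeouts are both OSError subclasses
        return None


def describe_status(code):
    """Map a status code from probe() to a short human-readable label."""
    if code is None:
        return "no response (connection failed or timed out)"
    labels = {
        502: "bad gateway: the proxy got no valid response from the upstream",
        503: "service unavailable: no backend ready to serve the request",
        504: "gateway timeout: the upstream did not answer in time",
    }
    if code in labels:
        return labels[code]
    return "ok" if 200 <= code < 400 else "other error"
```

Calling describe_status(probe(...)) for both URLs every few seconds and logging the results with a timestamp would show whether the probe endpoint keeps answering while other requests time out.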

The SEVERE errors appear in the pod's log when the pod is stopped.

Note that this behavior is random: sometimes it works, but if you keep removing and adding a data set, the problem eventually happens.

Here is part of the pod's log, from Tomcat start to the stop of the pod:

knowage-main-0_knowage-main-1.log

Hope this helps! But don't burn your time on it: we'll upgrade to 7.5.9 as soon as it is available and check whether we can reproduce the problem with the Helm chart and with docker-compose, to compare.

Answer 5 · 2021-09-14T09:10:13.000Z

Hi @kerny3d, the same behavior occurs with auto-refresh of the same cockpit, which has 2 REST data sets and, per data set, 1 table widget, 1 chart widget and 1 selection widget.

This time I could grab knowage.log, where you can see cache access exceptions being raised before the server appears stale, but only from the cockpit.
See file:
knowage.log

However, only some of the widgets were not responding (those for the data set with the cache error, "Observation Qualité Air"). I could still preview data sets, and I checked that the CACHE_DATABASE connection (to knowage-cache) was OK.

Checking the logs of the knowage-cache pod, I noticed several errors like:
2021-09-14 8:36:18 1189868 [ERROR] Invalid (old?) table or database name 'lost+found'
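
To see how often such errors occur, the log lines can be grouped by message. This is a minimal sketch that assumes every line follows the layout of the one above (date, time, a numeric id, a bracketed level, then the message). count_errors and LINE_RE are hypothetical names.

```python
import re
from collections import Counter

# Layout assumed from the sample line: date, time, numeric id, [LEVEL], message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{1,2}:\d{2}:\d{2})\s+(\d+)\s+\[(\w+)\]\s+(.*)$"
)


def count_errors(lines):
    """Count occurrences of each distinct [ERROR] message in the given lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Lines that do not fit the assumed layout are skipped silently
        if match and match.group(4) == "ERROR":
            counts[match.group(5)] += 1
    return counts
```

Fed with the lines of the knowage-cache pod log, count_errors(lines).most_common() lists the most frequent error messages first.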

I had to restart the Knowage server to get all the widgets in the cockpit working again.

Answer 6 · 2021-09-15T13:32:14.000Z

Hi @kerny3d, we clearly have an issue with our K8s cluster and storage management, so I am closing this issue.
Sorry for the disturbance.
CU