knix-microfunctions/knix

Faulty setup appears to be running, but fails to create first user

manuelstein opened this issue · 3 comments

On K8s, with a faulty setup, all components show as up and running, but storage may still not work.

In this particular case, DLService came up because it was able to connect to Riak. Riak did not succeed in forming a cluster ring (one node was still pending to join), but each Riak node was responding to ping (no errors, no restarts). DLService could connect to Riak, but probably did not succeed in writing, because the number of nodes was too low to achieve the required number of replicas. Still, all components came up and showed as running.
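
For reference, a check along these lines could detect a ring that hasn't settled, instead of relying on ping alone. This is just a sketch; only the `riak-admin member-status` call is a real command, the parsing and exit-code convention are assumptions:

```python
#!/usr/bin/env python3
# Sketch: consider the Riak cluster healthy only once every node has fully
# joined the ring, not merely when each node answers ping.
import subprocess
import sys

def ring_is_settled():
    # `riak-admin member-status` lists each node with a status such as
    # valid / joining / leaving / exiting / down.
    out = subprocess.run(["riak-admin", "member-status"],
                         capture_output=True, text=True, check=True).stdout
    unsettled = [line for line in out.splitlines()
                 if line.strip().startswith(("joining", "leaving", "exiting", "down"))]
    return not unsettled

if __name__ == "__main__":
    sys.exit(0 if ring_is_settled() else 1)
```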

The first operation a user performs is either creating a user or logging in. When DLService is degraded and can't store data, it simply returns "False" for operations. When ManagementService fails to store a new user, it just returns, and the GUI redirects the user to the login page with no indication of whether signing up was successful. Only by going through all the component logs can one detect that there are problems, and fixing the faulty setup requires familiarity with all the components.
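
To illustrate the kind of feedback I'd expect (the names and API below are hypothetical, not the actual ManagementService code), the sign-up path could surface the failed write so the GUI has a reason to display instead of silently redirecting:

```python
# Hypothetical sketch; `datalayer.put` and the response shape are assumptions,
# not the real ManagementService interface.
def create_user(datalayer, email, user_record):
    ok = datalayer.put("users", email, user_record)  # returns False on a failed write
    if not ok:
        # Surface the failure instead of redirecting to the login page silently.
        return {"status": "error",
                "message": "Sign-up failed: user record could not be stored "
                           "(storage backend unavailable or degraded)."}
    return {"status": "success", "message": "Sign-up successful, please log in."}
```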

I think the prolonged Riak start-up script doesn't solve this entirely.
When we set up a single-node Riak cluster, the DatalayerService starts successfully but can't write any data to Riak. The logs may show that it can't create enough replicas, but that error doesn't get propagated to the workflows (management and user). A write simply returns false, and the workflow can only guess the reason. One would need to dig into the DLService logs to find the cause. Should the platform retry failed writes? Should it propagate back to the client that the workflow invocation failed, with a proper reason?
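
As a sketch of the retry-and-propagate option (the client API and names here are assumed, not the real datalayer interface):

```python
import time

class DatalayerWriteError(Exception):
    """Raised when a write cannot be completed after retrying."""

def put_with_retry(datalayer, table, key, value, attempts=3, delay=1.0):
    # Retry transient failures a few times, then fail loudly with a reason
    # that can be returned to the client as the cause of the failed invocation.
    for _ in range(attempts):
        if datalayer.put(table, key, value):  # hypothetical boolean-returning API
            return
        time.sleep(delay)
    raise DatalayerWriteError(
        f"write to {table}/{key} failed after {attempts} attempts; "
        "check DLService/Riak replica availability")
```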

Isn't the prolonged Riak start-up essentially aiming to prevent a faulty setup in the first place?

Of course, there may be other problems at runtime, and they would have to be handled, perhaps with additional measures.

OK, the particular case had a Riak node stuck in "joining". I think @ruichuan added a 5-second sleep in #73 to address this particular case. It'd be safer to fail the pod in the else branch ... but okay, even if that ensures the Riak cluster comes up working, we might still have a setup that is degraded for other reasons while the components are shown as up and running. E.g., when the DL loses connectivity to Riak, it still serves storage operations, but this is not shown to the user.
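
One way to expose that kind of degradation would be a probe that performs a small test write/read against the datalayer rather than only checking that the process is up, so a DLService pod that has lost its Riak backend fails its readiness/liveness check. Purely a sketch; the client module and API are assumed:

```python
import sys
import time

def probe(datalayer_client):
    # Write a throwaway key and read it back; both must succeed for the pod
    # to be considered healthy.
    key = "__health_check__"
    value = str(time.time())
    if not datalayer_client.put("health", key, value):
        return False
    return datalayer_client.get("health", key) == value

if __name__ == "__main__":
    from dlclient import connect  # hypothetical client module
    sys.exit(0 if probe(connect("localhost")) else 1)
```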

But agreed, this particular issue is about the case where a Riak node is stuck in joining, and #73 is supposed to fix that.
/close