canonical/seldon-core-operator

`tests/integration/test_seldon_servers.py` is flaky in GitHub runners

DnPlas opened this issue · 3 comments

DnPlas commented

We are seeing some flaky test executions in GitHub runners when running `tests/integration/test_seldon_servers.py`:

  1. GH runners sometimes seem to be overloaded with these workloads, so deploying and verifying that SeldonDeployments are working correctly can take longer than expected, causing errors like `AssertionError: Waited too long for seldondeployment/mlflow!`. A possible fix is to increase the timeout; this is being worked on in #190.

  2. When the above happens, the SeldonDeployments created by the `test_seldon_predictor_server` test case are not always removed, because the failing test case has no cleanup step between cases. Since the test case is parametrised, it may try to deploy SeldonDeployments with the same name, which causes conflicts if a previous execution did not remove them, and the run fails with `lightkube.core.exceptions.ApiError: seldondeployments.machinelearning.seldon.io "mlflow" already exists`. This error can be found here. A fix is being worked on in #188. A rough sketch of both mitigations is shown after this list.
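
A minimal sketch of the two mitigations (not the repository's actual test code, nor the content of #188 or #190), assuming the tests use lightkube and tenacity; the status field path, timeout values, and the helper/fixture names are illustrative assumptions:

```python
# Sketch only: assumes lightkube and tenacity are available in the test
# environment, and that SeldonDeployment reports readiness via status.state.
import pytest
import tenacity
from lightkube import Client
from lightkube.generic_resource import create_namespaced_resource

# SeldonDeployment is a custom resource, so register it with lightkube first.
SeldonDeployment = create_namespaced_resource(
    group="machinelearning.seldon.io",
    version="v1",
    kind="SeldonDeployment",
    plural="seldondeployments",
)


@tenacity.retry(
    stop=tenacity.stop_after_delay(60 * 20),  # 1. larger overall timeout for slow runners
    wait=tenacity.wait_fixed(10),
    reraise=True,
)
def wait_for_seldon_deployment(client: Client, name: str, namespace: str):
    """Poll the SeldonDeployment until it reports an Available state."""
    sdep = client.get(SeldonDeployment, name=name, namespace=namespace)
    state = sdep.get("status", {}).get("state")  # generic resources behave like dicts
    assert state == "Available", f"Waited too long for seldondeployment/{name}!"


@pytest.fixture
def seldon_deployment_cleanup():
    """2. Remove SeldonDeployments even when a parametrised case fails, so the
    next case can reuse the same name without an 'already exists' conflict."""
    client = Client()
    created = []

    def register(name: str, namespace: str):
        created.append((name, namespace))

    yield register

    for name, namespace in created:
        try:
            client.delete(SeldonDeployment, name=name, namespace=namespace)
        except Exception:
            pass  # best effort: ignore "not found" and similar errors
```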

DnPlas commented

For the first error, I have tried increasing the timeout, but it seems like doing that in GH runners could lead to juju client disconnection errors. More investigation is needed on that side.

For the second error, the fix is merged.

DnPlas commented

> For the first error, I have tried increasing the timeout, but it seems like doing that in GH runners could lead to juju client disconnection errors. More investigation is needed on that side.

It seems this error is intermittent and we have not yet discovered the root cause, but as a workaround, re-running the test execution when it fails with a message similar to the following can help alleviate the problem:

  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 603, in rpc
    raise websockets.exceptions.ConnectionClosed(
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 0 (unknown), reason = websocket closed
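
As a rough illustration of that workaround (hypothetical, not something the repository ships), the tox environment named in the traceback could be re-run only when the failure contains this disconnection message; the helper name and attempt count are assumptions:

```python
# Hypothetical retry wrapper around the tox environment from the traceback above.
import subprocess

FLAKY_SIGNATURE = (
    "websockets.exceptions.ConnectionClosed: WebSocket connection is closed"
)


def run_integration_tests_with_retry(max_attempts: int = 3) -> int:
    """Re-run the suite only when it fails with the known juju websocket error."""
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(
            ["tox", "-e", "seldon-servers-integration"],
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:
            return 0
        if FLAKY_SIGNATURE not in proc.stdout + proc.stderr:
            return proc.returncode  # a different failure: do not mask it by retrying
        print(f"Known flaky disconnection, retrying ({attempt}/{max_attempts})")
    return proc.returncode
```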

DnPlas commented

> For the first error, I have tried increasing the timeout, but it seems like doing that in GH runners could lead to juju client disconnection errors. More investigation is needed on that side.
>
> It seems this error is intermittent and we have not yet discovered the root cause, but as a workaround, re-running the test execution when it fails with a message similar to the following can help alleviate the problem:
>
>   File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 603, in rpc
>     raise websockets.exceptions.ConnectionClosed(
> websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 0 (unknown), reason = websocket closed

The PRs fixing the two reported errors are now merged, which is why this GH issue is closed.
The error quoted above could still be present; take care when debugging to confirm whether it is this known issue:

  1. This error only happens when executing the `tests/integration/test_seldon_servers.py` tests.
  2. An exception from juju is raised due to the long timeout: `websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 0 (unknown), reason = websocket closed`.
  3. This should only happen while `assert_available()` is executing.

If the failure does not match all of the above, the CI error is potentially caused by something else and should be investigated properly.
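
Purely as an illustration of that checklist (a hypothetical helper, not part of the repository), a failed CI log could be scanned for the three markers above before treating the failure as this known issue:

```python
# Hypothetical triage helper: the marker strings mirror the three criteria above.
from pathlib import Path

KNOWN_ISSUE_MARKERS = (
    "tests/integration/test_seldon_servers.py",  # 1. only this test module
    "websockets.exceptions.ConnectionClosed",    # 2. the juju websocket exception
    "assert_available",                          # 3. raised while asserting availability
)


def matches_known_flaky_issue(log_path: str) -> bool:
    """Return True only if every marker appears in the CI log."""
    log = Path(log_path).read_text()
    return all(marker in log for marker in KNOWN_ISSUE_MARKERS)
```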