`tests/integration/test_seldon_servers.py` is flaky in GitHub runners
DnPlas opened this issue · 3 comments
It seems like we have some flaky test executions in GitHub runners when running `tests/integration/test_seldon_servers.py`:

- It seems like GH runners are sometimes overloaded with these workloads, and it may take more time than expected to deploy and verify that SeldonDeployments are working correctly, causing errors like `AssertionError: Waited too long for seldondeployment/mlflow!`. A possible fix is to increase the timeout; this is being worked on in #190.
- The above causes the SeldonDeployments created by the `test_seldon_predictor_server` test case to not always be removed, because the test case fails and doesn't have a step to ensure a cleanup between test cases. Since this test case is parametrised, it will try to deploy SeldonDeployments that may have the same name, which can cause conflicts if they were not correctly removed in a previous execution of the test case. This ends up in failures with the message `lightkube.core.exceptions.ApiError: seldondeployments.machinelearning.seldon.io "mlflow" already exists`. This error can be found here. A fix for this is being worked on in #188; a rough sketch of the cleanup idea is shown after this list.
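For illustration only (this is not necessarily the approach taken in #188), a cleanup step for the parametrised test could delete the SeldonDeployment after every run so names never collide. The fixture name and the bookkeeping below are assumptions; only the CRD group/plural are taken from the error message above.

```python
# Hypothetical cleanup sketch using lightkube; NOT the actual fix merged in #188.
import pytest
from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.generic_resource import create_namespaced_resource

# Generic resource for the SeldonDeployment CRD (group/plural taken from the error above).
SeldonDeployment = create_namespaced_resource(
    "machinelearning.seldon.io", "v1", "SeldonDeployment", "seldondeployments"
)


@pytest.fixture
def clean_seldon_deployments():
    """Collect (name, namespace) pairs from the test and delete them afterwards."""
    created = []
    yield created  # the test appends every SeldonDeployment it creates
    client = Client()
    for name, namespace in created:
        try:
            client.delete(SeldonDeployment, name, namespace=namespace)
        except ApiError:
            # Best-effort cleanup: the object may already be gone.
            pass
```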
For the first error, I have tried increasing the timeout, but it seems like doing that in GH runners could lead to juju client disconnection errors. More investigation is needed on that side.
For the second error, the fix is merged.
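For context on the first error, the `Waited too long for seldondeployment/mlflow!` message is the kind of assertion a polling helper raises when its deadline expires. A minimal, hypothetical sketch of such a poll (not the actual helper in the test suite; the function name, defaults, and message are made up to match the error) looks like this:

```python
# Illustration only: a generic poll-with-timeout, showing where a
# "Waited too long for ..." assertion typically comes from.
import time


def wait_for(check, timeout=20 * 60, interval=10, what="seldondeployment/mlflow"):
    """Call `check()` every `interval` seconds until it returns True or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return
        time.sleep(interval)
    raise AssertionError(f"Waited too long for {what}!")
```

Increasing such a `timeout` is the idea behind #190, but as noted above, very long waits in GH runners can surface the juju websocket disconnection instead.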
> For the first error, I have tried increasing the timeout, but it seems like doing that in GH runners could lead to juju client disconnection errors. More investigation is needed on that side.

It seems like this error is intermittent and we have not yet discovered the root cause, but as a workaround, re-trying the test execution when it fails with a message similar to this:

```
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 603, in rpc
    raise websockets.exceptions.ConnectionClosed(
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 0 (unknown), reason = websocket closed
```

can help alleviate the problem.
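As a rough illustration of that workaround only (nothing like this is necessarily wired into CI), a small wrapper could re-run the tox environment and retry only when the failure output contains the websocket-closed message. The retry count is arbitrary; the tox environment name is taken from the traceback path above:

```python
# Hypothetical retry wrapper for the flaky integration run; illustration only.
import subprocess
import sys

WEBSOCKET_CLOSED = "websockets.exceptions.ConnectionClosed: WebSocket connection is closed"


def run_with_retries(max_attempts: int = 3) -> int:
    cmd = ["tox", "-e", "seldon-servers-integration"]
    returncode = 1
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        returncode = result.returncode
        if returncode == 0:
            return 0
        output = result.stdout + result.stderr
        if WEBSOCKET_CLOSED not in output:
            # A different failure: don't hide it behind a retry.
            print(output)
            return returncode
        print(f"Attempt {attempt} hit the websocket-closed error, retrying...")
    return returncode


if __name__ == "__main__":
    sys.exit(run_with_retries())
```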
> For the first error, I have tried increasing the timeout, but it seems like doing that in GH runners could lead to juju client disconnection errors. More investigation is needed on that side.
>
> It seems like this error is intermittent and we have not yet discovered the root cause, but as a workaround, re-trying the test execution when it fails with a message similar to this:
>
> ```
> File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 603, in rpc
>     raise websockets.exceptions.ConnectionClosed(
> websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 0 (unknown), reason = websocket closed
> ```
>
> can help alleviate the problem.
The PRs that were part of the fix for the two reported errors are now merged, which is why this GH issue is closed.

The error I'm quoting above could still be present, so care must be taken when debugging it and ruling it out as a known issue:

- This error only happens when executing the `tests/integration/test_seldon_servers.py` tests.
- An exception from juju is raised (`websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 0 (unknown), reason = websocket closed`) due to the long timeout.
- This should only happen when executing the `assert_available()` test case.
If none of the above are true, the CI error is potentially caused by something else and should be investigated properly.