Integration tests failing with bad file descriptor message
Closed this issue · 4 comments
Encountered during a PR that just updated a random text file in the repo #183
The tests were failing for Seldon with the following message:
OSError: [Errno 9] Bad file descriptor
ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.
https://github.com/canonical/seldon-core-operator/actions/runs/5963901210/job/16178014925?pr=183
Full logs
WARNING juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
WARNING juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
ERROR asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2000' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
res = await connector(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
await self._connect(endpoints)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
result = await task
File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
return f.result() # May raise f.exception().
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
return await self._open(endpoint, cacert)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 402, in _open
return (await websockets.connect(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
transport, protocol = await self._creating_connection
File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
ERROR asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2001' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
res = await connector(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
await self._connect(endpoints)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
result = await task
File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
return f.result() # May raise f.exception().
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
return await self._open(endpoint, cacert)
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 402, in _open
return (await websockets.connect(
File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
transport, protocol = await self._creating_connection
File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.
That error - Bad file descriptor - is an effect of the test environment being torn down. The test environment is removed when the tests are finished. In this case the integration tests themselves are failing, most likely due to resource constraints.
This log is the real reason for this issue:
tests/integration/test_seldon_servers.py::test_seldon_predictor_server[MLFLOW_SERVER-mlflowserver.yaml-api/v1.0/predictions-request_data4-response_test_data4]
-------------------------------- live log call ---------------------------------
INFO httpx:_client.py:1013 HTTP Request: POST https://10.1.0.25:16443/apis/machinelearning.seldon.io/v1/namespaces/testing/seldondeployments?fieldManager=seldon-tests "HTTP/1.1 201 Created"
INFO httpx:_client.py:1013 HTTP Request: GET https://10.1.0.25:16443/apis/machinelearning.seldon.io/v1/namespaces/testing/seldondeployments/mlflow "HTTP/1.1 200 OK"
INFO test_seldon_servers:utils.py:29 seldondeployment/mlflow status == None (waiting for 'Available')
. . .
INFO httpx:_client.py:1013 HTTP Request: GET https://10.1.0.25:16443/apis/machinelearning.seldon.io/v1/namespaces/testing/seldondeployments/mlflow "HTTP/1.1 200 OK"
INFO test_seldon_servers:utils.py:25 Deployment of fseldondeployment/mlflow failed, status = Failed
FAILED
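The log above shows the test polling the SeldonDeployment status until it becomes 'Available' and giving up when it becomes 'Failed'. A rough sketch of such a wait loop is below. It is not the code in the repository's `utils.py`; `wait_for_available`, `get_status`, and the timeout values are hypothetical names chosen for illustration.

```python
import time


def wait_for_available(get_status, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll get_status() until it returns 'Available'.

    get_status is a hypothetical callable returning the current status
    string (or None while the status has not been populated yet).
    Raises RuntimeError on 'Failed' and TimeoutError when time runs out.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "Available":
            return status
        if status == "Failed":
            # Fail fast instead of waiting for the full timeout.
            raise RuntimeError("deployment failed, status = Failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout}s")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.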
Integration tests in Seldon fail from time to time. There was a fix for this, but it looks like Seldon still has this problem with its integration tests.
#191
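To illustrate why the Bad file descriptor error reads as a teardown side effect rather than a root cause, here is a minimal sketch. It assumes a Linux runner and that the socket was already closed when the reconnect ran; whether that is exactly what happened inside python-libjuju is an assumption. `errno_after_close` is a hypothetical helper.

```python
import errno
import socket


def errno_after_close():
    """Return the errno raised when a closed socket is reused."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.close()  # stands in for the environment being torn down
    try:
        # Same call as the last frame of the traceback above.
        sock.setblocking(False)
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    print(errno_after_close() == errno.EBADF)
```

On Linux the returned value corresponds to `errno.EBADF`, i.e. Errno 9.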
Do we delete each server after it's tested, or do we keep them all running?
@kimwnasptd In main we do delete servers when we are done testing them. This is done through a fixture.
https://github.com/canonical/seldon-core-operator/blob/main/tests/integration/test_seldon_servers.py#L268
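For readers unfamiliar with the pattern, this is roughly the shape of such cleanup: set up, yield to the test, and delete in a `finally` step. The sketch below is not the code at the link above. It uses `contextlib.contextmanager` so it runs without pytest (a pytest yield fixture has the same create/yield/delete shape), and `FakeClient`, `deployed_server`, and `run_server_test` are hypothetical names.

```python
from contextlib import contextmanager


class FakeClient:
    """Hypothetical stand-in for a Kubernetes client; tracks live resources."""

    def __init__(self):
        self.live = set()

    def create(self, name):
        self.live.add(name)

    def delete(self, name):
        self.live.discard(name)


@contextmanager
def deployed_server(client, name):
    """Create a server for one test and always delete it afterwards."""
    client.create(name)
    try:
        yield name
    finally:
        # Runs even when the test body raises, so servers never pile up.
        client.delete(name)


def run_server_test(client, name, should_fail=False):
    """Deploy, 'test', and clean up one server; return True on success."""
    try:
        with deployed_server(client, name):
            if should_fail:
                raise AssertionError(f"{name} failed")
        return True
    except AssertionError:
        return False
```

With this shape, a failed server test still releases its resources before the next server is deployed.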
The problem is that the integration tests pass locally, so we need to do the debugging in the GH runner. The plan was to move to self-hosted runners, but we hit issues with that setup. This is documented in our Jira.
We need to dedicate more time to investigate how we can remedy the problem.
Immediate steps that we can try (no guarantee of success):
- Separate server testing even further (it now runs in its own Juju model; maybe we can split the servers up even more so they run in separate Juju models).
- Investigate whether the size of the ML model used for testing is an issue, and find smaller ML models for testing.
@kimwnasptd @phoevos I re-ran the actions and the tests passed:
#183
In their current state, the Seldon tests have a 90% probability of passing on weekends when, I guess, GH runners are not used as much.
We will definitely need to investigate what is going on in the runner when this test is executed.
Testing Seldon servers in GH runners isn't a problem anymore since we split the testing of each server onto a different GH runner in #229. Given also that
error - Bad file descriptor - is an effect of tear down of test environment.
I'll go ahead and close this. We can open a new issue for more specific problems with testing in GH runners.