canonical/seldon-core-operator

Integration tests failing with bad file descriptor message

Closed this issue · 4 comments

Encountered during a PR that just updated a random text file in the repo #183

The tests were failing for Seldon with the following message:

OSError: [Errno 9] Bad file descriptor
ERROR    juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.

https://github.com/canonical/seldon-core-operator/actions/runs/5963901210/job/16178014925?pr=183

Full logs
WARNING  juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
WARNING  juju.client.connection:connection.py:611 RPC: Connection closed, reconnecting
ERROR    asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2000' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
    res = await connector(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
    await self._connect(endpoints)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
    result = await task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
    return await self._open(endpoint, cacert)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line [402](https://github.com/canonical/seldon-core-operator/actions/runs/5963901210/job/16178014925?pr=183#step:5:403), in _open
    return (await websockets.connect(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
    transport, protocol = await self._creating_connection
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
    sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR    juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
ERROR    asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-2001' coro=<Connection.reconnect() done, defined at /home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py:736> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 745, in reconnect
    res = await connector(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 823, in _connect_with_login
    await self._connect(endpoints)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 773, in _connect
    result = await task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 762, in _try_endpoint
    return await self._open(endpoint, cacert)
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/juju/client/connection.py", line 402, in _open
    return (await websockets.connect(
  File "/home/runner/work/seldon-core-operator/seldon-core-operator/.tox/seldon-servers-integration/lib/python3.8/site-packages/websockets/py35/client.py", line 12, in __await_impl__
    transport, protocol = await self._creating_connection
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1066, in _create_connection_transport
    sock.setblocking(False)
OSError: [Errno 9] Bad file descriptor
ERROR    juju.client.connection:connection.py:619 RPC: Automatic reconnect failed
Error: The operation was canceled.

That error - Bad file descriptor - is an effect of the tear down of the test environment: the environment gets removed once the tests finish. In this case the integration tests themselves are failing, most likely due to resource constraints.
This log shows the real reason for this issue:

tests/integration/test_seldon_servers.py::test_seldon_predictor_server[MLFLOW_SERVER-mlflowserver.yaml-api/v1.0/predictions-request_data4-response_test_data4] 
-------------------------------- live log call ---------------------------------
INFO     httpx:_client.py:1013 HTTP Request: POST https://10.1.0.25:16443/apis/machinelearning.seldon.io/v1/namespaces/testing/seldondeployments?fieldManager=seldon-tests "HTTP/1.1 201 Created"
INFO     httpx:_client.py:1013 HTTP Request: GET https://10.1.0.25:16443/apis/machinelearning.seldon.io/v1/namespaces/testing/seldondeployments/mlflow "HTTP/1.1 200 OK"
INFO     test_seldon_servers:utils.py:29 seldondeployment/mlflow status == None (waiting for 'Available')
. . .
INFO     httpx:_client.py:1013 HTTP Request: GET https://10.1.0.25:16443/apis/machinelearning.seldon.io/v1/namespaces/testing/seldondeployments/mlflow "HTTP/1.1 200 OK"
INFO     test_seldon_servers:utils.py:25 Deployment of fseldondeployment/mlflow failed, status = Failed
FAILED
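
For context, the wait loop behind those utils.py log lines boils down to polling the SeldonDeployment's status.state until it becomes "Available" and failing fast when it turns "Failed". Below is a minimal sketch of that pattern, not the actual helper in tests/integration/utils.py; `get_deployment` is assumed to be any callable that returns the SeldonDeployment custom resource as a dict (e.g. from a lightkube or raw Kubernetes API GET).

```python
# Sketch of the status-wait pattern, not the actual helper in
# tests/integration/utils.py. `get_deployment` is an assumed callable that
# returns the SeldonDeployment custom resource as a dict.
import logging
import time
from typing import Callable, Dict

log = logging.getLogger(__name__)


def wait_for_available(get_deployment: Callable[[], Dict], name: str,
                       timeout: int = 600, interval: int = 10) -> None:
    """Poll status.state until 'Available'; fail fast if it turns 'Failed'."""
    deadline = time.monotonic() + timeout
    while True:
        state = (get_deployment().get("status") or {}).get("state")
        if state == "Available":
            return
        if state == "Failed":
            raise AssertionError(f"Deployment of seldondeployment/{name} failed, status = {state}")
        log.info("seldondeployment/%s status == %s (waiting for 'Available')", name, state)
        if time.monotonic() > deadline:
            raise TimeoutError(f"seldondeployment/{name} never became 'Available'")
        time.sleep(interval)
```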

Integration tests fail in Seldon from time to time. There was a fix for this, but it looks like Seldon is still having this problem with its integration tests.
#191

Do we delete each server after it's tested, or do we keep them all running?

async def test_seldon_predictor_server(

@kimwnasptd In main we do delete the servers when we are done testing them. It is done through a fixture:
https://github.com/canonical/seldon-core-operator/blob/main/tests/integration/test_seldon_servers.py#L268
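
For reference, here is a minimal sketch of what such a cleanup fixture can look like, assuming lightkube and the "testing" namespace from the logs above; this is illustrative, not the repo's actual fixture.

```python
# Not the repo's actual fixture, just a sketch of the cleanup pattern using
# lightkube. The "testing" namespace matches the logs above; the fixture name
# is illustrative.
import pytest
from lightkube import Client
from lightkube.generic_resource import create_namespaced_resource

SeldonDeployment = create_namespaced_resource(
    group="machinelearning.seldon.io",
    version="v1",
    kind="SeldonDeployment",
    plural="seldondeployments",
)


@pytest.fixture
def seldon_deployment_cleanup():
    """Collect the names of SeldonDeployments a test creates and delete them on teardown."""
    client = Client()
    created = []
    yield created  # the test appends the name of each SeldonDeployment it applies
    for name in created:
        client.delete(SeldonDeployment, name, namespace="testing")
```

Because the deletion happens in the fixture's teardown, the server is removed even when the status assertion fails, so broken deployments do not keep consuming runner resources.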

The problem is that the integration tests pass locally, so we need to do the debugging in the GH runner. The plan was to move to self-hosted runners, but we hit issues with that setup; it is documented in our Jira.

We need to dedicate more time to investigate how we can remedy the problem.

Immediate steps that we can try (no guarantee of success):

  • Separate server testing even further: it already runs in its own Juju model, but maybe we can split the servers up so that each one runs in a separate Juju model (see the sketch after this list).
  • Investigate whether the size of the ML model used for testing is an issue and find smaller ML models to test with.
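
A rough sketch of the first idea, using python-libjuju to give every server test its own throwaway model; the model naming scheme and the `deploy_and_test_server` helper are hypothetical.

```python
# Rough sketch of running each server test in its own Juju model with
# python-libjuju. `deploy_and_test_server` is a hypothetical helper and the
# model naming scheme is an assumption.
from juju.controller import Controller


async def run_server_test_in_own_model(server_name: str) -> None:
    controller = Controller()
    await controller.connect()  # connect to the controller the CI bootstrapped
    model_name = f"seldon-{server_name.lower()}"
    model = await controller.add_model(model_name)
    try:
        await deploy_and_test_server(model, server_name)  # hypothetical helper
    finally:
        await model.disconnect()
        # Tear the model down so the next server starts from a clean slate.
        await controller.destroy_models(model_name, destroy_storage=True)
        await controller.disconnect()
```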

@kimwnasptd @phoevos I re-ran actions and tests passed:
#183

Seldon tests in their current state have about a 90% probability of passing on weekends when, I guess, GH runners are not used as much.
We will definitely need to investigate what is going on in the runner when this test is executed.

Testing Seldon servers in GH runners isn't a problem anymore since we split testing of each server into a different GH runner in #229. Given also that

error - Bad file descriptor - is an effect of the tear down of the test environment.

I'll go ahead and close this. We can open a new issue for more specific problems with testing in GH runners.