canonical/seldon-core-operator

`test_remove_with_resources_present` fails


test_remove_with_resources_present fails on the HEAD of main

So far the issue has been narrowed down to this: the failure is avoided only when at most one of the three possible CRDs is created.
That means only one of the following tests may run before test_remove_with_resources_present:

  • test_seldon_predictor_server[sklearn.yaml]
  • test_seldon_predictor_server[sklearn2.yaml]
  • test_seldon_deployment

If at least two tests from the list above are present, test_remove_with_resources_present fails.

Tested without test_remove_with_resources_present but with all other tests present. All tests pass, but the same error is still seen:

Task exception was never retrieved
future: <Task finished name='Task-614' coro=<Connection._connect.<locals>._try_endpoint() done, defined at /home/ichvets/cw/dev/seldon-core-operator/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py:758> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):

Analysis
The failure to properly update ops_test.model.application during the remove test is most likely due to a race condition created by the deployment and deletion of multiple SeldonDeployments in other tests.
This appears to be observed only during integration tests. If removed manually via the Juju CLI, the application is removed successfully.

In situations where the remove test fails, the stop and remove events are not recorded by jhack. In situations where the remove test succeeds, jhack records the stop and remove events and the charm is removed properly. In some cases the stop event is triggered, but the remove event is not.

Failure case:
When the remove test is executed, the charm goes into the terminated state:

$ juju status
Model            Controller          Cloud/Region        Version  SLA          Timestamp
test-charm-ko7i  microk8s-localhost  microk8s/localhost  2.9.42   unsupported  16:19:43-04:00

App                        Version  Status      Scale  Charm        Channel  Rev  Address         Exposed  Message
seldon-controller-manager           terminated    0/1  seldon-core             0  10.152.183.177  no       unit stopped by the cloud

Unit                          Workload    Agent  Address     Ports  Message
seldon-controller-manager/0*  terminated  lost   10.1.59.75         unit stopped by the cloud

However, jhack tail does not show stop or remove events:

$ jhack tail
 ┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                             
 ┃ timestamp ┃ seldon-controller-manager/0 ┃                                             
 ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩                                             
 │ 16:13:55  │ start                       │                                             
 │ 16:13:54  │ config_changed              │                                             
 │ 16:13:52  │ seldon_core_pebble_ready    │                                             
 │ 16:13:48  │ leader_elected              │                                             
 │ 16:13:47  │ install                     │                                             
 └───────────┴─────────────────────────────┘  

Suggestions/experimentation
Add wait_for_idle calls to the deployment tests to allow the workload container to settle when deleting SeldonDeployments.
Add a stop event handler that stops the workload container (in charm code).
Add resource requirements to the example SeldonDeployments (used in testing) to limit the amount of resources they request, so that tests can be executed in GitHub runners that have limited resources.
Add grace_period=0 to Lightkube delete() calls (in testing) when deleting test SeldonDeployments to ensure timely removal.
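
A rough sketch of how the stop handler and the delete-then-wait suggestions could look. This is not the code from the repository. The helpers are written against duck-typed objects: `container` is assumed to behave like an ops `Container`, `client` like a Lightkube `Client`, and `model` like `ops_test.model`. The function names, the `seldon-core` service name, the 300 second timeout, and the expectation that the charm returns to `active` status are all assumptions made for illustration.

```python
import asyncio

APP_NAME = "seldon-controller-manager"  # application name shown in the juju status output
WORKLOAD_SERVICE = "seldon-core"        # assumed Pebble service name (hypothetical)


def stop_workload(container, service_name=WORKLOAD_SERVICE):
    """Possible body of a charm stop-event handler.

    In a charm this would be called from a method registered with
    self.framework.observe(self.on.stop, ...). Returns True if a stop
    request was sent to the workload container, False if it was unreachable.
    """
    if not container.can_connect():
        return False
    container.stop(service_name)
    return True


async def delete_and_settle(client, model, resource, name, namespace,
                            app_name=APP_NAME, timeout=300):
    """Test helper: delete a SeldonDeployment, then wait for the charm to settle.

    `resource` is the resource class for SeldonDeployment (for example one
    built with Lightkube's generic-resource helpers).
    """
    # grace_period=0 asks Kubernetes to remove the object without a grace delay.
    client.delete(resource, name=name, namespace=namespace, grace_period=0)
    # Give the workload container time to settle before the next test runs.
    await model.wait_for_idle(
        apps=[app_name],
        status="active",  # assumption: the charm is expected to stay active
        raise_on_blocked=True,
        timeout=timeout,
    )
```

In a test this would be awaited at the end of each SeldonDeployment test, for example `await delete_and_settle(client, ops_test.model, seldon_resource, name, namespace)`, where `seldon_resource`, `name`, and `namespace` are placeholders for whatever the test created.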