`test_remove_with_resources_present` fails
`test_remove_with_resources_present` fails on the HEAD of main.
So far the issue has been narrowed down: only one of the three possible CRDs can be created without causing the failure. That means only one of the following tests may run before `test_remove_with_resources_present`:
- test_seldon_predictor_server[sklearn.yaml]
- test_seldon_predictor_server[sklearn2.yaml]
- test_seldon_deployment
If at least two tests from the list above are present, `test_remove_with_resources_present` fails.
This might be related to juju/python-libjuju#877.
Tested without `test_remove_with_resources_present`, but with all other tests present. All tests pass, BUT the same error is seen:
Task exception was never retrieved
future: <Task finished name='Task-614' coro=<Connection._connect.<locals>._try_endpoint() done, defined at /home/ichvets/cw/dev/seldon-core-operator/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py:758> exception=OSError(9, 'Bad file descriptor')>
Traceback (most recent call last):
Analysis
The failure to properly update `ops_test.model.application` during the remove test is most likely due to a race condition created by the deployment and deletion of multiple SeldonDeployments in other tests.
This appears to be observed only during integration tests. If removed manually via the Juju CLI, the application is removed successfully.
In situations where the remove test fails, `stop` and `remove` events are not recorded by `jhack`. In situations where the remove test succeeds, `jhack` records `stop` and `remove` events and the charm is removed properly. In some cases the `stop` event is triggered, but the `remove` event is not.
Failure case:
When the remove test is executed, the charm goes into the `terminated` state:
$ juju status
Model            Controller          Cloud/Region        Version  SLA          Timestamp
test-charm-ko7i  microk8s-localhost  microk8s/localhost  2.9.42   unsupported  16:19:43-04:00

App                        Version  Status      Scale  Charm        Channel  Rev  Address         Exposed  Message
seldon-controller-manager           terminated  0/1    seldon-core           0    10.152.183.177  no       unit stopped by the cloud

Unit                          Workload    Agent  Address     Ports  Message
seldon-controller-manager/0*  terminated  lost   10.1.59.75         unit stopped by the cloud
However, `jhack tail` does not show `stop` or `remove` events:
$ jhack tail
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ timestamp ┃ seldon-controller-manager/0 ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 16:13:55 │ start │
│ 16:13:54 │ config_changed │
│ 16:13:52 │ seldon_core_pebble_ready │
│ 16:13:48 │ leader_elected │
│ 16:13:47 │ install │
└───────────┴─────────────────────────────┘
Suggestions/experimentation
- Add `wait_for_idle` calls to deployment tests to allow the workload container to settle when deleting SeldonDeployments.
- Add a `stop` event handler that stops the workload container (in charm code).
- Add `resources` requirements to the example SeldonDeployments (used in testing) to limit the amount of resources they request, so that tests can be executed in GitHub runners that have limited resources.
- Add `grace_period=0` to Lightkube `delete()` calls (in testing) when deleting test SeldonDeployments to ensure timely removal.
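
A rough sketch of how the `stop` handler, `wait_for_idle` and `grace_period=0` suggestions might look. This is not the actual charm or test code: the helper names `stop_workload` and `delete_and_settle`, the `status="active"` target and the 300 second timeout are assumptions made for illustration. The objects are passed in rather than imported, so `container` is expected to behave like an `ops` `Container`, `client` like a Lightkube `Client`, and `model` like `ops_test.model`.

```python
def stop_workload(container, service_name):
    """Charm side: stop the Pebble service when the `stop` event fires.

    `container` is expected to behave like an `ops` Container.
    Returns True if a stop request was sent, False otherwise.
    """
    if not container.can_connect():
        # Pebble is already unreachable, so there is nothing left to stop.
        return False
    if service_name not in container.get_services():
        # The service was never added to the Pebble plan.
        return False
    container.stop(service_name)
    return True


async def delete_and_settle(
    model, client, resource, name, namespace, app_name, timeout=300
):
    """Test side: delete a SeldonDeployment immediately, then let the charm settle.

    `client` is expected to behave like a Lightkube Client and `model`
    like `ops_test.model`. The "active" status target is an assumption.
    """
    # grace_period=0 asks Kubernetes to remove the resource without delay.
    client.delete(resource, name=name, namespace=namespace, grace_period=0)
    # Give the workload container time to settle before the next test runs.
    await model.wait_for_idle(
        apps=[app_name],
        status="active",
        raise_on_blocked=True,
        timeout=timeout,
    )
```

In the charm, `stop_workload` could be called from a handler registered with `self.framework.observe(self.on.stop, self._on_stop)`. The container name `seldon-core` is inferred from the `seldon_core_pebble_ready` event above, and the Pebble service name would have to be taken from the charm's own layer. In the tests, `delete_and_settle` would replace a bare `delete()` call at the end of each SeldonDeployment test.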