scylladb/scylla-cluster-tests

Argus is not setting `TEST-ERROR` for some tests

Closed this issue · 3 comments

For some reason, this run and this run did not handle this event as `TEST-ERROR`:

2024-09-06 00:01:56.534: (SpotTerminationEvent Severity.CRITICAL) period_type=one-time event_id=289e2787-9d81-4d8b-b098-f8258e954662: node=Node longevity-10gb-3h-master-db-node-b1611080-eastus-2 [None | 10.0.0.6] message=Got spot termination event for node: ['longevity-10gb-3h-master-db-node-eastus-2']. VM eviction time is Fri, 06 Sep 2024 00:02:05 GMT.

My primary suspicion is that the SCT runner shut down before it got to event collection. The new bit we added that uploads the events during log collection should probably trigger the same logic.

Another thing is that they both ran on Azure, and maybe we handle something differently there, although the regex for a plain Spot Termination Event is very simple and I don't yet see what could be stopping it from being matched.
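To rule out the matching side, a quick check like the one below can be run against the logged line. This is only a sketch: `EVENT_RE`, `parse_event`, and `is_spot_termination` are hypothetical names, and the pattern is not the regex SCT or Argus actually uses. It only shows that the event class and severity are trivially extractable from the line quoted above.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for illustration; NOT the regex used by SCT or Argus.
# It captures the event class name and severity from "(Name Severity.LEVEL)".
EVENT_RE = re.compile(r"\((?P<event>\w+) Severity\.(?P<severity>[A-Z]+)\)")

# The event line from this issue, shortened to the parts the pattern needs.
LINE = (
    "2024-09-06 00:01:56.534: (SpotTerminationEvent Severity.CRITICAL) "
    "period_type=one-time event_id=289e2787-9d81-4d8b-b098-f8258e954662: "
    "message=Got spot termination event for node: "
    "['longevity-10gb-3h-master-db-node-eastus-2']."
)


def parse_event(line: str) -> Optional[Tuple[str, str]]:
    """Return (event_name, severity) if the line contains an event header."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    return match.group("event"), match.group("severity")


def is_spot_termination(line: str) -> bool:
    """True when the line is a CRITICAL SpotTerminationEvent."""
    return parse_event(line) == ("SpotTerminationEvent", "CRITICAL")


if __name__ == "__main__":
    print(parse_event(LINE))
    print(is_spot_termination(LINE))
```

If a check like this passes on the Azure line, the match itself is unlikely to be the problem, which points back at how the status is set afterwards.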

I think it's something on the Argus side; argus.log shows:

< t:2024-09-10 02:22:38,581 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': 'test_error', 'status': 'ok'}
< t:2024-09-10 02:22:38,944 f:base.py         l:65   c:argus.client.base    p:DEBUG > API Response: {'response': 'Finalized', 'status': 'ok'}

We also have a last stage in the pipeline that checks the status and sets it as well, so the problem could be there too.
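If a later stage does set the status again, one possible guard is to let the more severe status win. The sketch below is an assumption and not the logic in sct.py or Argus: `STATUS_RANK`, `resolve_final_status`, the `passed`/`failed` status names, and the ranking itself are all hypothetical. Only `test_error` comes from the log above.

```python
# Hypothetical severity ranking; only "test_error" appears in the argus.log
# excerpt, the other names and the ordering are assumptions for illustration.
STATUS_RANK = {"passed": 0, "failed": 1, "test_error": 2}


def resolve_final_status(current: str, proposed: str) -> str:
    """Keep the more severe of two statuses; unknown statuses rank lowest."""
    if STATUS_RANK.get(proposed, -1) > STATUS_RANK.get(current, -1):
        return proposed
    return current


if __name__ == "__main__":
    # An earlier stage reported test_error; a later stage proposes "passed".
    print(resolve_final_status("test_error", "passed"))
```

With a guard like this, a status submitted after `test_error` would not downgrade it, whichever stage runs last.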

Here in fact:

https://github.com/scylladb/scylla-cluster-tests/blob/master/sct.py#L1786-L1791

I'm surprised we didn't see it earlier; I'll send a PR in a second.