Missing resource_status results is crashing Prometheus plugin
QuantumDancer opened this issue · 2 comments
Dear all,
we are currently migrating our C/T to the newest image (the old image is from September 2022 with a custom auditor plugin) and are seeing some crashes related to the Prometheus monitoring plugin. We are running the following image:
docker images --digests
REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE
matterminers/cobald-tardis latest sha256:7b6fc72444eb7f25d8b17d6e957311fb4d7d5e3abed70aed4875e373aafcbafc d2ca28594b2b 6 weeks ago 1.03GB
The crash happens when a drone changes to RequestState:
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: cobald.runtime.tardis.plugins.prometheusmonitoring: 2023-07-10 08:03:24 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid':
None, 'created': datetime.datetime(2023, 7, 10, 8, 3, 23, 922620), 'updated': datetime.datetime(2023, 7, 10, 8, 3, 24, 469225), 'drone_uuid': 'nemo-34fd16c8b9'} has changed state to RequestState
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: cobald.runtime.runner.asyncio: 2023-07-10 08:03:24 runner aborted: <cobald.daemon.runners.asyncio_runner.AsyncioRunner object at 0x7f51391150d0>
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: Traceback (most recent call last):
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/base_runner.py", line 68, in run
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: await self.manage_payloads()
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/asyncio_runner.py", line 54, in manage_payloads
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: await self._payload_failure
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/asyncio_runner.py", line 40, in _monitor_payload
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: result = await payload()
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 120, in run
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: await self.set_state(RequestState())
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 143, in set_state
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: await self.notify_plugins()
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 153, in notify_plugins
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: await plugin.notify(self.state, self.resource_attributes)
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/tardis/plugins/prometheusmonitoring.py", line 67, in notify
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: new_status = resource_attributes.resource_status
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: File "/usr/local/lib/python3.8/site-packages/tardis/utilities/attributedict.py", line 17, in __getattr__
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: raise AttributeError(
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: AttributeError: resource_status is not a valid attribute. Dict contains {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None,
'created': datetime.datetime(2023, 7, 10, 8, 3, 23, 922620), 'updated': datetime.datetime(2023, 7, 10, 8, 3, 24, 469225), 'drone_uuid': 'nemo-34fd16c8b9'}.
As the error message indicates, the attribute resource_status is missing in the resource_attributes dict. This dict is accessed in the notify method of the Prometheus plugin (line 67):
tardis/tardis/plugins/prometheusmonitoring.py
Lines 47 to 73 in 0f76db0
After the service restarts C/T and some time has passed, we do however see that resource_status is now present in the resource_attributes dict:
Jul 10 10:03:38 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2130347]: cobald.runtime.tardis.plugins.sqliteregistry: 2023-07-10 08:03:38 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16960754, 'created': datetime.datetime(2023, 7, 10, 8, 3, 38, 592070), 'updated': datetime.datetime(2023, 7, 10, 8, 3, 38, 592190), 'drone_uuid': 'nemo-34fd16c8b9', 'resource_status': <ResourceStatus.Booting: 1>} has changed state to BootingState
I can work on a fix, but I would need to know how we want the Prometheus plugin to behave. I currently have two ideas in mind:
- just skip the Prometheus update if the resource_status attribute is missing
- set the new_status variable to a default value (e.g. BootingState) when the resource_status attribute is missing
But maybe you also have a different fix in mind.
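To make the two ideas concrete, here is a minimal sketch of what they could look like. It does not use the real TARDIS classes: AttributeDict and ResourceStatus below are simplified stand-ins (only the Booting = 1 member is taken from the log above), and status_or_skip and status_or_default are hypothetical helper names, not existing plugin functions.

```python
from enum import Enum
from typing import Optional


class ResourceStatus(Enum):
    # Stand-in enum; only Booting = 1 is confirmed by the log output above.
    Booting = 1


class AttributeDict(dict):
    """Stand-in: a dict with attribute access that raises on missing keys."""

    def __getattr__(self, item):
        try:
            return self[item]
        except KeyError:
            raise AttributeError(f"{item} is not a valid attribute.") from None


def status_or_skip(resource_attributes) -> Optional[ResourceStatus]:
    """Idea 1: return None so the caller can skip the metrics update."""
    # getattr with a default swallows the AttributeError from __getattr__.
    return getattr(resource_attributes, "resource_status", None)


def status_or_default(resource_attributes, default=ResourceStatus.Booting):
    """Idea 2: fall back to a default status when none has been set yet."""
    return getattr(resource_attributes, "resource_status", default)


# A drone that just entered RequestState has no resource_status yet.
fresh = AttributeDict(drone_uuid="nemo-34fd16c8b9", remote_resource_uuid=None)

new_status = status_or_skip(fresh)
if new_status is None:
    pass  # idea 1: leave the gauges untouched for this notification

new_status = status_or_default(fresh)  # idea 2: treated as Booting
```

With the first variant the drone would only show up in the metrics once a status has been reported; with the second it would be counted as booting from the moment it is requested.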
Thanks a lot for the report, this looks like a potential bug. I will have a look and let you know how to fix it. I would avoid doing that in the Prometheus plugin itself.