MatterMiners/tardis

Missing resource_status results is crashing Prometheus plugin

QuantumDancer opened this issue · 2 comments

Dear all,

we are currently migrating our C/T to the newest image (the old image is from September 2022 with a custom auditor plugin) and are seeing some crashes related to the Prometheus monitoring plugin. We are running the following image:

docker images --digests
REPOSITORY                                      TAG       DIGEST                                                                    IMAGE ID       CREATED        SIZE
matterminers/cobald-tardis                      latest    sha256:7b6fc72444eb7f25d8b17d6e957311fb4d7d5e3abed70aed4875e373aafcbafc   d2ca28594b2b   6 weeks ago    1.03GB

The crash happens when a drone changes to RequestState:

Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: cobald.runtime.tardis.plugins.prometheusmonitoring: 2023-07-10 08:03:24 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 7, 10, 8, 3, 23, 922620), 'updated': datetime.datetime(2023, 7, 10, 8, 3, 24, 469225), 'drone_uuid': 'nemo-34fd16c8b9'} has changed state to RequestState
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: cobald.runtime.runner.asyncio: 2023-07-10 08:03:24 runner aborted: <cobald.daemon.runners.asyncio_runner.AsyncioRunner object at 0x7f51391150d0>
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: Traceback (most recent call last):
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/base_runner.py", line 68, in run
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     await self.manage_payloads()
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/asyncio_runner.py", line 54, in manage_payloads
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     await self._payload_failure
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/asyncio_runner.py", line 40, in _monitor_payload
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     result = await payload()
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 120, in run
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     await self.set_state(RequestState())
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 143, in set_state
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     await self.notify_plugins()
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 153, in notify_plugins
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     await plugin.notify(self.state, self.resource_attributes)
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/tardis/plugins/prometheusmonitoring.py", line 67, in notify
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     new_status = resource_attributes.resource_status
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:   File "/usr/local/lib/python3.8/site-packages/tardis/utilities/attributedict.py", line 17, in __getattr__
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]:     raise AttributeError(
Jul 10 10:03:24 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2129502]: AttributeError: resource_status is not a valid attribute. Dict contains {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 7, 10, 8, 3, 23, 922620), 'updated': datetime.datetime(2023, 7, 10, 8, 3, 24, 469225), 'drone_uuid': 'nemo-34fd16c8b9'}.

As the error message indicates, the attribute resource_status is missing in the resource_attributes dict. This dict is accessed in the notify method of the Prometheus plugin (line 67):

async def notify(self, state: State, resource_attributes: AttributeDict) -> None:
    """
    Update Prometheus metrics at every state change
    :param state: New state of the Drone
    :type state: State
    :param resource_attributes: Contains all meta-data of the Drone (created and
        updated timestamps, dns name, unique id, site_name, machine_type, etc.)
    :type resource_attributes: AttributeDict
    :return: None
    """
    if not self._svr_started:
        await self.start()
    logger.debug(f"Drone: {str(resource_attributes)} has changed state to {state}")
    if resource_attributes.drone_uuid in self._drones:
        old_status = self._drones[resource_attributes.drone_uuid]
        self._gauges[old_status].dec({})
    new_status = resource_attributes.resource_status
    self._drones[resource_attributes.drone_uuid] = new_status
    self._gauges[new_status].inc({})
    if new_status == ResourceStatus.Deleted:
        self._drones.pop(resource_attributes.drone_uuid, None)

After the service restarts C/T and some time has passed, we do however see that resource_status is now present in the resource_attributes dict:

Jul 10 10:03:38 monopol.bfg.privat docker-COBalD-Tardis-atlhei[2130347]: cobald.runtime.tardis.plugins.sqliteregistry: 2023-07-10 08:03:38 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16960754, 'created': datetime.datetime(2023, 7, 10, 8, 3, 38, 592070), 'updated': datetime.datetime(2023, 7, 10, 8, 3, 38, 592190), 'drone_uuid': 'nemo-34fd16c8b9', 'resource_status': <ResourceStatus.Booting: 1>} has changed state to BootingState
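
The two log excerpts differ in whether resource_status has been set yet. A minimal sketch of that failure mode is below. The AttributeDict class here is a hypothetical stand-in written for this example (it only mimics the error message from the traceback and is not the tardis implementation), and the ResourceStatus enum contains only the Booting member whose value appears in the log.

```python
from enum import Enum


class ResourceStatus(Enum):
    # Only Booting = 1 is taken from the log above; the real enum has more members.
    Booting = 1


class AttributeDict(dict):
    """Hypothetical stand-in: a dict whose keys can be read as attributes."""

    def __getattr__(self, item):
        try:
            return self[item]
        except KeyError:
            # Mirrors the wording of the error message in the traceback.
            raise AttributeError(
                f"{item} is not a valid attribute. Dict contains {str(self)}."
            ) from None


# Attributes as logged at the change to RequestState: no resource_status key yet.
early = AttributeDict(
    site_name="NEMO", drone_uuid="nemo-34fd16c8b9", remote_resource_uuid=None
)

# Attributes as logged at the change to BootingState: resource_status is set.
later = AttributeDict(
    early, remote_resource_uuid=16960754, resource_status=ResourceStatus.Booting
)

for name, attrs in (("early", early), ("later", later)):
    try:
        print(name, "->", attrs.resource_status)
    except AttributeError as err:
        print(name, "-> AttributeError:", err)
```

With a mapping like this, getattr(attrs, "resource_status", None) returns None for the early dict instead of raising, which is one way a caller could probe for the attribute.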

I can work on a fix, but I would need to know how we want the Prometheus plugin to behave. I currently have two ideas in mind:

  • just skip the Prometheus update if the resource_status attribute is missing
  • set the new_status variable to a default value (e.g. ResourceStatus.Booting) when the resource_status attribute is missing

But maybe you also have a different fix in mind.
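
To make the two ideas concrete, here is a rough sketch of both behaviours. Everything in it is an assumption for illustration: the Tracker class is a hypothetical, simplified stand-in for the plugin's bookkeeping, plain integer counters replace the Prometheus gauges, a plain dict replaces AttributeDict, and the numeric value of ResourceStatus.Deleted is made up. The default is written as a ResourceStatus member because the notify method compares new_status against ResourceStatus.Deleted.

```python
from enum import Enum
from typing import Dict, Optional


class ResourceStatus(Enum):
    # Booting = 1 matches the log above; the value for Deleted is assumed.
    Booting = 1
    Deleted = 2


class Tracker:
    """Hypothetical stand-in for the plugin's per-status bookkeeping."""

    def __init__(self, default: Optional[ResourceStatus] = None) -> None:
        # default=None selects idea 1 (skip); a status selects idea 2 (default).
        self.default = default
        self.drones: Dict[str, ResourceStatus] = {}
        self.counts: Dict[ResourceStatus, int] = {s: 0 for s in ResourceStatus}

    def notify(self, resource_attributes: dict) -> bool:
        """Return True if the counters were updated, False if skipped."""
        new_status = resource_attributes.get("resource_status", self.default)
        if new_status is None:
            return False  # idea 1: no status yet, leave the counters alone
        uuid = resource_attributes["drone_uuid"]
        if uuid in self.drones:
            self.counts[self.drones[uuid]] -= 1
        self.drones[uuid] = new_status
        self.counts[new_status] += 1
        if new_status == ResourceStatus.Deleted:
            self.drones.pop(uuid, None)
        return True


request_state_attrs = {"drone_uuid": "nemo-34fd16c8b9"}  # no resource_status yet

skipping = Tracker()
print(skipping.notify(request_state_attrs), skipping.counts)

defaulting = Tracker(default=ResourceStatus.Booting)
print(defaulting.notify(request_state_attrs), defaulting.counts)
```

The trade-off in this sketch: skipping means a drone is not counted until it reports a status, while defaulting counts it immediately but may report a status the drone does not actually have yet.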

Thanks a lot for the report, this looks like a potential bug. I will have a look and let you know how to fix it. I would avoid doing that in the Prometheus plugin itself.

Should be fixed when merging #301.