Add redfish_chassis_temperature_sensor_health_state metric
Hi,
The current temperature metrics look like this:
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="CPU1 Temp", sensor_id="0"} 37
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="CPU2 Temp", sensor_id="1"} 32
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="System Board Exhaust Temp", sensor_id="4"} 30
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="System Board GPU7 Temp", sensor_id="3"} 32
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="System Board Inlet Temp", sensor_id="2"} 19
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="CPU1 Temp", sensor_id="0"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="CPU2 Temp", sensor_id="1"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="System Board Exhaust Temp", sensor_id="4"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="System Board GPU7 Temp", sensor_id="3"} 1
redfish_chassis_temperature_sensor_state{chassis_id="System.Embedded.1", instance="xxxx", job="redfish-exporter", resource="temperature", sensor="System Board Inlet Temp", sensor_id="2"} 1
Note: for the test I set the Warning threshold for sensor "System Board Inlet Temp" to 17.
The only state/health metrics > 1 in this case are:
redfish_system_health_state{cluster="steyr-prod-gpu",environment="prod",instance="steyr-prod-gpu__lp05edge02008",job="redfish-exporter",node="lp05edge02008",prometheus="victoriametrics/central",resource="system",scrape_from="edge-tooling",system_id="System.Embedded.1"} 2
redfish_chassis_health{chassis_id="System.Embedded.1",cluster="steyr-prod-gpu",environment="prod",instance="steyr-prod-gpu__lp05edge02008",job="redfish-exporter",node="lp05edge02008",prometheus="victoriametrics/central",resource="chassis",scrape_from="edge-tooling"} 2
So in this case we can either get only an unspecific chassis alert, or we need to define an alert on redfish_chassis_temperature_celsius with separate thresholds in the alert definition, which might not match the thresholds configured on the servers.
But at least on our Dell servers, a Health value is also provided via:
https:///redfish/v1/Chassis/System.Embedded.1/Sensors/SystemBoardInletTemp
For example:
{
"@odata.context": "/redfish/v1/$metadata#Sensor.Sensor",
"@odata.id": "/redfish/v1/Chassis/System.Embedded.1/Sensors/SystemBoardInletTemp",
"@odata.type": "#Sensor.v1_5_0.Sensor",
"Name": "System Board Inlet Temp",
"Id": "SystemBoardInletTemp",
"Description": "Instance of Sensor Id",
"ReadingType": "Temperature",
"ReadingUnits": "Cel",
"Status": {
"Health": "Warning",
"State": "Enabled"
},
"Reading": 20.0,
...
}
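To make the request concrete, here is a minimal Go sketch that decodes the fields of interest from a Sensor resource like the one above, using only encoding/json. The type and function names (sensor, sensorStatus, parseSensor) are hypothetical and are not the exporter's actual types; properties not declared in the struct (including whatever is behind the "...") are simply ignored by the decoder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sensorStatus mirrors the Redfish "Status" object (hypothetical type name).
type sensorStatus struct {
	Health string `json:"Health"`
	State  string `json:"State"`
}

// sensor holds only the Sensor properties needed here (hypothetical type name).
// encoding/json silently ignores any other properties in the body.
type sensor struct {
	ID           string       `json:"Id"`
	Name         string       `json:"Name"`
	ReadingType  string       `json:"ReadingType"`
	ReadingUnits string       `json:"ReadingUnits"`
	Reading      float64      `json:"Reading"`
	Status       sensorStatus `json:"Status"`
}

// parseSensor decodes the body of a .../Sensors/<Id> resource.
func parseSensor(body []byte) (sensor, error) {
	var s sensor
	err := json.Unmarshal(body, &s)
	return s, err
}

func main() {
	body := []byte(`{
		"@odata.id": "/redfish/v1/Chassis/System.Embedded.1/Sensors/SystemBoardInletTemp",
		"Name": "System Board Inlet Temp",
		"Id": "SystemBoardInletTemp",
		"ReadingType": "Temperature",
		"ReadingUnits": "Cel",
		"Status": {"Health": "Warning", "State": "Enabled"},
		"Reading": 20.0
	}`)
	s, err := parseSensor(body)
	if err != nil {
		panic(err)
	}
	// prints: System Board Inlet Temp: 20.0 Cel, health=Warning, state=Enabled
	fmt.Printf("%s: %.1f %s, health=%s, state=%s\n", s.Name, s.Reading, s.ReadingUnits, s.Status.Health, s.Status.State)
}
```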
Can the redfish_exporter be extended by such a temperature health metric?
I checked the code: we have redfish_chassis_temperature_celsius and redfish_chassis_temperature_sensor_state, but we don't have redfish_chassis_temperature_sensor_health. I will check whether we can add redfish_chassis_temperature_sensor_health.
@ulikl the latest commit adds such a metric; please build and test it, since I don't have a device.
@jenningsloy318, thank you very much. It's working:
# HELP redfish_chassis_temperature_celsius celsius of temperature on this chassis component
# TYPE redfish_chassis_temperature_celsius gauge
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU1 Temp",sensor_id="0"} 36
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU2 Temp",sensor_id="1"} 36
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Exhaust Temp",sensor_id="3"} 37
redfish_chassis_temperature_celsius{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Inlet Temp",sensor_id="2"} 27
# HELP redfish_chassis_temperature_sensor_health status health of temperature on this chassis component,1(Enabled),2(Disabled),3(StandbyOffinline),4(StandbySpare),5(InTest),6(Starting),7(Absent),8(UnavailableOffline),9(Deferring),10(Quiesced),11(Updating)
# TYPE redfish_chassis_temperature_sensor_health gauge
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU1 Temp",sensor_id="0"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU2 Temp",sensor_id="1"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Exhaust Temp",sensor_id="3"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Inlet Temp",sensor_id="2"} 1
With the inlet temperature above the warning threshold:
# TYPE redfish_chassis_temperature_sensor_health gauge
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU1 Temp",sensor_id="0"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="CPU2 Temp",sensor_id="1"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Exhaust Temp",sensor_id="3"} 1
redfish_chassis_temperature_sensor_health{chassis_id="System.Embedded.1",resource="temperature",sensor="System Board Inlet Temp",sensor_id="2"} 2
If "2" means Warning, the HELP text is wrong: it should be CommonHealthHelp instead of CommonStateHelp, no?
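The mismatch is easy to see by decoding the reported value against each legend. The Go sketch below parses a HELP-style legend of the form "N(Name),..." and looks up a value. The state legend is copied from the HELP line in the output above; the health legend "1(OK),2(Warning),3(Critical)" is an assumption based on the Redfish Health values, since the actual contents of CommonHealthHelp are not shown in this thread. The lookup function is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stateLegend is copied from the HELP line emitted by the exporter above.
const stateLegend = "1(Enabled),2(Disabled),3(StandbyOffinline),4(StandbySpare),5(InTest),6(Starting),7(Absent),8(UnavailableOffline),9(Deferring),10(Quiesced),11(Updating)"

// healthLegend is an assumed legend for Redfish Status.Health.
const healthLegend = "1(OK),2(Warning),3(Critical)"

// lookup returns the name that a legend "N(Name),..." assigns to value,
// or "" if the value is not listed.
func lookup(legend string, value int) string {
	for _, entry := range strings.Split(legend, ",") {
		open := strings.Index(entry, "(")
		if open < 0 || !strings.HasSuffix(entry, ")") {
			continue // skip malformed entries
		}
		n, err := strconv.Atoi(entry[:open])
		if err == nil && n == value {
			return entry[open+1 : len(entry)-1]
		}
	}
	return ""
}

func main() {
	// Value reported for "System Board Inlet Temp" with the inlet above the warning threshold.
	const v = 2
	fmt.Printf("state legend:  %d = %s\n", v, lookup(stateLegend, v))  // state legend:  2 = Disabled
	fmt.Printf("health legend: %d = %s\n", v, lookup(healthLegend, v)) // health legend: 2 = Warning
}
```

Read with the current HELP text, the inlet sensor would appear "Disabled"; read with a health legend, it is "Warning", which matches the Status.Health the BMC reports.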