/hms-capmc


Cray Advanced Platform Monitoring and Control (CAPMC) Service

Cray Advanced Platform Monitoring and Control (CAPMC) monitors and controls certain components in a Shasta system. It exposes a RESTful interface and runs in the management plane on the SMS cluster. Most operations require administrator-level permissions. CAPMC relies on a running Hardware State Manager (HSM) service, which holds all of the information CAPMC needs to communicate with the hardware.

Building and Executing CAPMC

Building CAPMC

Building CAPMC after the Repo split

Running CAPMC (capmcd) locally

Starting capmcd:

./capmcd -http-listen="localhost:27777" -hsm=https://localhost:27779

Caveats: Connecting to the TSDB

By default, capmcd tries to connect to a Postgres database. Use the following environment variables to specify where it should connect:

DB_HOSTNAME=somePostgresDB
DB_PORT=thePort
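As a rough illustration of how a service might consume these two variables, here is a minimal Go sketch. The helper name `dbAddress` and the fallback values `localhost` and `5432` are assumptions made for the example; they are not taken from the capmcd source.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// dbAddress builds a "host:port" string from DB_HOSTNAME and DB_PORT.
// getenv is injected so the lookup can be exercised without touching
// the real environment. The fallbacks below are illustrative
// assumptions, not capmcd's actual defaults.
func dbAddress(getenv func(string) string) string {
	host := getenv("DB_HOSTNAME")
	if host == "" {
		host = "localhost" // assumed fallback
	}
	port := getenv("DB_PORT")
	if port == "" {
		port = "5432" // conventional Postgres port, assumed fallback
	}
	return net.JoinHostPort(host, port)
}

func main() {
	// Reads the variables from the process environment.
	fmt.Println("database address:", dbAddress(os.Getenv))
}
```

Running it as `DB_HOSTNAME=somePostgresDB DB_PORT=5433 go run .` would print the address built from those two values.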

Example curl command to verify that capmcd is working:

curl -X POST -i -d '{"nids":[7]}' http://localhost:27777/capmcd/get_node_status
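The same check can be sketched in Go. The URL and the `{"nids":[7]}` body come from the curl example above; the type and function names (`nodeStatusRequest`, `nodeStatusBody`, `getNodeStatus`) are hypothetical, and the `application/json` content type is an assumption (curl's `-d` sends a form content type by default).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// nodeStatusRequest mirrors the JSON body used in the curl example.
type nodeStatusRequest struct {
	Nids []int `json:"nids"`
}

// nodeStatusBody encodes a list of NIDs as the JSON request body.
func nodeStatusBody(nids []int) ([]byte, error) {
	return json.Marshal(nodeStatusRequest{Nids: nids})
}

// getNodeStatus POSTs the NID list to <baseURL>/capmcd/get_node_status
// and returns the HTTP status code and the raw response body.
func getNodeStatus(baseURL string, nids []int) (int, string, error) {
	body, err := nodeStatusBody(nids)
	if err != nil {
		return 0, "", err
	}
	// Content type is an assumption; the curl example does not set one.
	resp, err := http.Post(baseURL+"/capmcd/get_node_status",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(raw), nil
}

func main() {
	code, body, err := getNodeStatus("http://localhost:27777", []int{7})
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(code)
	fmt.Println(body)
}
```

If capmcd is not listening on port 27777, the program reports the connection error instead of a status code, which is itself a quick way to tell whether the service is up.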

Running CAPMCD in Docker Container

From the root of this repo, build the image:

docker build -t cray/capmcd:1.0 .

Then run (add -d to the arguments list of docker run to run in detached/background mode):

docker run -p 27777:27777 --name capmcd cray/capmcd:1.0

All connections to localhost on port 27777 will flow through the running container.

Using CAPMC through the Cray CLI interface

Example to power on an entire cabinet:

cray capmc xname_on create --xnames x1000

Example to power off a chassis and all of its descendants:

cray capmc xname_off create --xnames x1000c0 --recursive
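To make the notion of "descendants" concrete, the following Go sketch tests whether one xname sits below another by walking up the hierarchy one field at a time. It assumes xnames are built from letter-plus-number fields (for example `x1000c0s0b0n0`); the helper names `parent` and `isDescendant` are hypothetical, and this is not how CAPMC or HSM resolve descendants internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// trailingField matches the last "letters + digits" field of an xname,
// for example "c0" in "x1000c0".
var trailingField = regexp.MustCompile(`[a-z]+[0-9]+$`)

// parent returns the xname with its last field removed, or "" when the
// xname has a single field (such as a cabinet) or is not well formed.
func parent(xname string) string {
	loc := trailingField.FindStringIndex(xname)
	if loc == nil || loc[0] == 0 {
		return ""
	}
	return xname[:loc[0]]
}

// isDescendant reports whether child sits below ancestor in the
// xname hierarchy.
func isDescendant(child, ancestor string) bool {
	for p := parent(child); p != ""; p = parent(p) {
		if p == ancestor {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(parent("x1000c0"))                        // x1000
	fmt.Println(isDescendant("x1000c0s0b0n0", "x1000c0")) // true
	fmt.Println(isDescendant("x1000c0", "x100"))          // false
}
```

Walking field by field rather than using a plain string-prefix test avoids false positives such as treating `x100` as an ancestor of `x1000c0`.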

Build, Tag, Push

./build_tag_push.sh -l :5000

On the target system, delete the running pod; the newly pushed image will be started in its place.

CAPMC CT Testing

In addition to the service itself, this repository builds and publishes cray-capmc-test images containing tests that verify CAPMC on live Shasta systems. The tests are invoked via helm test as part of the Continuous Test (CT) framework during CSM installs and upgrades. The version of the cray-capmc-test image (vX.Y.Z) should match the version of the cray-capmc image being tested, both of which are specified in the helm chart for the service.

API Map

When each API is, or will be, supported:

| Equivalent XC | v1 now | v1 future |
| --- | --- | --- |
| get_nid_map | get_nid_map | - |
| get_node_rules | get_node_rules | - |
| get_node_status | get_node_status | - |
| node_on | node_on | - |
| node_off | node_off | - |
| node_reinit | node_reinit | - |
| - | get_xname_status | - |
| - | xname_on | - |
| - | xname_off | - |
| - | xname_reinit | - |
| - | group_on | - |
| - | group_off | - |
| - | group_reinit | - |
| - | get_group_status | - |
| - | emergency_power_off | - |
| get_power_cap_capabilities | get_power_cap_capabilities | - |
| get_power_cap | get_power_cap | - |
| set_power_cap | set_power_cap | - |
| get_node_energy | get_node_energy | - |
| get_node_energy_stats | get_node_energy_stats | - |
| get_node_energy_counter | get_node_energy_counter | - |
| get_system_power | get_system_power | - |
| get_system_power_details | get_system_power_details | - |
| get_system_parameters | get_system_parameters | - |
| get_partition_map | - | get_partition_map |
| - | - | get_partition_status |
| - | - | partition_on |
| - | - | partition_off |
| - | - | partition_reinit |
| - | - | get_gpu_power_cap_capabilities |
| - | - | get_gpu_power_cap |
| - | - | set_gpu_power_cap |
| get_power_bias | - | get_power_bias (if needed) |
| set_power_bias | - | set_power_bias (if needed) |
| clr_power_bias | - | clr_power_bias (if needed) |
| set_power_bias_data | - | set_power_bias_data (if needed) |
| compute_power_bias | - | compute_power_bias (if needed) |
| get_freq_capabilities | - | get_freq_capabilities (if needed) |
| get_freq_limits | - | get_freq_limits (if needed) |
| set_freq_limits | - | set_freq_limits (if needed) |
| get_sleep_state_limit_capabilities | - | get_sleep_state_limit_capabilities (if needed) |
| set_sleep_state_limit | - | set_sleep_state_limit (if needed) |
| get_sleep_state_limit | - | get_sleep_state_limit (if needed) |
| get_mcdram_capabilities (Xeon Phi) | - | - |
| get_mcdram_cfg (Xeon Phi) | - | - |
| set_mcdram_cfg (Xeon Phi) | - | - |
| clr_mcdram_cfg (Xeon Phi) | - | - |
| get_numa_capabilities (Xeon Phi) | - | - |
| get_numa_cfg (Xeon Phi) | - | - |
| set_numa_cfg (Xeon Phi) | - | - |
| clr_numa_cfg (Xeon Phi) | - | - |
| get_ssd_enable (XC Only) | - | - |
| set_ssd_enable (XC Only) | - | - |
| clr_ssd_enable (XC Only) | - | - |
| get_ssds (XC Only) | - | - |
| get_ssd_diags (XC Only) | - | - |

Current Features

  • Power control
    • Redfish power status of components
    • Single components via NID or xname
    • Grouped components
    • Entire system (all or s0)
    • Per cabinet (x1000)
    • Ancestors and descendants of single component
    • Force option for immediate power off
    • Node power capping
    • Emergency Power Off at the Chassis level
    • Query of power data at node, system, and cabinet level

Future Features and updates

  • Backend performance improvements
  • Moving to a truly RESTful interface (v2)
  • Power control
    • Emergency Power Off at the iPDU levels
    • Power control of Mountain CDUs (won't/can't do)
    • Power control policies
    • Power control of Motivair door fans
    • Power control of in-rack River CDUs
  • Power capping and related for Mountain
    • Group level and system level power capping (if needed)
    • Power bias factors to individual nodes (if needed)
    • Query of power data at group level (if needed)
    • RAPL (Running Average Power Limiting) (if possible)
  • Node-level C-state/P-state handling (if needed and not handled by WLM)
  • GPU power capping
  • Powering off idle nodes (most likely a WLM function)
  • Rebooting nodes (most likely a CMS or WLM function)

Limitations

  • No Redfish interface to control Mountain CDUs
  • CMM and CEC cannot be powered off; they are always on while Mountain cabinets are plugged in and their breakers are on
  • Can only talk to components that exist in HSM