Cray Advanced Platform Monitoring and Control (CAPMC) provides a way to monitor and control certain components in a Shasta system. CAPMC exposes a RESTful interface for its monitoring and control capabilities and runs in the management plane in the SMS cluster. Administrator-level permissions are required for most operations. The CAPMC service relies on a running Hardware State Manager (HSM) service; the HSM contains all of the information CAPMC needs to communicate with the hardware.
Building CAPMC after the Repo split
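A local build is typically just a standard Go build of the daemon; the package path below is an assumption (adjust it to match the actual repo layout):
go build -o capmcd ./cmd/capmcd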
Starting capmcd:
./capmcd -http-listen="localhost:27777" -hsm=https://localhost:27779
By default, capmcd attempts to connect to a Postgres database. Use the following environment variables to specify where it should connect:
DB_HOSTNAME=somePostgresDB
DB_PORT=thePort
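For example, to point capmcd at a specific Postgres instance (the hostname and port below are placeholders):
DB_HOSTNAME=postgres.example.com DB_PORT=5432 ./capmcd -http-listen="localhost:27777" -hsm=https://localhost:27779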
Example curl command to verify the service is working:
curl -X POST -i -d '{"nids":[7]}' http://localhost:27777/capmcd/get_node_status
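A successful call returns CAPMC's standard JSON envelope; the response below is illustrative only, and the actual state buckets and NIDs reported depend on the system:
{"e":0,"err_msg":"","ready":[7]}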
From the root of this repo, build the image:
docker build -t cray/capmcd:1.0 .
Then run (add -d to the docker run arguments to run in detached/background mode):
docker run -p 27777:27777 --name capmcd cray/capmcd:1.0
All connections to localhost on port 27777 will flow through the running container.
Example to power on an entire cabinet:
cray capmc xname_on create --xnames x1000
Example to power off a Chassis and all of its descendants:
cray capmc xname_off create --xnames x1000c0 --recursive
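A status query follows the same CLI pattern; the form below is assumed to mirror the power commands above, with the xname as a placeholder:
cray capmc get_xname_status create --xnames x1000c0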
To deploy a development build, build, tag, and push the image:
./build_tag_push.sh -l :5000
On the target system, delete the running pod and the newly pushed image will be started in its place.
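One way to delete the pod (the namespace and label selector here are assumptions; adjust them to match the deployment):
kubectl -n services delete pod -l app.kubernetes.io/name=cray-capmc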
In addition to the service itself, this repository builds and publishes cray-capmc-test images containing tests that verify CAPMC on live Shasta systems. The tests are invoked via helm test as part of the Continuous Test (CT) framework during CSM installs and upgrades. The version of the cray-capmc-test image (vX.Y.Z) should match the version of the cray-capmc image being tested, both of which are specified in the helm chart for the service.
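To run the CT tests by hand on a live system, an invocation along these lines should work (the release name and namespace are assumptions):
helm test -n services cray-capmc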
The table below shows which CAPMC APIs are supported now and which are planned, along with the equivalent XC-series API where one exists (see the example after the table):
Equivalent XC API | v1 (now) | v1 (future) |
---|---|---|
get_nid_map | get_nid_map | - |
get_node_rules | get_node_rules | - |
get_node_status | get_node_status | - |
node_on | node_on | - |
node_off | node_off | - |
node_reinit | node_reinit | - |
- | get_xname_status | - |
- | xname_on | - |
- | xname_off | - |
- | xname_reinit | - |
- | group_on | - |
- | group_off | - |
- | group_reinit | - |
- | get_group_status | - |
- | emergency_power_off | - |
get_power_cap_capabilities | get_power_cap_capabilities | - |
get_power_cap | get_power_cap | - |
set_power_cap | set_power_cap | - |
get_node_energy | get_node_energy | - |
get_node_energy_stats | get_node_energy_stats | - |
get_node_energy_counter | get_node_energy_counter | - |
get_system_power | get_system_power | - |
get_system_power_details | get_system_power_details | - |
get_system_parameters | get_system_parameters | - |
get_partition_map | - | get_partition_map |
- | - | get_partition_status |
- | - | partition_on |
- | - | partition_off |
- | - | partition_reinit |
- | - | get_gpu_power_cap_capabilities |
- | - | get_gpu_power_cap |
- | - | set_gpu_power_cap |
get_power_bias | - | get_power_bias (if needed) |
set_power_bias | - | set_power_bias (if needed) |
clr_power_bias | - | clr_power_bias (if needed) |
set_power_bias_data | - | set_power_bias_data (if needed) |
compute_power_bias | - | compute_power_bias (if needed) |
get_freq_capabilities | - | get_freq_capabilities (if needed) |
get_freq_limits | - | get_freq_limits (if needed) |
set_freq_limits | - | set_freq_limits (if needed) |
get_sleep_state_limit_capabilities | - | get_sleep_state_limit_capabilities (if needed) |
set_sleep_state_limit | - | set_sleep_state_limit (if needed) |
get_sleep_state_limit | - | get_sleep_state_limit (if needed) |
get_mcdram_capabilities (Xeon Phi) | - | - |
get_mcdram_cfg (Xeon Phi) | - | - |
set_mcdram_cfg (Xeon Phi) | - | - |
clr_mcdram_cfg (Xeon Phi) | - | - |
get_numa_capabilities (Xeon Phi) | - | - |
get_numa_cfg (Xeon Phi) | - | - |
set_numa_cfg (Xeon Phi) | - | - |
clr_numa_cfg (Xeon Phi) | - | - |
get_ssd_enable (XC Only) | - | - |
set_ssd_enable (XC Only) | - | - |
clr_ssd_enable (XC Only) | - | - |
get_ssds (XC Only) | - | - |
get_ssd_diags (XC Only) | - | - |
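As a sketch, the xname-based v1 status API can be exercised directly much like the NID-based example above; the request body shape is assumed from the v1 API and the xname is a placeholder:
curl -X POST -i -d '{"xnames":["x1000c0s0b0n0"]}' http://localhost:27777/capmcd/get_xname_status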
Current features:
- Power control
  - Redfish power status of components
  - Single components via NID or xname
  - Grouped components
  - Entire system (all or s0)
  - Per cabinet (x1000)
  - Ancestors and descendants of a single component
  - Force option for immediate power off
- Node power capping (see the example after this list)
- Emergency Power Off at the Chassis level
- Query of power data at node, system, and cabinet level
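As an example of the node power capping query listed above (the NID is a placeholder and the request shape is assumed from the v1 API):
curl -X POST -i -d '{"nids":[7]}' http://localhost:27777/capmcd/get_power_cap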
Future features and work:
- Backend performance improvements
- Moving to a truly RESTful interface (v2)
- Power control
  - Emergency Power Off at the iPDU level
  - Power control of Mountain CDUs (won't/can't do)
  - Power control policies
  - Power control of Motivair door fans
  - Power control of in-rack River CDUs
- Power capping and related features for Mountain
- Group-level and system-level power capping (if needed)
- Power bias factors for individual nodes (if needed)
- Query of power data at the group level (if needed)
- RAPL (Running Average Power Limiting) (if possible)
- Node-level C-state/P-state handling (if needed and not handled by the WLM)
- GPU power capping
Things CAPMC will not do:
- Powering off idle nodes (most likely a WLM function)
- Rebooting nodes (most likely a CMS or WLM function)
Limitations:
- No Redfish interface exists to control Mountain CDUs
- The CMM and CEC cannot be powered off; they are always on when Mountain cabinets are plugged in and the breakers are on
- CAPMC can only talk to components that exist in HSM (see the example below)
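Because CAPMC can only act on components known to HSM, it can help to confirm that a component exists there before issuing commands; the HSM path below is an assumption based on the usual Hardware State Manager API, and authentication requirements depend on the deployment:
curl -k https://localhost:27779/hsm/v2/State/Components/x1000c0s0b0n0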