So you’re looking to:
- know if your ACM cluster is healthy, because you want to:
  - upgrade your hub?
  - or roll out a new stack of applications and configuration policies?
- check if your ACM hub can scale, because you want to:
  - deploy a bunch of new clusters
  - do capacity planning for future expansion
Well then you’ve come to the right place! Our aim here is to reveal, in a not-so-magical way, the basic, raw, low-level statistics about your currently operating RHACM hub environment. Why no magic? Well, we’re not quite there yet, but see below for how you can contribute 🚀. And why basic, raw, and low-level? Because we think this speaks the language of a cluster administrator, a platform engineer, a DevOps SRE, the operations team that needs to keep the lights on. We have tried to convey the information in its purest form so as not to introduce bias, and to let domain knowledge indicate which next steps to take.
- What is wrong with must-gather?
  - must-gather is great for what it does. The aim here is slightly different. While must-gather examines the current state using logs and the Kube API, this tool focuses on getting historical metrics out of the Prometheus on the Hub server. Along the way, it also looks at data from Custom Resources etc. so it can take a data-driven (metric-driven) approach. We are evaluating whether this information can also be collected automatically when must-gather is run. After all, this information embellishes what is in must-gather.
- Why not SLOs?
  - Creating SLOs is orthogonal to this effort, with some overlap. We do collect data here from metrics, and we can expand this to see if SLO values are being met. From the data we can also tell if the usage of ACM is within the defined boundaries of ACM. And the goal here is ultimately to **get inferences and recommendations automatically** - and NOT just reporting.
- Why not create dashboards out of this information?
  - One of the explicit goals is not to drown the user (or ourselves) in data. It is very appealing to throw the 50 or so graphs and csv files generated by the code into Grafana. But notice that, with a little bit of feature engineering, we have tried to consolidate all the relevant, important data into one master.csv (see the consolidation sketch just after this FAQ). Once this data is sent to us, we run ad hoc analytics on it to give our recommendations. If you have domain knowledge about ACM, you could use Excel tricks to get to an answer as well - but it would be laborious and time consuming.
- What is left to contribute?
  - A lot. This is mentioned in the Work-in-Progress section as well:
    - You could contribute to gathering more of the raw data. More operator data needs to be collected, and there is data from the search postgres DB etc. that still needs to be collected.
    - You could help formalize the ad hoc analytics that we run to give the recommendations. This will be full of ACM domain knowledge. In other words, contribute to automatically drawing inferences and recommendations from this data.
- Is Red Hat using this tool internally?
  - Yes! We run this to gather data from our ACM perf & scale testing environment for analysis. We routinely tweak this repo based on experience from perf & scale testing and other complex customer cases.
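To make the master.csv idea from the dashboards question above concrete, here is a minimal sketch of the consolidation step. It assumes per-metric csv files that each carry a shared `timestamp` column; the file and column names are illustrative, not necessarily what this repo uses.

```python
# Minimal sketch of consolidating per-metric csv files into one
# time-indexed master.csv. File names and the "timestamp" column are
# illustrative assumptions, not necessarily what this repo uses.
from functools import reduce

import pandas as pd

files = ["etcd.csv", "apiserver.csv", "clustercount.csv"]  # hypothetical inputs
frames = [pd.read_csv(f, parse_dates=["timestamp"]) for f in files]

# Outer-join every frame on the shared timestamp so gaps stay visible as NaN.
master = reduce(lambda left, right: pd.merge(left, right, on="timestamp", how="outer"), frames)
master.sort_values("timestamp").to_csv("master.csv", index=False)
```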
One of the main goals of this tool is to investigate scaling ACM. Below is a causal diagram of ACM from the standpoint of scalability. This is definitely not a full, complete ACM causal diagram; that would be much more complex. Let us take a moment to review this figure and get the key idea behind it.
The big black dots are the key drivers of ACM sizing, along with the number of clusters it is managing. In other words, if we know the:
- Num of apps & policies (i.e., how many applications and policies are defined on the cluster; this depends on the cluster size). For the sake of brevity this node represents both applications and policies, so it works whether there are only applications, only policies, or both.
- Time series count (depends on how large the clusters are and what kind of work is running on them)
- Resource count (depends on how large the clusters are and what kind of work is running on them)
then the ACM scaling model is **conditionally independent of the real cluster size**. Of course the number of clusters is still important. You can appreciate that, given this model, when we do real performance measurements we can simulate/create any number of clusters of any size (they could be kind clusters, or Single Node OpenShift clusters) rather than clusters of specific sizes. It is much simpler to do the former instead of the latter.
So, to trace one line of the flow end to end: Num of apps & policies drives the API Server Object target count and size, which in turn drives load on the ACM App & Policy Controllers. The ACM App & Policy Controllers are also influenced by the Cluster Count - i.e., the number of clusters into which the applications and policies have to be replicated. These in turn create resources on the Kube API Server, and those resources are persisted in etcd. Therefore etcd health is one of the key drivers of ACM Health. And etcd health is in turn dependent on Network health and Disk health.
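To make that trace concrete, the figure's edges can be written down as a small adjacency map and walked programmatically. This is only a sketch of the diagram as described in the prose above, not code from this repo:

```python
# The causal edges traced above, as a tiny adjacency map ("X drives Y").
# This is a sketch of the figure, not code from this repo.
DRIVES = {
    "Num of apps & policies": ["API Server Object target count and size"],
    "API Server Object target count and size": ["ACM App & Policy Controllers"],
    "Cluster Count": ["ACM App & Policy Controllers"],
    "ACM App & Policy Controllers": ["Kube API Server"],
    "Kube API Server": ["etcd"],
    "Network health": ["etcd"],
    "Disk health": ["etcd"],
    "etcd": ["ACM Health"],
}

def drivers_of(node: str) -> set[str]:
    """Return every node with a (transitive) causal path into `node`."""
    direct = {src for src, dsts in DRIVES.items() if node in dsts}
    result = set(direct)
    for d in direct:
        result |= drivers_of(d)
    return result

# Everything upstream of ACM Health, i.e. all of its key drivers.
print(sorted(drivers_of("ACM Health")))
```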
This is very much a work in progress.
- As of now, the only argument it supports is `prom`, which stands for the in-cluster Prometheus on the Hub. We will extend it to ACM Observability Thanos in the very near future.
- We have begun by looking at a few key operators of RHACM and grabbing their health.
- We have also gathered a current snapshot of a few Prometheus metrics to check the health of the containers, API Server, and etcd.
- We also gather a current snapshot of the Prometheus alerts firing on the Hub Server.
- We will continue to expand by looking at the entire set of RHACM operators (MCO, ManifestWork, Placement, etc.).
- The next few bold steps could be:
  - recommending an action to solve the problem by drawing inferences from the output.
  - inferring if the current size of the Hub cluster can handle more managed clusters.
  - inferring a pattern of usage of the Hub from which we could derive a new Hub size if the number of managed clusters increased from, say, x to 10x (10 to 300).
Connection to the in-cluster Prometheus works from OCP 4.10 upwards. That is because of route changes in the openshift-monitoring namespace.
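For reference, here is a minimal sketch of the kind of call the tool makes against the in-cluster Prometheus over its route. The route URL, token, and metric are illustrative placeholders (get a real token with `oc whoami -t`); this is not the repo's actual client code.

```python
# Minimal sketch of pulling a historical metric from the hub's in-cluster
# Prometheus over its route. URL, token, and metric below are illustrative
# placeholders, not values the tool hardcodes.
import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # assumed route host
TOKEN = "sha256~xyz"  # e.g. from `oc whoami -t`

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "query": "etcd_mvcc_db_total_size_in_bytes",
        "start": "2022-05-28T00:00:00Z",
        "end": "2022-05-29T00:00:00Z",
        "step": "5m",
    },
    verify=False,  # lab clusters often use self-signed router certs
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("instance"), len(series["values"]), "samples")
```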
This has been tested using Python 3.9.12. If you do not want the hassle of setting all this up, use the Docker method.
- Clone this git repo.
- Create and activate a venv:
  ```
  python -m venv .venv
  source .venv/bin/activate
  pip install -r src/supervisor/requirements.txt
  ```
  (`pip` can be `pip3` instead, depending on your Python configuration. That is one more reason to use the Docker container.)
- After all work is done, to exit the venv just run `deactivate`.
- `cd src/supervisor`
- Connect to your OpenShift cluster that runs RHACM via `oc login`. You will need kubeadmin access.
- Run `python entry.py prom` (same disclaimer: this could be `python3.11 entry.py prom` depending on your Python configuration).
- If you run `python entry.py prom 2>&1 | tee ../../output/report.txt`, then all the output on the screen will also be redirected to the `output/report.txt` file.
- If you want to run the notebooks under `causal-analysis`, you will have to install graphviz (`brew install graphviz` on macOS).
- To build:
  `docker build -t acm-inspector .`
- Or, if you cannot build, you can simply download the image and make it available locally:
  `docker pull quay.io/bjoydeep/acm-inspector:latest`
- To debug:
  `docker run -it acm-inspector /bin/sh`
- To run:
  `docker run -e OC_CLUSTER_URL=https://api.xxx.com:6443 -e OC_TOKEN=sha256~xyz -v /tmp:/acm-inspector/output quay.io/bjoydeep/acm-inspector`
  Note: the image name needs to be changed from `quay.io/bjoydeep/acm-inspector` to `acm-inspector` if you are using the local docker build. And the volume (`-v`) points to `/tmp` on the local machine; this should/could be changed depending on your need.
- Historical metric results are created as png & csv files under the `output` directory.
  - There is an `output/breakdown` directory which contains detailed png & csv files with breakdown metrics, for example by namespace or resource.
  - Under the `output` directory, all png & csv files contain metrics grouped by time only.
    - This allows us to merge all metrics into a csv called `master.csv` and to create a set of graphs (png) whose names match `master-*.png`. These visually correlate how the system is performing over time as managed clusters get added.
- The `master.csv` has metrics corresponding to almost every descendant of the big black dots. If some are missing, they should be added - and you can help! Let us take a look at what is there now (a sketch of running ad hoc analytics on this file follows the sample output below).

  | Node | Metrics collected |
  | --- | --- |
  | API Server Object target count and size | Yes, on count |
  | ACM App & Policy Operators | Yes |
  | ACM API | Need to be added - but should be same as API Server |
  | Kube API Server | Yes |
  | etcd | Yes |
  | etcd-Disk | Yes |
  | etcd-Network | Yes |
  | MCO Health | Yes |
  | Thanos health | Yes |
  | Obs Health | Can be inferred |
  | Obs API Health | To be added |
  | Indexer | Yes |
  | Postgres | Yes |
  | Search Health | Can be inferred |
  | Search API | Need to be added |
  | ACM Health | Can be inferred |
  | Cluster Count | Yes |
  | Num of apps & policies | Yes |
  | Time Series Count | Yes |
  | Kubernetes Resource Count | Yes |
Current
data is printed out in the screen as below:
Note: True
in the ouput means good status (though this is not fully working yet).
```
Starting to Run ACM Health Check - 2022-05-29 09:10:17.130964
============================
MCH Operator Health Check
{'name': 'multiclusterhub', 'CurrentVersion': '2.5.0', 'DesiredVersion': '2.5.0', 'Phase': 'Running', 'Health': True}
============ MCH Operator Health Check ============ True
ACM Pod/Container Health Check
container namespace RestartCount
0 console open-cluster-management 2
1 restic open-cluster-management-backup 3
2 thanos-store open-cluster-management-observability 3
==============================================
persistentvolumeclaim AvailPct
0 alertmanager-db-observability-alertmanager-0 98.102710
1 data-observability-thanos-receive-default-1 99.447934
2 data-observability-thanos-rule-2 97.888564
3 data-observability-thanos-store-shard-2-0 97.913342
4 alertmanager-db-observability-alertmanager-1 98.102710
5 data-observability-thanos-compact-0 99.924190
6 data-observability-thanos-receive-default-0 99.447965
7 data-observability-thanos-rule-0 97.888964
8 data-observability-thanos-store-shard-1-0 98.589732
9 grafana-dev 97.843333
10 alertmanager-db-observability-alertmanager-2 98.102710
11 data-observability-thanos-receive-default-2 99.448000
12 data-observability-thanos-rule-1 97.888964
13 data-observability-thanos-store-shard-0-0 98.467535
==============================================
namespace PodCount
0 open-cluster-management-backup 8
1 open-cluster-management-agent-addon 9
2 open-cluster-management-hub 12
3 open-cluster-management 33
4 open-cluster-management-addon-observability 2
5 open-cluster-management-observability 33
6 open-cluster-management-agent 7
=============================================
instance etcdDBSizeMB
0 10.0.151.183:9979 325.195312
1 10.0.191.35:9979 325.320312
2 10.0.202.61:9979 326.210938
=============================================
instance LeaderChanges
0 10.0.151.183:9979 1
1 10.0.191.35:9979 1
2 10.0.202.61:9979 1
=============================================
alertname value
0 APIRemovedInNextEUSReleaseInUse 3
1 ArgoCDSyncAlert 3
=============================================
resource APIServer99PctLatency
0 clusterserviceversions 4.290000
1 backups 0.991571
2 manifestworks 0.095667
3 multiclusterhubs 0.092000
4 clusterroles 0.084417
5 managedclusters 0.083000
6 authrequests 0.081500
7 projecthelmchartrepositories 0.072000
8 apirequestcounts 0.070978
9 ingresses 0.070000
=============================================
============ ACM Pod/Container Health Check - PLEASE CHECK to see if the results are concerning!! ============
Managed Cluster Health Check
[{'managedName': 'alpha', 'creationTimestamp': '2022-05-27T19:35:51Z', 'health': True}, {'managedName': 'aws-arm', 'creationTimestamp': '2022-05-16T19:38:43Z', 'health': True}, {'managedName': 'local-cluster', 'creationTimestamp': '2022-05-06T02:25:59Z', 'health': True}, {'managedName': 'machine-learning-team-03', 'creationTimestamp': '2022-05-13T21:41:39Z', 'health': True}, {'managedName': 'pipeline-team-04', 'creationTimestamp': '2022-05-13T21:45:44Z', 'health': True}]
============ Managed Cluster Health Check passed ============ False
Checking Addon Health of alpha
{'managedName': 'alpha', 'cluster-proxy': False, 'observability-controller': False, 'work-manager': False}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of aws-arm
{'managedName': 'aws-arm', 'application-manager': False, 'cert-policy-controller': False, 'cluster-proxy': False, 'config-policy-controller': False, 'governance-policy-framework': False, 'iam-policy-controller': False, 'managed-serviceaccount': False, 'observability-controller': False, 'search-collector': False, 'work-manager': False}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of local-cluster
{'managedName': 'local-cluster', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'observability-controller': False, 'work-manager': True}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of machine-learning-team-03
{'managedName': 'machine-learning-team-03', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'managed-serviceaccount': True, 'observability-controller': False, 'search-collector': True, 'work-manager': True}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of pipeline-team-04
{'managedName': 'pipeline-team-04', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'managed-serviceaccount': True, 'observability-controller': False, 'search-collector': True, 'work-manager': True}
============ Managed Cluster Addon Health Check passed ============ False
Node Health Check
{'name': 'ip-10-0-133-168.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-151-183.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-176-78.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-191-35.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-196-178.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-202-61.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
============ Node Health Check passed ============ True
============================
End ACM Health Check
```
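And, as promised above, here is a minimal sketch of the kind of ad hoc analytics we run on `master.csv`: plotting one metric against the managed-cluster count over time, in the spirit of the `master-*.png` graphs. The column names are assumptions about the csv layout, and pandas/matplotlib are assumed to be installed.

```python
# Sketch of ad hoc analytics over master.csv: one metric vs. managed-cluster
# count over time, in the spirit of the master-*.png graphs. Column names
# ("timestamp", "etcdDBSizeMB", "ClusterCount") are illustrative assumptions.
import matplotlib.pyplot as plt
import pandas as pd

master = pd.read_csv("output/master.csv", parse_dates=["timestamp"])

fig, ax = plt.subplots()
ax.plot(master["timestamp"], master["etcdDBSizeMB"], label="etcd DB size (MB)")
ax2 = ax.twinx()  # second y-axis so the two scales do not fight
ax2.plot(master["timestamp"], master["ClusterCount"], color="tab:red", label="managed clusters")
ax.set_xlabel("time")
fig.legend()
fig.savefig("output/master-etcd-vs-clusters.png")
```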
This is an open invitation to all RHACM users and developers to start contributing so that we can achieve the end goal faster and improve this!