So you’re looking to:
- know if your ACM cluster is healthy, because you want to:
  - upgrade your hub?
  - or roll out a new stack of applications and configuration policies?
- check if your ACM hub can scale, because you want to:
  - deploy a bunch of new clusters
  - do capacity planning for future expansion
Well then you’ve come to the right place! Our aim here is to reveal, in a not-so-magical way, the basic, raw, low-level statistics about your currently operating RHACM hub environment. Why no magic? Well, we’re not quite there yet, but see below for how you can contribute 🚀. And why basic, raw, and low-level? Because we think this speaks the language of a cluster administrator, a platform engineer, a DevOps SRE, the operations team that needs to keep the lights on. We have tried to convey the information in its purest form so as not to introduce bias, and to let domain knowledge indicate which next steps to take.
- What is wrong with must-gather?
  - must-gather is great for what it does. The aim here is slightly different. While must-gather examines the current state using logs and the Kube API, this tool focuses on getting historical metrics out of the Prometheus on the Hub server. Along the way, it also looks at data from Custom Resources etc. so it can take a data-driven (metric-driven) approach. We are evaluating whether this information can also be collected automatically when must-gather is run. After all, this information embellishes what is in must-gather.
- Why not SLOs?
  - Creating SLOs is orthogonal to this effort, with some overlap. We do collect data here from metrics, and we can expand this to see if SLO values are being met. From the data we can also tell if the usage of ACM is within the defined boundaries of ACM. And the goal here is ultimately to **get inferences and recommendations automatically** - and NOT just reporting.
- Why not create dashboards out of this information?
  - One of the explicit goals is not to drown the user (or ourselves) in data. It is very appealing to throw the 50 or so graphs and csv files generated by the code into Grafana. But notice that, with a little bit of feature engineering, we have tried to consolidate all the relevant, important data into one master.csv (see the consolidation sketch just after this FAQ). Once this data is sent to us, we run ad hoc analytics on it to give our recommendations. If you have domain knowledge about ACM, you could use Excel tricks to get to an answer as well - but it would be laborious and time consuming.
- What is left to contribute?
  - A lot. This is mentioned in the Work-in-Progress section as well:
    - You could contribute to gathering more of the raw data. More operator data needs to be collected, and there is data from the search postgres DB etc. that still needs to be collected.
    - You could help formalize the ad hoc analytics that we run to give the recommendations. This will be full of ACM domain knowledge. In other words, contribute to automatically drawing inferences and recommendations from this data.
- Is Red Hat using this tool internally?
  - Yes! We run this to gather data from our ACM perf & scale testing environment for analysis. We routinely tweak this repo based on experience from perf & scale testing and other complex customer cases.
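To make the master.csv idea from the dashboards question above concrete, here is a minimal sketch of the consolidation step. It assumes per-metric csv files that each carry a shared `timestamp` column; the file and column names are illustrative, not necessarily what this repo uses.

```python
# Minimal sketch of consolidating per-metric csv files into one
# time-indexed master.csv. File names and the "timestamp" column are
# illustrative assumptions, not necessarily what this repo uses.
from functools import reduce

import pandas as pd

files = ["etcd.csv", "apiserver.csv", "clustercount.csv"]  # hypothetical inputs
frames = [pd.read_csv(f, parse_dates=["timestamp"]) for f in files]

# Outer-join every frame on the shared timestamp so gaps stay visible as NaN.
master = reduce(lambda left, right: pd.merge(left, right, on="timestamp", how="outer"), frames)
master.sort_values("timestamp").to_csv("master.csv", index=False)
```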
One of the main goals of this tool is to investigate scaling ACM. Below is a causal diagram of ACM from the standpoint of scalability. This is definitely not a full, complete ACM causal diagram; that would be much more complex. Let us take a moment to review this figure and get the key idea behind it.
The big black dots are the key drivers of ACM sizing, along with the number of clusters it is managing. In other words, if we know the:
- Num of apps & policies (i.e., how many applications and policies are defined on the cluster; this depends on the cluster size). For the sake of brevity this node represents both applications and policies, so it works whether there are only applications, only policies, or both.
- Time series count (depends on how large the clusters are and what kind of work is running on them)
- Resource count (depends on how large the clusters are and what kind of work is running on them)
then the ACM scaling model is **conditionally independent of the real cluster size**. Of course the number of clusters is still important. You can appreciate that, given this model, when we do real performance measurements we can simulate/create any number of clusters of any size (they could be kind clusters, or Single Node OpenShift clusters) rather than clusters of specific sizes. It is much simpler to do the former instead of the latter.
So, to trace one line of the flow end to end: Num of apps & policies drives the API Server Object target count and size, which in turn drives load on the ACM App & Policy Controllers. The ACM App & Policy Controllers are also influenced by the Cluster Count - i.e., the number of clusters into which the applications and policies have to be replicated. These in turn create resources on the Kube API Server, and those resources are persisted in etcd. Therefore etcd health is one of the key drivers of ACM Health. And etcd health is in turn dependent on Network health and Disk health.
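To make that trace concrete, the figure's edges can be written down as a small adjacency map and walked programmatically. This is only a sketch of the diagram as described in the prose above, not code from this repo:

```python
# The causal edges traced above, as a tiny adjacency map ("X drives Y").
# This is a sketch of the figure, not code from this repo.
DRIVES = {
    "Num of apps & policies": ["API Server Object target count and size"],
    "API Server Object target count and size": ["ACM App & Policy Controllers"],
    "Cluster Count": ["ACM App & Policy Controllers"],
    "ACM App & Policy Controllers": ["Kube API Server"],
    "Kube API Server": ["etcd"],
    "Network health": ["etcd"],
    "Disk health": ["etcd"],
    "etcd": ["ACM Health"],
}

def drivers_of(node: str) -> set[str]:
    """Return every node with a (transitive) causal path into `node`."""
    direct = {src for src, dsts in DRIVES.items() if node in dsts}
    result = set(direct)
    for d in direct:
        result |= drivers_of(d)
    return result

# Everything upstream of ACM Health, i.e. all of its key drivers.
print(sorted(drivers_of("ACM Health")))
```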
This is very much a work in progress.
- As of now, the only argument it supports is `prom`, which stands for the in-cluster Prometheus on the Hub. We will extend it to ACM Observability Thanos in the very near future.
- We have begun by looking at a few key operators of RHACM and grabbing their health.
- We have also gathered a current snapshot of a few Prometheus metrics to check the health of the containers, API Server, and etcd.
- We also gather a current snapshot of the Prometheus alerts firing on the Hub Server.
- We will continue to expand by looking at the entire set of RHACM operators (MCO, ManifestWork, Placement, etc.).
- The next few bold steps could be:
  - recommending an action to solve the problem by drawing inferences from the output.
  - inferring if the current size of the Hub cluster can handle more managed clusters.
  - inferring a pattern of usage of the Hub from which we could derive a new Hub size if the number of managed clusters increased from, say, x to 10x (10 to 300).
Connection to the in-cluster Prometheus works from OCP 4.10 upwards. That is because of route changes in the openshift-monitoring namespace.
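For reference, here is a minimal sketch of the kind of call the tool makes against the in-cluster Prometheus over its route. The route URL, token, and metric are illustrative placeholders (get a real token with `oc whoami -t`); this is not the repo's actual client code.

```python
# Minimal sketch of pulling a historical metric from the hub's in-cluster
# Prometheus over its route. URL, token, and metric below are illustrative
# placeholders, not values the tool hardcodes.
import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # assumed route host
TOKEN = "sha256~xyz"  # e.g. from `oc whoami -t`

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "query": "etcd_mvcc_db_total_size_in_bytes",
        "start": "2022-05-28T00:00:00Z",
        "end": "2022-05-29T00:00:00Z",
        "step": "5m",
    },
    verify=False,  # lab clusters often use self-signed router certs
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("instance"), len(series["values"]), "samples")
```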
This has been tested using Python 3.9.12. If you do not want the hassle of setting all this up, use the Docker method.
- Clone this git repo.
- Create and activate a venv:
  ```
  python -m venv .venv
  source .venv/bin/activate
  pip install -r src/supervisor/requirements.txt
  ```
  (`pip` can be `pip3` instead, depending on your Python configuration. That is one more reason to use the Docker container.)
- After all work is done, to exit the venv just run `deactivate`.
- `cd src/supervisor`
- Connect to your OpenShift cluster that runs RHACM via `oc login`. You will need kubeadmin access.
- Run `python entry.py prom` (same disclaimer: this could be `python3.11 entry.py prom` depending on your Python configuration).
- If you run `python entry.py prom 2>&1 | tee ../../output/report.txt`, then all the output on the screen will also be redirected to the `output/report.txt` file.
- If you want to run the notebooks under `causal-analysis`, you will have to install graphviz (`brew install graphviz` on macOS).
- To build:
  `docker build -t acm-inspector .`
- Or, if you cannot build, you can simply download the image and make it available locally:
  `docker pull quay.io/bjoydeep/acm-inspector:latest`
- To debug:
  `docker run -it acm-inspector /bin/sh`
- To run:
  `docker run -e OC_CLUSTER_URL=https://api.xxx.com:6443 -e OC_TOKEN=sha256~xyz -v /tmp:/acm-inspector/output quay.io/bjoydeep/acm-inspector`
  Note: the image name needs to be changed from `quay.io/bjoydeep/acm-inspector` to `acm-inspector` if you are using the local docker build. And the volume (`-v`) points to `/tmp` on the local machine; this should/could be changed depending on your need.
- Historical metric results are created as png & csv files under the `output` directory.
  - There is an `output/breakdown` directory which contains detailed png & csv files with breakdown metrics, for example by namespace or resource.
  - Under the `output` directory, all png & csv files contain metrics grouped by time only.
    - This allows us to merge all metrics into a csv called `master.csv` and to create a set of graphs (png) whose names match `master-*.png`. These visually correlate how the system is performing over time as managed clusters get added.
- The `master.csv` has metrics corresponding to almost every descendant of the big black dots. If some are missing, they should be added - and you can help! Let us take a look at what is there now (a sketch of running ad hoc analytics on this file follows the sample output below).

  | Node | Metrics collected |
  | --- | --- |
  | API Server Object target count and size | Yes, on count |
  | ACM App & Policy Operators | Yes |
  | ACM API | Need to be added - but should be same as API Server |
  | Kube API Server | Yes |
  | etcd | Yes |
  | etcd-Disk | Yes |
  | etcd-Network | Yes |
  | MCO Health | Yes |
  | Thanos health | Yes |
  | Obs Health | Can be inferred |
  | Obs API Health | To be added |
  | Indexer | Yes |
  | Postgres | Yes |
  | Search Health | Can be inferred |
  | Search API | Need to be added |
  | ACM Health | Can be inferred |
  | Cluster Count | Yes |
  | Num of apps & policies | Yes |
  | Time Series Count | Yes |
  | Kubernetes Resource Count | Yes |
Current
data is printed out in the screen as below:
Note: True
in the ouput means good status (though this is not fully working yet).
```
Starting to Run ACM Health Check - 2022-05-29 09:10:17.130964
============================
MCH Operator Health Check
{'name': 'multiclusterhub', 'CurrentVersion': '2.5.0', 'DesiredVersion': '2.5.0', 'Phase': 'Running', 'Health': True}
============ MCH Operator Health Check ============ True
ACM Pod/Container Health Check
container namespace RestartCount
0 console open-cluster-management 2
1 restic open-cluster-management-backup 3
2 thanos-store open-cluster-management-observability 3
==============================================
persistentvolumeclaim AvailPct
0 alertmanager-db-observability-alertmanager-0 98.102710
1 data-observability-thanos-receive-default-1 99.447934
2 data-observability-thanos-rule-2 97.888564
3 data-observability-thanos-store-shard-2-0 97.913342
4 alertmanager-db-observability-alertmanager-1 98.102710
5 data-observability-thanos-compact-0 99.924190
6 data-observability-thanos-receive-default-0 99.447965
7 data-observability-thanos-rule-0 97.888964
8 data-observability-thanos-store-shard-1-0 98.589732
9 grafana-dev 97.843333
10 alertmanager-db-observability-alertmanager-2 98.102710
11 data-observability-thanos-receive-default-2 99.448000
12 data-observability-thanos-rule-1 97.888964
13 data-observability-thanos-store-shard-0-0 98.467535
==============================================
namespace PodCount
0 open-cluster-management-backup 8
1 open-cluster-management-agent-addon 9
2 open-cluster-management-hub 12
3 open-cluster-management 33
4 open-cluster-management-addon-observability 2
5 open-cluster-management-observability 33
6 open-cluster-management-agent 7
=============================================
instance etcdDBSizeMB
0 10.0.151.183:9979 325.195312
1 10.0.191.35:9979 325.320312
2 10.0.202.61:9979 326.210938
=============================================
instance LeaderChanges
0 10.0.151.183:9979 1
1 10.0.191.35:9979 1
2 10.0.202.61:9979 1
=============================================
alertname value
0 APIRemovedInNextEUSReleaseInUse 3
1 ArgoCDSyncAlert 3
=============================================
resource APIServer99PctLatency
0 clusterserviceversions 4.290000
1 backups 0.991571
2 manifestworks 0.095667
3 multiclusterhubs 0.092000
4 clusterroles 0.084417
5 managedclusters 0.083000
6 authrequests 0.081500
7 projecthelmchartrepositories 0.072000
8 apirequestcounts 0.070978
9 ingresses 0.070000
=============================================
============ ACM Pod/Container Health Check - PLEASE CHECK to see if the results are concerning!! ============
Managed Cluster Health Check
[{'managedName': 'alpha', 'creationTimestamp': '2022-05-27T19:35:51Z', 'health': True}, {'managedName': 'aws-arm', 'creationTimestamp': '2022-05-16T19:38:43Z', 'health': True}, {'managedName': 'local-cluster', 'creationTimestamp': '2022-05-06T02:25:59Z', 'health': True}, {'managedName': 'machine-learning-team-03', 'creationTimestamp': '2022-05-13T21:41:39Z', 'health': True}, {'managedName': 'pipeline-team-04', 'creationTimestamp': '2022-05-13T21:45:44Z', 'health': True}]
============ Managed Cluster Health Check passed ============ False
Checking Addon Health of alpha
{'managedName': 'alpha', 'cluster-proxy': False, 'observability-controller': False, 'work-manager': False}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of aws-arm
{'managedName': 'aws-arm', 'application-manager': False, 'cert-policy-controller': False, 'cluster-proxy': False, 'config-policy-controller': False, 'governance-policy-framework': False, 'iam-policy-controller': False, 'managed-serviceaccount': False, 'observability-controller': False, 'search-collector': False, 'work-manager': False}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of local-cluster
{'managedName': 'local-cluster', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'observability-controller': False, 'work-manager': True}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of machine-learning-team-03
{'managedName': 'machine-learning-team-03', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'managed-serviceaccount': True, 'observability-controller': False, 'search-collector': True, 'work-manager': True}
============ Managed Cluster Addon Health Check passed ============ False
Checking Addon Health of pipeline-team-04
{'managedName': 'pipeline-team-04', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'managed-serviceaccount': True, 'observability-controller': False, 'search-collector': True, 'work-manager': True}
============ Managed Cluster Addon Health Check passed ============ False
Node Health Check
{'name': 'ip-10-0-133-168.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-151-183.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-176-78.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-191-35.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-196-178.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-202-61.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
============ Node Health Check passed ============ True
============================
End ACM Health Check
```
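And, as promised above, here is a minimal sketch of the kind of ad hoc analytics we run on `master.csv`: plotting one metric against the managed-cluster count over time, in the spirit of the `master-*.png` graphs. The column names are assumptions about the csv layout, and pandas/matplotlib are assumed to be installed.

```python
# Sketch of ad hoc analytics over master.csv: one metric vs. managed-cluster
# count over time, in the spirit of the master-*.png graphs. Column names
# ("timestamp", "etcdDBSizeMB", "ClusterCount") are illustrative assumptions.
import matplotlib.pyplot as plt
import pandas as pd

master = pd.read_csv("output/master.csv", parse_dates=["timestamp"])

fig, ax = plt.subplots()
ax.plot(master["timestamp"], master["etcdDBSizeMB"], label="etcd DB size (MB)")
ax2 = ax.twinx()  # second y-axis so the two scales do not fight
ax2.plot(master["timestamp"], master["ClusterCount"], color="tab:red", label="managed clusters")
ax.set_xlabel("time")
fig.legend()
fig.savefig("output/master-etcd-vs-clusters.png")
```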
This is an open invitation to all RHACM users and developers to start contributing so that we can achieve the end goal faster and improve this!