acm-inspector

Motivation

So you’re looking to:

  1. know if your ACM cluster is healthy because you want to
    • upgrade your hub?
    • or roll out a new stack of applications and configuration policies?
  2. check if the ACM hub can scale because:
    • you want to deploy a bunch of new clusters
  3. do capacity planning for future expansion

Well then you’ve come to the right place! We aim to reveal, in a not-so-magical way, the basic, raw, low-level statistics about your currently operating RHACM hub environment. Why no magic? Well, we are not quite there yet, but see below for how you can contribute 🚀. And why basic, raw, and low-level? We think this speaks the language of a cluster administrator, a platform engineer, a DevOps SRE, the operations team that needs to keep the lights on. We try to convey the information in its purest form so as not to introduce bias, and to let domain knowledge indicate which next steps to take.

FAQ

  1. What is wrong with must-gather?
    • must-gather is great for what it does. The aim here is slightly different. While must-gather examines the current state using logs and the Kube API, this tool focuses on getting historical metrics out of Prometheus on the Hub server. Along the way, it also looks at data from Custom Resources and the like, so it can take a data-driven (metric-driven) approach. We are evaluating whether this information can also be collected automatically when must-gather is run. After all, this information embellishes what is in must-gather.
  2. Why not SLOs?
    • Creating SLOs is orthogonal to this effort, with some overlap. We do collect data here from metrics, and we can expand this to check whether SLO values are being met. From the data we can also tell if the usage of ACM is within the defined boundaries of ACM. And the goal here is to ultimately produce inferences and recommendations automatically - NOT just reporting.
  3. Why not create dashboards out of this information?
    • One of the explicit goals is not to drown the user (or ourselves) with data. It is very appealing to throw the 50 or so graphs and CSVs generated by the code into Grafana. But if you notice, by doing a little bit of feature engineering we have tried to consolidate all the relevant, important data into one master.csv. Once this data is sent to us, we run ad hoc analytics on it to give our recommendations. If you have domain knowledge about ACM, you could use Excel tricks to get to an answer as well - but it would be laborious and time consuming.
  4. What is left to contribute?
    • A lot. This is mentioned in the Work-in-Progress section as well:

    • You could contribute by gathering more of the raw data. More operator data needs to be collected, and there is data from the search Postgres DB and elsewhere that still needs to be gathered.

    • You could help formalize the ad hoc analytics that we run to produce the recommendations. This will be full of ACM domain knowledge. In other words, contribute to automatically drawing inferences and recommendations from this data.

  5. Is Red Hat using this tool internally?
    • Yes! We run this to gather data from our ACM perf-scale testing environment for analysis. We routinely tweak this repo based on experience from perf-scale and other complex customer cases.

ACM Domain Knowledge

One of the main goals of this tool is to investigate how ACM scales. Below is a causal diagram of ACM from the standpoint of scalability. This is definitely not a complete ACM causal diagram - that would be much more complex. Let us take a moment to review this figure and get the key idea behind it. (Figure: causal diagram describing the ACM scalability model.)

The big black dots are the key drivers of ACM sizing, along with the number of clusters it is managing. In other words, if we know the:

  • Num of apps & policies (i.e. how many applications and policies are defined on the cluster); this depends on the cluster size. For the sake of brevity this node represents both applications and policies, so it works if there are only applications, only policies, or both.
  • Time series count (depends on how large the clusters are and what kind of work is running on them)
  • Resource count (depends on how large the clusters are and what kind of work is running on them)

then the ACM scaling model is conditionally independent of the real cluster size. Of course the number of clusters is still important. You can appreciate that, given this model, when we do real performance measurement we can simulate/create a number of clusters of any size (they could be kind clusters, they could be Single Node OpenShift clusters) rather than clusters of specific sizes. It is much simpler to do the former instead of the latter.

So, to trace one line of the flow end to end: Num of apps & policies drives the API Server Object target count and size, which in turn drives load on the ACM App & Policy Controllers. The ACM App & Policy Controllers are also influenced by the Cluster Count - i.e. the number of clusters into which the applications and policies have to be replicated. These controllers in turn create resources on the Kube API Server, and those resources are stored in etcd. Therefore etcd health is one of the key drivers of ACM health, and etcd health is in turn dependent on network health and disk health.
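
To make the chain above concrete, here is a minimal sketch that draws just this simplified flow with the Python graphviz package (the same dependency the causal-analysis notebooks use; see the setup notes below). The node names follow the prose above and are illustrative only - this is not the full causal model.

    # Minimal sketch of the simplified scalability chain described above.
    # Needs the graphviz Python package plus the graphviz binaries
    # (e.g. `brew install graphviz` on macOS).
    from graphviz import Digraph

    g = Digraph("acm-scalability", format="png")

    # Key drivers (the "big black dots" in the causal diagram)
    for driver in ["Num of apps & policies", "Cluster Count",
                   "Time series count", "Resource count"]:
        g.node(driver, style="filled", fillcolor="lightgrey")

    # One end-to-end line of the flow traced in the text
    g.edge("Num of apps & policies", "API Server Object count and size")
    g.edge("API Server Object count and size", "ACM App & Policy Controllers")
    g.edge("Cluster Count", "ACM App & Policy Controllers")
    g.edge("ACM App & Policy Controllers", "Kube API Server")
    g.edge("Kube API Server", "etcd")
    g.edge("Disk health", "etcd")
    g.edge("Network health", "etcd")
    g.edge("etcd", "ACM Health")

    g.render("acm-scalability-sketch")  # writes acm-scalability-sketch.png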

Work-in-Progress

This is very much a work in progress.

  • As of now, the only argument it supports is prom. This stands for the in-cluster Prometheus on the Hub. We will extend it to ACM Observability Thanos in the very near future.

  • We have begun by looking at a few key operators of RHACM and grabbing the health of those.

  • We have also gathered a current snapshot of a few Prometheus metrics to check the health of the containers, the API Server, and etcd.

  • We also gather a current snapshot of the Prometheus alerts firing on the Hub server.

  • We will continue to expand by looking at the entire set of RHACM operators (MCO, ManifestWork, Placement etc).

  • The next few bold steps could be:

    - recommending an action to solve the problem by drawing inferences from the output.
    - inferring whether the current size of the Hub cluster can handle more managed clusters.
    - inferring a pattern of usage of the Hub from which we could project a new Hub size if the number of managed clusters increased from, say, x to 10x or more (e.g. from 10 to 300).
    

Note

Connection to the in-cluster Prometheus works from OCP 4.10 upwards. That is because of route changes in the openshift-monitoring namespace.
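
For reference, here is a minimal sketch of what such a connection can look like from Python. The route name (prometheus-k8s), the environment variables, and the sample query are illustrative assumptions, not the exact code in this repo; the token can come from oc whoami -t (or the same OC_TOKEN used for the Docker run below).

    # Minimal sketch: query the in-cluster Prometheus through its route in the
    # openshift-monitoring namespace. Route name, token source and sample query
    # are assumptions for illustration only.
    import os
    import requests

    # e.g. prometheus-k8s-openshift-monitoring.apps.<cluster-domain>;
    # discoverable with: oc get route prometheus-k8s -n openshift-monitoring
    PROM_HOST = os.environ["PROM_ROUTE_HOST"]
    TOKEN = os.environ["OC_TOKEN"]  # e.g. output of `oc whoami -t`

    resp = requests.get(
        f"https://{PROM_HOST}/api/v1/query",
        params={"query": "etcd_mvcc_db_total_size_in_bytes"},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # or point verify at the cluster CA bundle instead
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        print(result["metric"].get("instance"), result["value"][1])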

To run this using your own python env

This has been tested using Python 3.9.12. If you do not want the hassle of setting all this up, use the Docker method below.

  • clone this git repo

  • Create and activate a venv, then install the requirements:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r src/supervisor/requirements.txt
    
  • It can be pip3 instead of pip depending on your Python configuration. That is one more reason to use the Docker container.

  • After all work is done, to exit the venv, just run: deactivate

  • cd src/supervisor

  • connect to the OpenShift cluster that runs RHACM via oc login. You will need kubeadmin access.

  • run python entry.py prom

    • (same disclaimer as above - it could be: python3.11 entry.py prom)
  • if you run python entry.py prom 2>&1 | tee ../../output/report.txt, then all the output on the screen will also be captured in the output/report.txt file.

  • If you want to run the notebooks under causal-analysis, you will have to install graphviz first: brew install graphviz on macOS.

Using Docker

  1. To Build
    docker build -t acm-inspector .
    
  2. Or, if you cannot build, you can simply pull the image to make it available locally.
    docker pull quay.io/bjoydeep/acm-inspector:latest
    
  3. To Debug
    docker run -it acm-inspector /bin/sh
    
  4. To Run
    docker run -e OC_CLUSTER_URL=https://api.xxx.com:6443 -e OC_TOKEN=sha256~xyz -v /tmp:/acm-inspector/output quay.io/bjoydeep/acm-inspector
    
    Note: The image name needs to be changed from quay.io/bjoydeep/acm-inspector to acm-inspector if you are using the local Docker build. The volume mount (-v) points to /tmp on the local machine; change it to suit your needs.

Result

  1. Historical metric results are created as png & csv files under the output directory

    • The output/breakdown directory contains detailed png & csv files with breakdown metrics, for example by namespace or resource
    • Under the output directory itself, all png & csv files contain metrics grouped by time only
    • This allows us to merge all metrics into a csv called master.csv and to create a set of graphs (png) whose names start with master-*.png. These visually correlate how the system performs over time as managed clusters get added.
  2. The master.csv has metrics corresponding to almost every descendant of the big black dots. If some are missing, they should be added - and you can help! Let us take a look at what is there now (a minimal sketch for exploring master.csv follows the sample output below).

    Node                                      Metrics collected
    API Server Object target count and size   Yes, on count
    ACM App & Policy Operators                Yes
    ACM API                                   Needs to be added - but should be same as API Server
    Kube API Server                           Yes
    etcd                                      Yes
    etcd-Disk                                 Yes
    etcd-Network                              Yes
    MCO Health                                Yes
    Thanos health                             Yes
    Obs Health                                Can be inferred
    Obs API Health                            To be added
    Indexer                                   Yes
    Postgres                                  Yes
    Search Health                             Can be inferred
    Search API                                Needs to be added
    ACM Health                                Can be inferred
    Cluster Count                             Yes
    Num of apps & policies                    Yes
    Time Series Count                         Yes
    Kubernetes Resource Count                 Yes
  3. Current data is printed out on the screen as below:

Note: True in the output means a good status (though this is not fully working yet).

Starting to Run ACM Health Check -  2022-05-29 09:10:17.130964

============================
MCH Operator Health Check
{'name': 'multiclusterhub', 'CurrentVersion': '2.5.0', 'DesiredVersion': '2.5.0', 'Phase': 'Running', 'Health': True}
 ============ MCH Operator Health Check ============  True
ACM Pod/Container Health Check
      container                              namespace  RestartCount
0       console                open-cluster-management             2
1        restic         open-cluster-management-backup             3
2  thanos-store  open-cluster-management-observability             3
==============================================
                           persistentvolumeclaim   AvailPct
0   alertmanager-db-observability-alertmanager-0  98.102710
1    data-observability-thanos-receive-default-1  99.447934
2               data-observability-thanos-rule-2  97.888564
3      data-observability-thanos-store-shard-2-0  97.913342
4   alertmanager-db-observability-alertmanager-1  98.102710
5            data-observability-thanos-compact-0  99.924190
6    data-observability-thanos-receive-default-0  99.447965
7               data-observability-thanos-rule-0  97.888964
8      data-observability-thanos-store-shard-1-0  98.589732
9                                    grafana-dev  97.843333
10  alertmanager-db-observability-alertmanager-2  98.102710
11   data-observability-thanos-receive-default-2  99.448000
12              data-observability-thanos-rule-1  97.888964
13     data-observability-thanos-store-shard-0-0  98.467535
==============================================
                                     namespace  PodCount
0               open-cluster-management-backup         8
1          open-cluster-management-agent-addon         9
2                  open-cluster-management-hub        12
3                      open-cluster-management        33
4  open-cluster-management-addon-observability         2
5        open-cluster-management-observability        33
6                open-cluster-management-agent         7
=============================================
            instance  etcdDBSizeMB
0  10.0.151.183:9979    325.195312
1   10.0.191.35:9979    325.320312
2   10.0.202.61:9979    326.210938
=============================================
            instance  LeaderChanges
0  10.0.151.183:9979              1
1   10.0.191.35:9979              1
2   10.0.202.61:9979              1
=============================================
                         alertname  value
0  APIRemovedInNextEUSReleaseInUse      3
1                  ArgoCDSyncAlert      3
=============================================
                       resource  APIServer99PctLatency
0        clusterserviceversions               4.290000
1                       backups               0.991571
2                 manifestworks               0.095667
3              multiclusterhubs               0.092000
4                  clusterroles               0.084417
5               managedclusters               0.083000
6                  authrequests               0.081500
7  projecthelmchartrepositories               0.072000
8              apirequestcounts               0.070978
9                     ingresses               0.070000
=============================================
 ============ ACM Pod/Container Health Check  -  PLEASE CHECK to see if the results are concerning!! ============
Managed Cluster Health Check
[{'managedName': 'alpha', 'creationTimestamp': '2022-05-27T19:35:51Z', 'health': True}, {'managedName': 'aws-arm', 'creationTimestamp': '2022-05-16T19:38:43Z', 'health': True}, {'managedName': 'local-cluster', 'creationTimestamp': '2022-05-06T02:25:59Z', 'health': True}, {'managedName': 'machine-learning-team-03', 'creationTimestamp': '2022-05-13T21:41:39Z', 'health': True}, {'managedName': 'pipeline-team-04', 'creationTimestamp': '2022-05-13T21:45:44Z', 'health': True}]
 ============ Managed Cluster Health Check passed ============  False
Checking Addon Health of  alpha
{'managedName': 'alpha', 'cluster-proxy': False, 'observability-controller': False, 'work-manager': False}
 ============ Managed Cluster Addon Health Check passed ============  False
Checking Addon Health of  aws-arm
{'managedName': 'aws-arm', 'application-manager': False, 'cert-policy-controller': False, 'cluster-proxy': False, 'config-policy-controller': False, 'governance-policy-framework': False, 'iam-policy-controller': False, 'managed-serviceaccount': False, 'observability-controller': False, 'search-collector': False, 'work-manager': False}
 ============ Managed Cluster Addon Health Check passed ============  False
Checking Addon Health of  local-cluster
{'managedName': 'local-cluster', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'observability-controller': False, 'work-manager': True}
 ============ Managed Cluster Addon Health Check passed ============  False
Checking Addon Health of  machine-learning-team-03
{'managedName': 'machine-learning-team-03', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'managed-serviceaccount': True, 'observability-controller': False, 'search-collector': True, 'work-manager': True}
 ============ Managed Cluster Addon Health Check passed ============  False
Checking Addon Health of  pipeline-team-04
{'managedName': 'pipeline-team-04', 'application-manager': True, 'cert-policy-controller': True, 'cluster-proxy': True, 'config-policy-controller': True, 'governance-policy-framework': True, 'iam-policy-controller': True, 'managed-serviceaccount': True, 'observability-controller': False, 'search-collector': True, 'work-manager': True}
 ============ Managed Cluster Addon Health Check passed ============  False
Node Health Check
{'name': 'ip-10-0-133-168.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-151-183.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-176-78.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-191-35.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-196-178.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
{'name': 'ip-10-0-202-61.us-east-2.compute.internal', 'MemoryPressure': 'False', 'DiskPressure': 'False', 'PIDPressure': 'False', 'Ready': 'True'}
 ============ Node Health Check passed ============  True
============================

 End ACM Health Check
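
As noted in the Result section above, a quick way to explore the consolidated data after a run is to load output/master.csv with pandas. This is a minimal sketch under the assumption that the first column is the timestamp; the remaining column names depend on the metrics collected in your run, so adjust accordingly.

    # Minimal sketch for exploring the consolidated output. Column names are
    # run-dependent; adjust to whatever your master.csv actually contains.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("output/master.csv")
    print(df.columns.tolist())   # see which metrics were merged in
    print(df.describe())         # quick distribution check for each metric

    # Plot every numeric column over time to eyeball correlations, assuming
    # the first column is the timestamp (adjust if your file differs).
    df = df.set_index(df.columns[0])
    df.select_dtypes("number").plot(
        subplots=True, figsize=(10, 2 * len(df.columns))
    )
    plt.tight_layout()
    plt.savefig("master-exploration.png")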

Please contribute!

This is an open invitation to all RHACM users and developers to contribute, so that we can achieve the end goal faster and improve this!