/performance-analyzer-rca

The Performance Analyzer RCA is a framework that builds on the Performance Analyzer engine to support root cause analysis (RCA) of performance and reliability problems for OpenSearch instances.

Primary LanguageJavaApache License 2.0Apache-2.0

Java CI codecov Documentation Chat PRs welcome!

OpenSearch Performance Analyzer RCA

The Performance Analyzer RCA is a framework that builds on the Performance Analyzer engine to support Root Cause Analysis (RCA) of performance and reliability problems in OpenSearch clusters. This framework executes real time root cause analyses using Performance Analyzer metrics. Root cause analysis can significantly improve operations, administration and provisioning of OpenSearch clusters, and it can enable OpenSearch client teams to tune their workloads to reduce errors.

RCA Overview

The RCA framework is modelled as a distributed data-flow graph where data flows downstream from the leaf nodes to the root. Leaf nodes of the graph represent Performance Analyzer metrics on which intermediate computations are performed. The intermediate nodes can be RCAs or other derived symptoms which helps in computation of the final RCA. The framework operates on a single analysis graph which is composed of multiple connected-components. Each connected component contains one top-level RCA at its root.

Terminologies

DataStream a.k.a FlowUnit: A flow-unit is a data object that is exchanged between nodes. Each unit contains a timestamp and the data generated at that timestamp. Data flows downstream from the leaf nodes to the root.

RCA: An RCA is a function which operates on multiple datastreams. These datastreams can have zero or more metrics, symptoms or other RCAs.

Symptom: A symptom is an intermediate function operating on metric or symptom datastreams. For example - A high CPU utilization symptom can calculate moving average over k samples and categorize CPU utilization(high/low) based on a threshold. A symptom node typically outputs a boolean - presence(true) or absence(false) of a symptom.

Metrics: Metrics are typically served as continuous datastreams to downstream nodes in the graph from Performance Analyzer (PA). The RCA framework can also be extended to pull custom metrics from other data sources.

Components

Framework: The RCA runtime operates on an AnalysisGraph. You can extend this class and override the construct method to build your own RCAs. You should specify the path to the class in the analysis-graph-implementor section of config/opensearch-performance-analyzer/rca.conf. The addLeaf and addAllUpstreams helper methods are useful when you define the dependencies between nodes of the graph.

Scheduler: The scheduler invokes the operate method on each graph node in topological order as defined in the AnalysisGraph. Nodes with no dependencies can be executed in parallel. Use flow-units to share data between RCA nodes instead of shared objects to avoid data races and performance bottlenecks.

Networking: The networking layer handles RPCs for remote RCA execution. If a host depends on a remote datastream for RCA computation, it subscribes to the datastream on startup. Subsequently, the output of every RCA execution on the upstream host is streamed to the downstream subscriber.

WireHopper: Interface between the scheduler and the networking layer to help in remote RCA execution.

Context: The context contains a brief summary of the RCA. For example, a High CPU utilization symptom context will contain the average CPU utilization when the symptom was triggered.

Thresholds: Thresholds are static values that must be exceeded to trigger symptoms and RCAs. Thresholds can be dynamically updated and do not require a process restart. Thresholds are user defined and often depend on hardware configuration and OpenSearch versions. The threshold store supports tags to help associate metadata with a threshold.

Tags: Tags are key-value pairs that are specified in the configuration file(rca.conf). Tags can be associated with both hosts and RCA nodes.

  • RCA nodes are only executed on hosts with the exact same tags as the RCA node. A common use-case of tags is to restrict certain RCA nodes to only execute on the cluster_manager node.
  • Tags are also used by hosts to find and subscribe to remote datastreams. For example, a cluster-wide RCA running on the cluster_manager can subscribe to datastreams from all data hosts in the cluster.

Design RFC

RFC

Rest API to get the RCAs

  • To get response for all the available RCA, use:
curl --url "localhost:9600/_plugins/_performanceanalyzer/rca" -XGET
  • To get response for a specific RCA, use:
curl --url "localhost:9600/_plugins/_performanceanalyzer/rca?name=HighHeapUsageClusterRca" -XGET

The sample RCA response from above api

{
    "HighHeapUsageClusterRca": [
        {
            "rca_name": "HighHeapUsageClusterRca",
            "state": "unhealthy",
            "timestamp": 1587426650942,
            "HotClusterSummary": [
                {
                    "number_of_nodes": 2,
                    "number_of_unhealthy_nodes": 1,
                    "HotNodeSummary": [
                        {
                            "host_address": "192.168.144.2",
                            "node_id": "JtlEoRowSI6iNpzpjlbp_Q",
                            "HotResourceSummary": [
                                {
                                    "resource_type": "old gen",
                                    "threshold": 0.65,
                                    "value": 0.81827232588145373,
                                    "avg": NaN,
                                    "max": NaN,
                                    "min": NaN,
                                    "unit_type": "heap usage in percentage",
                                    "time_period_seconds": 600,
                                    "TopConsumerSummary": [
                                        {
                                            "name": "CACHE_FIELDDATA_SIZE",
                                            "value": 590702564
                                        },
                                        {
                                            "name": "CACHE_REQUEST_SIZE",
                                            "value": 28375
                                        },
                                        {
                                            "name": "CACHE_QUERY_SIZE",
                                            "value": 12687
                                        }
                                    ],
                                }
                            ]
                        }
                    ] 
                }
            ]
        }
    ]
}

Temperature profiles

There are ways to get the temperature profile of a cluster and of individual nodes.

The cluster level RCA can only be queried from the elected cluster_manager using the following rest API.

curl "localhost:9600/_plugins/_performanceanalyzer/rca?name=ClusterTemperatureRca"

In order to get the temperature of a particular node, we can use:

curl "localhost:9600/_plugins/_performanceanalyzer/rca?name=AllTemperatureDimensions&local=true"

Rest API to get the ACTIONS suggested by the decider

This api returns the last suggested action set by the decider framework.

curl --url "localhost:9600/_plugins/_performanceanalyzer/actions" -XGET

The sample response from the above API:

{
    "LastSuggestedActionSet": [
        {
            "actionName": "ModifyQueueCapacity",
            "actionable": true,
            "coolOffPeriod": 300000,
            "muted": false,
            "nodeIds": "{oSN5LEstThe25y8NrCGyAg}",
            "nodeIps": "{10.212.52.87}",
            "summary": {
                "Id": "oSN5LEstThe25y8NrCGyAg",
                "Ip": "10.212.52.87",
                "canUpdate": true,
                "coolOffPeriodInMillis": 300000,
                "currentCapacity": 0,
                "desiredCapacity": 1000,
                "resource": 5
            }
            "timestamp": 1603671580290
        }
    ]
}

Building, Deploying, and Running the RCA Framework

Please refer to the Install Guide for detailed information on building, installing and running the RCA framework.

Current Limitations

  • This is alpha code and is in development.
  • We don't have 100% unit test coverage yet and will continue to add new unit tests.
  • We have tested and verified RCA artifacts only for Docker images. Other distributions are yet to be built and tested and will be available as a part of the final release.

Documentation

Please refer to the technical documentation detailed information on installing and configuring Performance Analyzer.

Contributing

See developer guide and how to contribute to this project.

Code of Conduct

This project has adopted the Amazon Open Source Code of Conduct. For more information see the Code of Conduct FAQ, or contact opensource-codeofconduct@amazon.com with any additional questions or comments.

Security

If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our vulnerability reporting page. Please do not create a public GitHub issue.

License

This project is licensed under the Apache v2.0 License.

Copyright

Copyright OpenSearch Contributors. See NOTICE for details.