opensearch-project/performance-analyzer-rca

[FEATURE] Change the logic of RCA conditional execution tag system to optimally cover all use-cases

Tjofil opened this issue · 0 comments

Is your feature request related to a problem?

The RCA framework currently offers three locus tags for restricting an RCA node's execution to certain OpenSearch (OS) node types. One of them, LOCUS_DATA_CLUSTER_MANAGER_NODE, renders LOCUS_DATA_NODE basically useless for selective RCA node execution:
When calculating which RCA nodes to execute locally, RCASchedulerTask consults these tags, matches them against the ones defined in the .conf files, and decides whether each RCA node should be executed locally. The config file is picked based on the OS node's role, through the InstanceDetails object. The problem is that RCA has no support for treating a node as both a Data and a Cluster Manager node at the same time; when such a node exists in the cluster, it is treated only as a Cluster Manager node. Consequently, if RCA nodes are marked with LOCUS_DATA_CLUSTER_MANAGER_NODE and the cluster has a dedicated Cluster Manager node, Data-dependent metrics and Data-node-specific analyses will execute on the dedicated CM node without any effect (or cause exceptions, as in #305). This adds unnecessary overhead on top of the Cluster Manager-specific analyses already running there, which are sometimes both memory- and CPU-hungry themselves.
Tagging these RCA nodes with only the LOCUS_DATA_NODE tag would prevent them from executing when a node is both Data and Cluster Manager, because from the point of view of the RCA logic it is treated as Cluster Manager only.
Neither situation is ideal: we want to be able to restrict certain analyses from executing on dedicated Cluster Manager nodes without preventing them from executing on non-dedicated Cluster Manager nodes.
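
To make the two failure cases concrete, here is a minimal Java model of the behavior described above. It is an illustration only and not the framework's code: the class, the `Role` enum, the locus strings, and both method names are hypothetical, and the real matching in RCASchedulerTask and the .conf files may differ in detail.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the described behavior; none of these names come from the RCA framework.
final class LocusMatchingSketch {

    // Roles an OpenSearch node can hold in this model.
    enum Role { DATA, CLUSTER_MANAGER }

    // Placeholder locus values standing in for whatever the .conf files define.
    static final String DATA = "data";
    static final String CLUSTER_MANAGER = "cluster-manager";

    // As described in the issue: a node holding the Cluster Manager role
    // is treated as Cluster Manager only, even if it also holds the Data role.
    static Set<String> effectiveLoci(Set<Role> hostRoles) {
        if (hostRoles.contains(Role.CLUSTER_MANAGER)) {
            return Set.of(CLUSTER_MANAGER);
        }
        return Set.of(DATA);
    }

    // An RCA node runs locally if any of its tags matches the host's effective loci.
    static boolean runsLocally(Set<String> rcaNodeTags, Set<Role> hostRoles) {
        Set<String> loci = effectiveLoci(hostRoles);
        return rcaNodeTags.stream().anyMatch(loci::contains);
    }
}
```

In this model, an RCA node tagged for both loci, `Set.of(DATA, CLUSTER_MANAGER)`, runs on a dedicated Cluster Manager host (`EnumSet.of(Role.CLUSTER_MANAGER)`), which is the unwanted overhead case. An RCA node tagged `Set.of(DATA)` does not run on a hybrid host (`EnumSet.of(Role.DATA, Role.CLUSTER_MANAGER)`), which is the missed-execution case.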

What solution would you like?
Change the logic and granularity of the execution tag system so that RCA nodes can be tagged for Data nodes, for Cluster Manager nodes, and for hybrid nodes like the ones described in the previous paragraph.
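
One possible shape for such a finer-grained scheme, sketched in Java under stated assumptions: each tag is a predicate over the full set of roles a host holds, rather than a match against a single role. Everything here (the `Locus` enum, its constant names, and `runsLocally`) is hypothetical and only meant to show the granularity being requested, not a proposed API.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of a finer-grained tag system; not part of the RCA framework.
final class GranularLocusSketch {

    // Roles an OpenSearch node can hold in this model.
    enum Role { DATA, CLUSTER_MANAGER }

    // Each tag decides for itself whether it matches a host's complete role set.
    enum Locus {
        // Any host holding the Data role, hybrid hosts included.
        DATA_CAPABLE(r -> r.contains(Role.DATA)),
        // Any host holding the Cluster Manager role, hybrid hosts included.
        CLUSTER_MANAGER_CAPABLE(r -> r.contains(Role.CLUSTER_MANAGER)),
        // Cluster Manager hosts that do not also hold the Data role.
        DEDICATED_CLUSTER_MANAGER(r -> r.contains(Role.CLUSTER_MANAGER) && !r.contains(Role.DATA)),
        // Hosts holding both roles.
        HYBRID(r -> r.contains(Role.DATA) && r.contains(Role.CLUSTER_MANAGER));

        private final Predicate<Set<Role>> matcher;

        Locus(Predicate<Set<Role>> matcher) {
            this.matcher = matcher;
        }

        boolean matches(Set<Role> hostRoles) {
            return matcher.test(hostRoles);
        }
    }

    // An RCA node runs locally if at least one of its tags matches the host's roles.
    static boolean runsLocally(Set<Locus> rcaNodeTags, Set<Role> hostRoles) {
        return rcaNodeTags.stream().anyMatch(tag -> tag.matches(hostRoles));
    }
}
```

With this shape, a Data-dependent analysis tagged `DATA_CAPABLE` would run on Data-only and hybrid hosts but be skipped on a dedicated Cluster Manager host. This depends on the host's full role set being visible to the scheduler, which the issue states is not currently the case.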

What alternatives have you considered?
Eventually changing the way the .conf files work, since they influence the conditional local execution, but this may not be ideal.