opensearch-project/opensearch-sdk-java

[META] [BUG] Extension-to-OpenSearch connections have multiple bugs

dbwiddis opened this issue · 6 comments

What is the bug?

This is a consolidation of multiple bugs and attempts to fix (not yet resolved) that are creating various blockers for @scrawfor99 and @cwperks (Security), @ryanbogan and @joshpalis (Integration testing), and @dbwiddis and @vibrantvarun (performance testing).

Root problem: Extensions need the OpenSearch IP and port to create a transport connection. This is defined/available in multiple places:

  1. The "published" address used by OpenSearch when creating a connection (where it says to connect)
  2. The bind address and port used by OpenSearch (where it is listening) (what we need)
  3. The localNode address passed by OpenSearch to SDK in its initialization request, which may or may not match (1) or (2)
  4. The host and port configured in the extension's settings, which may or may not match (1), (2), or (3)

There are some chicken-and-egg issues as well:

  • The extension needs to be running first before OpenSearch so may not even know the correct host at that time
  • The extension needs the address to send transport requests (connection opened during init) and REST requests (at any later time, and may use a different address/port than transport, also unknown before run time).

How can one reproduce the bug?

Attempt to initialize extensions and establish both Transport and REST connections at other than well-known localhost ports.

Do you have any additional context?

Current status:

  1. OpenSearch communicates its local node to the SDK which might be a correct place to call back, but might not
  2. The SDK overwrites the extension settings with this new address and port without checking its validity (#730) except caused new issues (#761, #769), prompting:
  3. If the local node address is exactly "127.0.0.1" then it doesn't overwrite the settings in the step above (#763)
  4. However, in the "connect to local node as extension" it's still using the old port from the request, this got missed being overwritten in 763.
  5. However, that connect to local node as extension is also causing duplicate thread context bugs (#771)
  6. And all the above fixes don't handle the case where the OpenSearchNode has the correct address but the incorrect port (and the fact that we need different address/ports for Transport and REST.

What needs to be done to fix this:

  1. Consider eliminating the SDK-to-OpenSearch transport connection entirely (#767) to get rid of all these problems. Then we just need to communicate the REST address and port.
  2. Consider overriding address but not port
  3. Consider new settings to separate REST and Transport host and port
  4. Consider forcing address/port settings to not be overwritten using some sort of configurable parameter
  5. Consider finding "truth" in a different location (@saratvemulapalli noted here)
  6. Consider attempting multiple connections at all known sources until one works

Action:

  • Immediate
    • @joshpalis and @ryanbogan work with @scrawfor99 to see if reverting PR #763 unblocks security work.
    • Coordinate with @cwperks as needed to resolve #771 if that presents additional issues
  • Near term
    • Implement a new /_extensions/discovery REST API on OpenSearch which re-attempts to initialize any extensions which failed initialization the first time. This will help work around initialization sequencing issues.
  • Longer term
    • Continue discussion on this issue on a better long term fix per above discussion

Next immediate step: leave the SDKClient settings alone as they are needed for REST client and shouldn't have been updated using Transport. I tried to revert #730 but too many changes have happened since. For now, we can simply remove this line:

extensionsRunner.getSdkClient().updateOpenSearchNodeSettings(extensionInitRequest.getSourceNode().getAddress());

This will allow the ExtensionsInitRequestHandler to (try to) make a Transport Connection using the port information provided in the init request, while keeping the REST connections per the Extension settings, giving us the two different connection parameters.

The fix worked for @scrawfor99.

Reverting the problematic portion of #730 resolves #769. We are effectively maintaining the two different connection parameters:

  • Extension Settings used for ExtensionsRunner initialization contain the REST endpoint
  • The openSearchNode communicated with the initialization request contains (usually) the Transport endpoints. This still doesn't work in a single-node OpenSearch case, however. Removing the need for the openSearchNode at all is proposed in #767.

The Extension Settings solution still doesn't resolve the chicken-and-egg situation of needing to know the OpenSearch REST endpoint prior to starting the extension. To resolve this, @owaiskazi19 is working toward a solution:

  • remove the call to ExtensionsManager initialize() in Node.java
  • trigger that initialize call via a REST endpoint. For example, _extensions/discovery or _extensions/initialize or similar.
  • additionally permit adding extensions not present in extensions.yml by passing _extensions/initialize/{name} and including a body which includes the same parameters as are currently in the .yaml file for each extension.

We can update the OpenSearch REST connections during extension initialization using the environment settings passed to the ExtensionsManager from Node.java here . See the http.port setting below :

{
  "client.type": "node",
  "cluster.initial_cluster_manager_nodes": [
    "runTask-0"
  ],
  "cluster.name": "runTask",
  "cluster.routing.allocation.disk.watermark.flood_stage": "1b",
  "cluster.routing.allocation.disk.watermark.high": "1b",
  "cluster.routing.allocation.disk.watermark.low": "1b",
  "discovery.initial_state_timeout": "0s",
  "discovery.seed_hosts": "127.0.0.1:9300",
  "discovery.seed_providers": "file",
  "http.port": "9200",
  "http.type.default": "netty4",
  "indices.breaker.total.use_real_memory": "false",
  "logger.org.opensearch.action.support.master": "DEBUG",
  "logger.org.opensearch.cluster.coordination": "DEBUG",
  "node.attr.shard_indexing_pressure_enabled": "true",
  "node.attr.testattr": "test",
  "node.name": "runTask-0",
  "node.portsfile": "true",
  "path.data": [
    "/local/home/jpalis/repos/integ/OpenSearch/build/testclusters/runTask-0/data"
  ],
  "path.home": "/local/home/jpalis/repos/integ/OpenSearch/distribution/archives/linux-tar/build/install/opensearch-3.0.0-SNAPSHOT",
  "path.logs": "/local/home/jpalis/repos/integ/OpenSearch/build/testclusters/runTask-0/logs",
  "path.repo": [
    "/local/home/jpalis/repos/integ/OpenSearch/build/testclusters/runTask-0/repo"
  ],
  "path.shared_data": "/local/home/jpalis/repos/integ/OpenSearch/build/testclusters/runTask-0/sharedData",
  "script.disable_max_compilations_rate": "true",
  "transport.port": "9300",
  "transport.type.default": "netty4"
}

OpenSearch defaults its http port to 9200 if not otherwise configured within the opensearch.yml via the http.port setting, so we can check if the setting is present and if not we can also default the extension REST clients to use port 9200. Sending this information to the extension as part of the ExtensionInitRequest will allow us to update the SDKRestClient and also OpenSearchAsyncClient after they have been created using the extension settings within createComponents(). Dynamically updating the rest clients during initialization should help in running integration tests for extensions, as the OpenSearch distribution spun on for integration test clusters randomizes its transport/http ports.

We can update the OpenSearch REST connections during extension initialization using the environment settings passed to the ExtensionsManager from Node.java here .

Nice find!

See the http.port setting below

That's great for the REST client, but you didn't mention this:

"transport.port": "9300"

This doesn't solve for all use cases but is definitely helpful in filling in parts of the puzzle.