apache/accumulo

Wait for scan servers when none are present.

Closed this issue · 11 comments

Is your feature request related to a problem? Please describe.

When using scan servers and the default plugins if the set of scan servers is empty, then the default plugin will fall back to tablet severs. Depending on how resources are allocated this may be undesirable and cause problems. Scan servers would be more useful if it was possible to wait for scan servers when none currently exists.

Describe the solution you'd like

Add an option to ConfigurableScanServerSelector for waiting on scan servers. This could look like the following where "waitForScanServers":true is the new option.

    [
     {
       "isDefault":true,
       "maxBusyTimeout":"5m",
       "busyTimeoutMultiplier":4,
       "waitForScanServers":true
       "attemptPlans":[
         {"servers":"3", "busyTimeout":"33ms"},
         {"servers":"100%", "busyTimeout":"100ms"}
       ]
     }
    ]

Support for this new option could be added by having the ConfigurableScanServerSelector loop where it sleeps and checks the set of scan servers.

Scanners can have timeouts set, so waiting in the plugin could violate those timeouts. To handle this, may want to add support to the SPI for waiting that takes the scanner timeouts into consideration. Could add support to ScanServerSelector.SelectorParameters for waiting

  interface SelectorParameters {
       /**
         *  Determines if a scan server selector can wait for a conditoin to be true (like scan servers to be present).  If this returns true and the desired state is not met, then throw a timed out exception.
         */
       boolean hasTimedOut();
 }

Then ConfigurableScanServerSelector could call the hasTimedOut() function in its loop waiting for non zero tservers. Adding support for timeouts to the SPI could be follow on work.

I think it makes sense for the ConfigurableScanServerSelector to wait for ScanServers to be available instead of immediately falling back to TabletServers. However, I think there should be a timeout set, I don't think the client should wait forever.

I think it makes sense for the ConfigurableScanServerSelector to wait for ScanServers to be available instead of immediately falling back to TabletServers. However, I think there should be a timeout set, I don't think the client should wait forever.

Instead of the config being a boolean, it could be a timeout. Maybe something like the following.

    [
     {
       "isDefault":true,
       "maxBusyTimeout":"5m",
       "busyTimeoutMultiplier":4,
       "timeToWaitForScanServers":"120s"
       "attemptPlans":[
         {"servers":"3", "busyTimeout":"33ms"},
         {"servers":"100%", "busyTimeout":"100ms"}
       ]
     }
    ]

Then if someone wants to wait forever they can set that config really high.

I can take a look at this.

It looks like in the elasticity branch ConfigurableScanServerSelector has a check where it is waiting for scan servers to be present. Are you looking for an implementation along the lines of this for 2.1?

It looks like in the elasticity branch ConfigurableScanServerSelector has a check where it is waiting for scan servers to be present. Are you looking for an implementation along the lines of this for 2.1?

Yeah, we want that feature back to 2.1. However need to change the boolean to a timeout value. It might be best start w/ the elasticity branch and change its boolean to a timeout value. Once that change is made can use elasticity as guide for making the change in 2.1.

Thanks for the clarification. I can go ahead and get the changes made in elasticity. Do we want to make another ticket for adding the changes to 2.1?

@ArbaazKhan1 do you know if there is anything left for this issue?

No, everything should be finished for this issue.

This issue is marked for 2.1.3, but the linked PR is for 4.0.0. There are some open unanswered questions about backporting something for 2.1. I'm re-opening this until that question is answered and the ticket's milestone is properly marked based on the answer.

@ctubbsii - #4638 was merged into 4.0.0 and follow-on PR #4671 backported it to 2.1. #4671 could be named better, but it appropriately marked for the 2.1.3 milestone. I don't think there is more to do here.

Thanks!