apache/accumulo

Improve log recovery times

Opened this issue · 3 comments

Is your feature request related to a problem? Please describe.

Write ahead log recovery can take a while because of the following two behaviors.

  • Tablet server processes only do a single log recovery at a time
  • All tablets, even if they have no data in the write ahead log, go through the log recovery process when being loaded on a tablet server.

Those behaviors make log recovery times correlate with the number of tablets per tserver. So as the number of tablets per tserver increases, log recovery time increases.

Describe the solution you'd like

Allow parallel log recovery and faster log recovery. The parallelism is related to #4429, but that change does not completely solve the issue as the lock is still acquired for log recovery.

  • Use a cache during log recovery when reading from sorted walog rfiles
  • Inspect tablets w/ logs before acquiring the recovery lock to see if the logs contain data for them
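
The second bullet could look roughly like the sketch below. It is not Accumulo code: `RecoveryGate`, `hasDataInLogs`, and `loadTablet` are hypothetical names, and the predicate stands in for whatever cheap check against the sorted walogs turns out to be practical. The only point is the ordering: check for data first, and take the recovery lock only when there is something to replay.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names are Accumulo APIs. */
class RecoveryGate {
  // Stands in for the existing lock that serializes log recovery on a tserver.
  private final ReentrantLock recoveryLock = new ReentrantLock();
  // Assumed cheap check: true when the sorted walogs hold data for the tablet.
  private final Predicate<String> hasDataInLogs;
  private int recoveriesRun = 0; // guarded by recoveryLock

  RecoveryGate(Predicate<String> hasDataInLogs) {
    this.hasDataInLogs = hasDataInLogs;
  }

  /** Loads a tablet; returns true only if log recovery actually ran. */
  boolean loadTablet(String tabletId, Runnable recovery) {
    // Tablets with nothing to replay never wait on the recovery lock.
    if (!hasDataInLogs.test(tabletId)) {
      return false;
    }
    recoveryLock.lock();
    try {
      recovery.run();
      recoveriesRun++;
      return true;
    } finally {
      recoveryLock.unlock();
    }
  }

  int recoveriesRun() {
    recoveryLock.lock();
    try {
      return recoveriesRun;
    } finally {
      recoveryLock.unlock();
    }
  }
}
```

With this shape, a tserver hosting many tablets that have walog references but no data in them would only queue on the lock for the tablets that need replay.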

Describe alternatives you've considered

Could potentially produce an F file for log recovery outside of the tablet server somewhere (similar to external compactions). This may have been discussed on an elasticity related issue, but I could not find it. This would be a much larger change and probably would be suitable to do in 2.1. It may require completely refactoring the tablet minor compaction code to make it usable elsewhere.

and probably would be suitable to do in 2.1

Did you mean "would not be"?

@keith-turner - you might be thinking of #4239 where I modified the code such that all Tablet Servers, Scan Servers, and Compactors participated in log recovery. I'm not sure if this is something that could be backported to an earlier version as it may depend on other changes in elasticity w/r/t tablet hosting and tablet management.

That change could speed up log sorting. The problem in this issue happens after the logs are sorted, when tablets w/ sorted walogs are loaded on a tablet server. Tablet servers only load one tablet w/ walogs at a time, which is what makes things slow.
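
To illustrate the difference, here is a minimal sketch of recovering several tablets concurrently on a bounded pool instead of one at a time. `ParallelTabletLoader`, `recoverAll`, and the `recoverOne` callback are made-up names for illustration, not existing Accumulo code, and the pool size is an assumed tunable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names are Accumulo APIs. */
class ParallelTabletLoader {
  /** Recovers each tablet on a fixed-size pool; returns how many succeeded. */
  static int recoverAll(List<String> tabletIds, int threads, Consumer<String> recoverOne) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (String id : tabletIds) {
        // Up to `threads` recoveries run at once; the rest wait in the queue.
        pending.add(pool.submit(() -> recoverOne.accept(id)));
      }
      int done = 0;
      for (Future<?> f : pending) {
        try {
          f.get();
          done++;
        } catch (ExecutionException e) {
          // This tablet's recovery failed; it is not counted as done.
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
      return done;
    } finally {
      pool.shutdown();
    }
  }
}
```

With `threads` set to 1 this behaves like the current one-at-a-time loading, so the pool size is the knob that trades recovery time against memory and I/O pressure on the tserver.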