martinsumner/leveled

Snapshot releasing on shutdown

martinsumner opened this issue · 2 comments

When a bookie is shut down, there may be ongoing snapshots (supporting queries) which will be impacted by the shutdown.

The penciller and the inker take different approaches. The penciller shuts down regardless of snapshots, and the inker will close snapshots before shutting down.

Both situations have led to issues.

  • With the inker, a snapshot may already have been released and shut down during the inker's own shutdown phase. This is particularly likely when Journal compaction is running, as the clerk must stop before the snapshots are shut down, and the clerk may be busy scoring/re-writing a file. The inker then hits an unhandled noproc exception when it tries to shut down the snapshot, and crashes rather than shutting down cleanly.
  • When leveled is being run as a parallel key store, aae_folds may still be running when a key store is shut down due to rebuilds. This leads to aae_fold query crashes.

One possible solution is an option to pass a delay budget into the calls to book_destroy/book_close, so that on shutdown the inker or penciller can loop, waiting for snapshots to be released, for up to the delay budget before triggering shutdown. The delay budget should generally be small (e.g. 20 seconds) but could be set to a larger value (e.g. 300 seconds) when shutting down a store due to aae_rebuild.
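To make the delay-budget idea concrete, here is a minimal, language-neutral sketch written in Python rather than Erlang. It is not leveled code: `wait_for_snapshots`, `count_snapshots` and `poll_interval_s` are hypothetical names, and the real inker/penciller would presumably track registered snapshots in their own process state. The loop polls until no snapshots remain or the budget is spent, then reports which of the two happened so the caller can proceed with shutdown either way.

```python
import time


def wait_for_snapshots(count_snapshots, delay_budget_s, poll_interval_s=0.1,
                       clock=time.monotonic, sleep=time.sleep):
    """Wait for outstanding snapshots to be released, up to a delay budget.

    count_snapshots: hypothetical callable returning the number of snapshots
                     still registered with the process being shut down.
    delay_budget_s:  maximum total time to wait, in seconds.
    Returns True if every snapshot was released within the budget,
    False if the budget ran out with snapshots still outstanding.
    """
    deadline = clock() + delay_budget_s
    while count_snapshots() > 0:
        remaining = deadline - clock()
        if remaining <= 0:
            # Budget exhausted: the caller shuts down regardless.
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
    return True
```

Under these assumptions, a normal close might call `wait_for_snapshots(count, 20)` and a close triggered by aae_rebuild `wait_for_snapshots(count, 300)`. In both cases shutdown proceeds after the call returns; the return value only indicates whether any queries are likely to be interrupted.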

The level of additional complexity here needs to be balanced against the fairly limited benefit. The inker shutdown problem arises in Riak when trying to shut down a node while significant AAE activity is ongoing. The consequences are eventually handled, though, by the AAE tree rebuilds triggered on startup.

Likewise the occasional failure of an aae_fold is an inconvenience rather than a critical problem.