[DocDB] Racing DeleteSnapshot and RestoreSnapshot can corrupt the WAL, crashlooping affected TServers
Status: Closed
Jira Link: DB-11986
Description
To handle a RestoreSnapshot request for a snapshot that is not part of a schedule, the leader master sends an RPC to each tablet of the snapshot. Similarly, for DeleteSnapshot, the leader master sends an RPC to each tablet of the snapshot. The tserver RPC handlers are relatively simple, doing some light validation and then committing a raft op to actually do the work.
There is no validation, in either the master or the tserver, to prevent a tablet leader from committing a DELETE_ON_TABLET raft op and then committing a RESTORE_ON_TABLET raft op for the same snapshot. When this occurs the tablet is corrupted and the tservers hosting tablet peers enter a crash loop: the apply for the DELETE_ON_TABLET op removes the snapshot directory, and the apply for the RESTORE_ON_TABLET op fails when it doesn't find the snapshot directory. Our raft machinery crashes if a raft apply op returns a non-ok status, killing the tserver. On boot, tservers try to apply un-applied entries from the WAL, crashing again when they reach the RESTORE_ON_TABLET op.
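The failure mode can be sketched with a toy model (all names and structure here are hypothetical illustrations, not actual YugabyteDB code): once DELETE_ON_TABLET and RESTORE_ON_TABLET ops for the same snapshot are both committed, every apply pass, including WAL replay after a restart, dies on the restore op.

```python
import os
import tempfile

# Toy model of the crash loop; names are hypothetical, not YugabyteDB code.

class FatalApplyError(Exception):
    """Stands in for the process-killing crash on a non-OK apply status."""

class TabletPeer:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.wal = []  # committed-but-unapplied ops survive restarts

    def commit(self, op, snapshot_id):
        self.wal.append((op, snapshot_id))

    def apply_all(self):
        # Runs both at runtime and during WAL replay on boot.
        for op, snapshot_id in self.wal:
            snap_dir = os.path.join(self.data_dir, snapshot_id)
            if op == "DELETE_ON_TABLET":
                if os.path.isdir(snap_dir):
                    os.rmdir(snap_dir)  # apply deletes the snapshot directory
            elif op == "RESTORE_ON_TABLET":
                if not os.path.isdir(snap_dir):
                    # Non-OK apply status -> raft machinery kills the process.
                    raise FatalApplyError(f"snapshot dir missing: {snap_dir}")

def run_scenario():
    """Commit DELETE then RESTORE for one snapshot; count how many apply
    passes (the initial apply plus one simulated reboot) crash."""
    crashes = 0
    with tempfile.TemporaryDirectory() as d:
        os.mkdir(os.path.join(d, "snap-1"))
        peer = TabletPeer(d)
        peer.commit("DELETE_ON_TABLET", "snap-1")   # nothing blocks this order
        peer.commit("RESTORE_ON_TABLET", "snap-1")
        for _ in range(2):  # first apply, then WAL replay after a "restart"
            try:
                peer.apply_all()
            except FatalApplyError:
                crashes += 1  # the same op crashes the tserver every time
    return crashes

print(run_scenario())  # prints 2: the restore op fails on every pass
```

Since the restore op sits in the WAL ahead of the applied index, no number of restarts gets the peer past it, which is why the affected tservers loop rather than recover.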
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
- I confirm this issue does not contain any sensitive information.