[DocDB] Racing DeleteSnapshot and RestoreSnapshot can corrupt the WAL, crashlooping affected TServers
Status: Closed
Jira Link: DB-11986
Description
To handle a RestoreSnapshot request for a snapshot that is not part of a schedule, the leader master sends an RPC to each tablet of the snapshot. Similarly, for DeleteSnapshot, the leader master sends an RPC to each tablet of the snapshot. The tserver RPC handlers are relatively simple, doing some light validation and then committing a raft op to actually do the work.
There is no validation, in either the master or the tserver, to prevent a tablet leader from committing a DELETE_ON_TABLET raft op and then committing a RESTORE_ON_TABLET raft op for the same snapshot. When this occurs the tablet is corrupted and the tservers hosting tablet peers enter a crash loop: the apply for the DELETE_ON_TABLET op removes the snapshot directory, and the apply for the RESTORE_ON_TABLET op fails when it doesn't find the snapshot directory. Our raft machinery crashes if a raft apply op returns a non-ok status, killing the tserver. On boot, tservers try to apply un-applied entries from the WAL, crashing again when they reach the RESTORE_ON_TABLET op.
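The failure mode can be sketched with a toy model (all names and structure here are hypothetical illustrations, not actual YugabyteDB code): once DELETE_ON_TABLET and RESTORE_ON_TABLET ops for the same snapshot are both committed, every apply pass, including WAL replay after a restart, dies on the restore op.

```python
import os
import tempfile

# Toy model of the crash loop; names are hypothetical, not YugabyteDB code.

class FatalApplyError(Exception):
    """Stands in for the process-killing crash on a non-OK apply status."""

class TabletPeer:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.wal = []  # committed-but-unapplied ops survive restarts

    def commit(self, op, snapshot_id):
        self.wal.append((op, snapshot_id))

    def apply_all(self):
        # Runs both at runtime and during WAL replay on boot.
        for op, snapshot_id in self.wal:
            snap_dir = os.path.join(self.data_dir, snapshot_id)
            if op == "DELETE_ON_TABLET":
                if os.path.isdir(snap_dir):
                    os.rmdir(snap_dir)  # apply deletes the snapshot directory
            elif op == "RESTORE_ON_TABLET":
                if not os.path.isdir(snap_dir):
                    # Non-OK apply status -> raft machinery kills the process.
                    raise FatalApplyError(f"snapshot dir missing: {snap_dir}")

def run_scenario():
    """Commit DELETE then RESTORE for one snapshot; count how many apply
    passes (the initial apply plus one simulated reboot) crash."""
    crashes = 0
    with tempfile.TemporaryDirectory() as d:
        os.mkdir(os.path.join(d, "snap-1"))
        peer = TabletPeer(d)
        peer.commit("DELETE_ON_TABLET", "snap-1")   # nothing blocks this order
        peer.commit("RESTORE_ON_TABLET", "snap-1")
        for _ in range(2):  # first apply, then WAL replay after a "restart"
            try:
                peer.apply_all()
            except FatalApplyError:
                crashes += 1  # the same op crashes the tserver every time
    return crashes

print(run_scenario())  # prints 2: the restore op fails on every pass
```

Since the restore op sits in the WAL ahead of the applied index, no number of restarts gets the peer past it, which is why the affected tservers loop rather than recover.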
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
- I confirm this issue does not contain any sensitive information.