Splitstore race condition due to caching and reverts
Stebalien opened this issue · 3 comments
The splitstore may remove important state given the following sequence of events:
1. Client syncs to tipset A at height X.
2. Client switches to tipset B at height X.
3. Splitstore starts garbage collecting.
4. Client switches back to tipset A at height X.
In step 4, the client will not re-execute tipset A because its state will already be in the cache, so the state for tipset A will not get re-written. The splitstore will fail to keep the state from tipset A because (a) it was not reachable from tipset B and (b) it was not written after garbage collection started.
This can lead to corrupted datastores with missing blocks, leading to state mismatches and sync failures when the splitstore is enabled.
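The sequence above can be reproduced in a toy model. This is only a sketch under stated assumptions: `Node`, `SetHead`, `StartCompaction`, and `FinishCompaction` are hypothetical names invented for illustration, not Lotus APIs, and the "hot store" is reduced to a set of state roots. The model keeps only what the description above names: state reachable from the head when compaction starts, plus anything written while compaction is running.

```go
package main

// Node is a toy model of a client with a state cache and a hot store.
// All names here are hypothetical and do not correspond to Lotus types.
type Node struct {
	hot        map[string]bool   // state roots currently in the hot store
	stateCache map[string]string // tipset -> cached state root
	head       string            // current head tipset
	compacting bool              // true while compaction is running
	marked     map[string]bool   // roots reachable from the head at compaction start
	liveWrites map[string]bool   // roots written while compaction is running
}

func NewNode() *Node {
	return &Node{
		hot:        map[string]bool{},
		stateCache: map[string]string{},
	}
}

// SetHead switches to tipset ts. The tipset is executed (and its state
// written) only on a cache miss; on a cache hit nothing is written.
func (n *Node) SetHead(ts string) {
	if _, ok := n.stateCache[ts]; !ok {
		root := "state-of-" + ts
		n.hot[root] = true
		if n.compacting {
			n.liveWrites[root] = true
		}
		n.stateCache[ts] = root
	}
	n.head = ts
}

// StartCompaction marks the state reachable from the current head and
// begins tracking writes.
func (n *Node) StartCompaction() {
	n.compacting = true
	n.liveWrites = map[string]bool{}
	n.marked = map[string]bool{n.stateCache[n.head]: true}
}

// FinishCompaction purges every root that was neither marked nor
// written during compaction.
func (n *Node) FinishCompaction() {
	for root := range n.hot {
		if !n.marked[root] && !n.liveWrites[root] {
			delete(n.hot, root)
		}
	}
	n.compacting = false
}

// HeadStateAvailable reports whether the head's state is still stored.
func (n *Node) HeadStateAvailable() bool {
	return n.hot[n.stateCache[n.head]]
}
```

Calling `SetHead("A")`, `SetHead("B")`, `StartCompaction()`, `SetHead("A")`, `FinishCompaction()` in that order leaves the model with head A but no stored state for it, because the second switch to A hits the cache and writes nothing.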
2024-09-03
During triage we discussed whether we could drop the cache (maybe once a day), but we need to investigate whether this is feasible. @ZenGround0, you have a lot of knowledge about the splitstore; do you know if this would be okay? Also, once we schedule more time to tackle this issue, maybe pair up with someone else so we can share some knowledge about the splitstore.
> During the triage we discussed if we could drop the cache (maybe once a day).
Specifically, drop the state cache for all tipsets not on the canonical chain at the start of compaction. That way we have to recompute their state when switching to them, ensuring the splitstore sees that their state is live.
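As a rough illustration of that proposal, here is a minimal sketch assuming the state cache can be modeled as a map from tipset key to state root. `StateCache`, `DropNonCanonical`, and `StateFor` are hypothetical names made up for this example and are not taken from the Lotus codebase; how the canonical set is computed is also left as an assumption.

```go
package main

// StateCache maps a tipset key to its computed state root.
// Hypothetical type for illustration only.
type StateCache map[string]string

// DropNonCanonical removes every cached entry whose tipset is not on
// the canonical chain and returns how many entries were dropped.
// The idea is to call it at the start of compaction.
func (c StateCache) DropNonCanonical(canonical map[string]bool) int {
	dropped := 0
	for ts := range c {
		if !canonical[ts] {
			delete(c, ts) // deleting during range is safe in Go
			dropped++
		}
	}
	return dropped
}

// StateFor returns the state root for ts. On a cache miss it calls
// execute, which stands in for re-executing the tipset and writing its
// state, and reports that a re-execution happened.
func (c StateCache) StateFor(ts string, execute func(string) string) (string, bool) {
	if root, ok := c[ts]; ok {
		return root, false
	}
	root := execute(ts)
	c[ts] = root
	return root, true
}
```

With A and B cached and only B canonical, `DropNonCanonical` evicts A, so a later switch back to A misses the cache and goes through `execute`. In the real system that re-execution is what would rewrite the state and let the splitstore observe it as live.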
@rjan90 I could probably help speed this up by pairing with someone, and I would like to do that. I am rusty, though, so I will need time to understand what's going on and the proposed solution.