benbjohnson/litestream

Clarify autocheckpoints consequences

FiloSottile opened this issue · 5 comments

https://litestream.io/tips/ recommends turning off autocheckpoints for servers with a high write load, if I understand correctly because the application might race Litestream while it switches locks.

I would like to understand the consequences of such a race happening. Is Litestream just going to notice it missed a WAL and make a fresh snapshot, like it does when it is stopped and restarted, or is it going to corrupt the replica?

I am asking because I would prefer not to make my application dependent on Litestream for checkpointing; I am willing to accept the risk of a few extra snapshots, but not of a corrupted replica.
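For reference, the tips page's recommendation boils down to a single PRAGMA. A minimal sketch of setting it from the application side, assuming Go with the mattn/go-sqlite3 driver (the driver, DSN, and file name are my own choices, not taken from the tips page):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "file:app.db?_journal_mode=WAL&_busy_timeout=5000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Disable SQLite's automatic checkpointing so Litestream can decide
	// when the WAL gets checkpointed. Note that this PRAGMA is
	// per-connection, so a real application would apply it to every
	// pooled connection rather than only once.
	if _, err := db.Exec(`PRAGMA wal_autocheckpoint = 0;`); err != nil {
		log.Fatal(err)
	}
}
```

The question above is about what happens when this PRAGMA is *not* set and the application's own checkpoints occasionally win the race.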

hifi commented

I'm not the original author, but I have worked with Litestream for a good while, so take my comments with a grain of salt.

From what I understand of the code and some recent fixes I've done around checkpointing, if the WAL gets checkpointed successfully outside Litestream's supervision, Litestream will declare that it has lost its position due to a checksum mismatch and force a new generation, as you thought.

This could be tested by disabling the persistent read lock code path and forcing checkpoints and writes to a database during replication.

However, I don't quite follow your reasoning for wanting your application to control checkpointing. Litestream intentionally keeps a read transaction open to prevent application checkpoints from rolling the WAL into the database from the outside, so in practice all of your own checkpoints would fail unless they race Litestream successfully, which is an error condition.
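To make the "checkpoints would fail" part concrete: with a blocking checkpoint mode, SQLite reports in the first result column whether the checkpoint was held up by a reader. A hedged sketch of an application-side checkpoint attempt (my own helper, not Litestream code), again assuming Go with mattn/go-sqlite3:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

// checkpointWAL runs a blocking checkpoint and reports whether it was held
// up by a reader, such as Litestream's long-lived read transaction.
func checkpointWAL(db *sql.DB) error {
	var busy, logFrames, checkpointed int
	err := db.QueryRow(`PRAGMA wal_checkpoint(TRUNCATE);`).
		Scan(&busy, &logFrames, &checkpointed)
	if err != nil {
		return err
	}
	if busy == 1 {
		// A reader prevented the checkpoint from completing, so the WAL
		// was not fully checkpointed and truncated.
		return fmt.Errorf("checkpoint blocked: %d of %d WAL frames checkpointed",
			checkpointed, logFrames)
	}
	return nil
}

func main() {
	db, err := sql.Open("sqlite3", "file:app.db?_journal_mode=WAL&_busy_timeout=5000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := checkpointWAL(db); err != nil {
		log.Println(err)
	}
}
```

While Litestream holds its read lock, a checkpoint like this would typically come back with busy=1 rather than corrupting anything.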

FiloSottile commented

Thank you for the answer! Glad to hear it fails safe. Maybe it's worth mentioning on the tips page? As it is, it looks a bit scary. ("So wait, if I don't read this page and remember to set a PRAGMA, do I risk corruption?")

> However, I don't quite follow your reasoning for wanting your application to control checkpointing. Litestream intentionally keeps a read transaction open to prevent application checkpoints from rolling the WAL into the database from the outside, so in practice all of your own checkpoints would fail unless they race Litestream successfully, which is an error condition.

Oh, sorry, I wasn't clear. I am saying that I don't want my application to have to be run with Litestream. Some users might use Litestream, some might use EBS atomic snapshots, and some might choose no replication at all. If I turn off autocheckpointing, users who don't run Litestream will end up with an endlessly growing WAL, so I'd have to add a config option, which is annoying and error-prone.

hifi commented

Ah, right, that makes sense.

I'd suggest keeping sane defaults (checkpointing on) and allowing the PRAGMAs to be overridden in config for anyone who wants to improve compatibility with Litestream. That's what we've been doing: https://github.com/mautrix/go-util/blob/main/dbutil/litestream/register.go
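A sketch of that idea in Go with mattn/go-sqlite3: register a second driver that applies Litestream-friendly PRAGMAs on every new connection, and keep the stock driver (with autocheckpointing) as the default. The driver name and exact PRAGMA set here are illustrative; see the linked register.go for what mautrix actually does.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"log"

	sqlite3 "github.com/mattn/go-sqlite3"
)

func init() {
	// Variant of the sqlite3 driver that applies Litestream-friendly
	// settings on every new connection. Installs that don't run
	// Litestream keep using the plain "sqlite3" driver and SQLite's
	// default autocheckpointing, so their WAL doesn't grow forever.
	sql.Register("sqlite3-litestream", &sqlite3.SQLiteDriver{
		ConnectHook: func(conn *sqlite3.SQLiteConn) error {
			pragmas := []string{
				"PRAGMA busy_timeout = 5000;",
				"PRAGMA synchronous = NORMAL;",
				"PRAGMA wal_autocheckpoint = 0;", // Litestream checkpoints for us
			}
			for _, p := range pragmas {
				if _, err := conn.Exec(p, []driver.Value{}); err != nil {
					return err
				}
			}
			return nil
		},
	})
}

func main() {
	// The driver name could come from a single config flag:
	// "sqlite3-litestream" when the operator opts in, "sqlite3" otherwise.
	db, err := sql.Open("sqlite3-litestream", "file:app.db?_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```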

But it should indeed be safe to leave checkpointing on; at most it would force a new generation/snapshot if the app wins the unlikely race.

hifi commented

Just to be sure, I ran some tests where the read lock was intentionally removed from Litestream so it couldn't prevent external checkpoints at all. No matter how much I abused it, it always recovered successfully with:

```
time=2023-12-20T12:46:52.927+02:00 level=INFO msg="sync: new generation" db=/path/to/test.db generation=b9b04512b4365e9a reason="wal overwritten by another process"
time=2023-12-20T12:46:52.931+02:00 level=INFO msg="write snapshot" db=/path/to/test.db replica=file position=b9b04512b4365e9a/00000001:4152
```

So at worst it would do as expected and start a new generation if it lost the WAL. The remote was never corrupted and could always restore up to the latest sync.
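For anyone who wants to reproduce this, here is a rough sketch of that kind of abuse test in Go with mattn/go-sqlite3: hammer the database with writes and periodic external checkpoints while `litestream replicate` runs against the same file. The schema, paths, and timings are my own, and this sketch doesn't patch out Litestream's read lock, so on a stock build most external checkpoints will simply be blocked instead of winning the race.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "file:/path/to/test.db?_journal_mode=WAL&_busy_timeout=5000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)`); err != nil {
		log.Fatal(err)
	}

	for i := 1; ; i++ {
		// Constant write load.
		if _, err := db.Exec(`INSERT INTO t (v) VALUES (?)`, time.Now().String()); err != nil {
			log.Println("insert:", err)
		}
		// Periodically force a checkpoint from outside Litestream; if it
		// ever wins the race, Litestream should log "sync: new generation"
		// and take a fresh snapshot rather than corrupt the replica.
		if i%100 == 0 {
			if _, err := db.Exec(`PRAGMA wal_checkpoint(TRUNCATE);`); err != nil {
				log.Println("checkpoint:", err)
			}
		}
		time.Sleep(time.Millisecond)
	}
}
```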

I'll update the documentation to be less scary about that, thanks!

hifi commented

I added a new sentence to the paragraph:

> When Litestream notices this it will force a new generation and take a full snapshot to ensure consistency.

That should clear up the fear of it breaking. Closing this issue.