benbjohnson/litestream

Unclean close leaves Litestream in weird state

Closed this issue · 5 comments

hifi commented

I've been using Litestream as a library and tried a couple of things to abuse it, since the usage pattern is a bit different from running it as a separate process.

What I'm seeing is that the final sync at the beginning of Close() fails because I had the sqlite3 CLI writing (and rolling in the WAL) while closing: cannot verify wal state: stat foo/bar.db-wal: no such file or directory. This might be fine to ignore, but the caller doesn't know that, and calling Close() again will also fail because the underlying database connection has already been closed, so it will always return an error from that point on.

This is very much an edge case of an edge case and it shouldn't ever really happen, but the way database Close() is structured, it's designed to ignore all errors and just continue. It might be better to return early and allow it to be called more than once, so the caller can retry if it's a transient issue, or to handle the errors more gracefully. As it stands, because errors are ignored, the caller has to assume the database is somewhat closed, ignore the returned error, and at most log it.
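For reference, this is roughly the shape of my setup and the defensive Close() handling I've ended up with. A minimal sketch, assuming the v0.3-era library API (litestream.NewDB, litestream.NewReplica, file.NewReplicaClient); the paths are made up:

```go
package main

import (
	"log"

	"github.com/benbjohnson/litestream"
	"github.com/benbjohnson/litestream/file"
)

func main() {
	// Open the database for replication.
	db := litestream.NewDB("foo/bar.db")

	// Replicate to a local directory; an S3 client would be wired up the same way.
	replica := litestream.NewReplica(db, "file")
	replica.Client = file.NewReplicaClient("/var/backups/bar")
	db.Replicas = append(db.Replicas, replica)

	if err := db.Open(); err != nil {
		log.Fatalf("open: %v", err)
	}

	// ... application runs; another process may be writing to the database ...

	// Close() runs a final sync first. If another client rolled the WAL away
	// in the meantime, that sync fails, the underlying connection is torn
	// down anyway, and a second Close() also errors. So from the caller's
	// side the only real option is to log the error and move on.
	if err := db.Close(); err != nil {
		log.Printf("litestream close (ignored): %v", err)
	}
}
```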

benbjohnson commented

Do you know if it's your sqlite3 CLI or Litestream that's removing the WAL? Litestream sets the SQLITE_FCNTL_PERSIST_WAL flag to avoid removing the WAL on close, so maybe you need to set it with your client as well? I'm not sure if that's possible with the CLI, though.

https://github.com/benbjohnson/litestream/blob/c1ae96818888693bddcf4913bd8dcb9baa55eb05/litestream.go#L55C1-L64
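If you're opening the database from Go with mattn/go-sqlite3, a connect hook along these lines should keep the WAL around. This is only a sketch: it assumes the driver's SetFileControlInt method, and it defines the SQLITE_FCNTL_PERSIST_WAL opcode (10 in sqlite3.h) locally in case the driver doesn't export a constant for it:

```go
package main

import (
	"database/sql"
	"fmt"

	sqlite3 "github.com/mattn/go-sqlite3"
)

// SQLITE_FCNTL_PERSIST_WAL from sqlite3.h, defined here in case the
// driver doesn't export it.
const fcntlPersistWAL = 10

func init() {
	// Register a driver variant that keeps the -wal file on close,
	// mirroring what Litestream does for its own connection.
	sql.Register("sqlite3-persist-wal", &sqlite3.SQLiteDriver{
		ConnectHook: func(conn *sqlite3.SQLiteConn) error {
			if err := conn.SetFileControlInt("main", fcntlPersistWAL, 1); err != nil {
				return fmt.Errorf("cannot set persistent wal: %w", err)
			}
			return nil
		},
	})
}

func main() {
	db, err := sql.Open("sqlite3-persist-wal", "foo/bar.db")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	// ... writes from this client no longer delete the WAL on close ...
}
```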

hifi commented

It's definitely the CLI that's removing the WAL, but Litestream not closing cleanly after that is slightly problematic: it's hard for the caller to know whether the database has been closed or not, since the function returns an error and can't be called again.

Definitely not something one would expect under normal circumstances, but not being able to retry Close(), or to "safely" ignore its error, is what I'm looking at here.

hifi commented

I'm currently testing a race: I start replicating, call Sync() once, and immediately afterwards run sqlite3 once to perform a write. Somehow it succeeds in rolling the WAL, but that shouldn't be possible if Litestream holds a read lock, right? I've verified the read lock is in fact acquired successfully before the CLI goes in and does damage.

It doesn't always happen, so there seems to be a race somewhere that I can't prevent by calling Sync() manually once on startup. My current theory is that the read lock doesn't always prevent the WAL file from being rolled over; sometimes it's effectively a no-op for that.
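To make the expectation concrete, this is the kind of experiment I've been running: hold a read transaction on one connection, then try to checkpoint from another. A sketch assuming the mattn/go-sqlite3 driver; with the reader still open, the TRUNCATE checkpoint should report busy=1 rather than resetting the WAL:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// Two separate connections to the same WAL-mode database.
	reader, err := sql.Open("sqlite3", "test.db?_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer reader.Close()

	writer, err := sql.Open("sqlite3", "test.db?_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer writer.Close()

	if _, err := writer.Exec(`CREATE TABLE IF NOT EXISTS t (x)`); err != nil {
		log.Fatal(err)
	}

	// Open a read transaction and touch the database so the read lock is
	// actually taken, similar to Litestream's long-running read lock.
	tx, err := reader.Begin()
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()
	var n int
	if err := tx.QueryRow(`SELECT COUNT(*) FROM t`).Scan(&n); err != nil {
		log.Fatal(err)
	}

	// Write something so there are frames in the WAL, then try to
	// checkpoint. With the reader still open, busy should be 1 and the
	// WAL must not be reset/truncated.
	if _, err := writer.Exec(`INSERT INTO t VALUES (1)`); err != nil {
		log.Fatal(err)
	}
	var busy, walFrames, checkpointed int
	if err := writer.QueryRow(`PRAGMA wal_checkpoint(TRUNCATE)`).
		Scan(&busy, &walFrames, &checkpointed); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("busy=%d wal=%d checkpointed=%d\n", busy, walFrames, checkpointed)
}
```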

After a bit of testing, it seems that to be safe one should always use .filectrl persist_wal 1 when connecting to a database that Litestream is replicating.

To do this on startup, one can put the meta-command in a sqlite init script.

Create an init script, sqlite3.init:

    PRAGMA wal_autocheckpoint = 0;
    .filectrl persist_wal 1

Then launch the CLI with it:

    sqlite3 --init sqlite3.init database.sqlite

In my case, I created a wrapper script called /usr/bin/sqlite3 and moved my sqlite3 binary to /usr/bin/sqlite3.real to hopefully prevent myself from triggering this issue.

I'm pretty satisfied with this solution for now.

Thanks for raising this issue and the response!

hifi commented

This was actually worse than I thought and eventually led to database corruption; opened #519 with a fix.