Unclean close leaves Litestream in weird state
I've been using Litestream as a library and tried a couple of things to abuse it, since the usage pattern is a bit different from running it as a separate process.
What I'm seeing is that the final sync at the start of `Close()` fails because I had the `sqlite3` CLI writing (and checkpointing the WAL) while Litestream was closing: `cannot verify wal state: stat foo/bar.db-wal: no such file or directory`. This might be fine to ignore, but the caller has no way of knowing that, and calling `Close()` again will also fail because the underlying database connection has already been closed, so it always returns an error at that point.
This is very much an edge case of an edge case and shouldn't really ever happen, but the database `Close()` is structured to ignore all errors and just continue. It might be better to either return early and allow `Close()` to be called more than once, so the caller can retry if the failure is transient, or to handle the errors more gracefully. As it stands, because errors are ignored, the caller has to assume the database is somewhat closed, ignore the returned error, and at most log it.
Do you know if it's your `sqlite3` CLI or Litestream that's removing the WAL? Litestream sets `SQLITE_FCNTL_PERSIST_WAL` to avoid removing the WAL on close, so maybe you need to set it with your client as well? I'm not sure if that's possible with the CLI though.
It's definitely the CLI that's removing the WAL, but Litestream not closing cleanly after that is slightly problematic: it's hard for the caller to know whether the database has been closed, since the function returns an error and can't be called again.
Definitely not something one would expect under normal circumstances, but not being able to retry `Close()` or "safely" ignore its error is what I'm looking at here.
I'm currently testing a race: I start replicating, call `Sync()` once, and immediately afterwards run `sqlite3` once to write. Somehow it succeeds in checkpointing the WAL, but that shouldn't be possible if Litestream holds a read lock, right? I've verified the read lock is in fact acquired successfully before the CLI goes in and does damage.
It doesn't always happen, so there seems to be a race somewhere that I can't prevent by calling `Sync()` manually once on startup. My current theory is that the read lock doesn't prevent checkpointing the WAL for whatever reason and is basically a no-op for that, sometimes but not always.
After a bit of testing, it seems that to be safe one should always use `.filectrl persist_wal 1` when connecting to a database that Litestream is replicating.
To do this on startup, one can put the meta-command in a sqlite init script.
Create a sqlite init script `sqlite3.init`:

```
PRAGMA wal_autocheckpoint = 0;
.filectrl persist_wal 1
```

Then point the CLI at it:

```sh
sqlite3 --init sqlite3.init database.sqlite
```
In my case, I created a wrapper script called `/usr/bin/sqlite3` and moved my `sqlite3` binary to `/usr/bin/sqlite3.real`, to hopefully prevent myself from triggering this issue.
I'm pretty satisfied with this solution for now.
Thanks for raising this issue and the response!