Issues using Ktistec with Litestream
EvanKrall opened this issue · 8 comments
A few months back, I set up Litestream to back up the database for my ktistec instance. Everything appeared to work, but recently I became suspicious that nobody ever seemed to interact with my posts. Poking around, I discovered a few bizarre things:
### Background workers failed to start because database is locked
Indeed, my posts weren't being sent to other servers. I think this is because the task workers (including the transfer task, which posts to all my followers' inboxes, if I understand correctly) didn't spawn correctly on startup, due to `Unhandled exception in spawn: database is locked (SQLite3::Exception)`.

(I'm not totally sure that's what happened, but when I tried to spin up a second copy of `server`, it printed that error.)
The Litestream docs say this type of issue can happen because Litestream needs to lock the database for a few seconds at a time here and there, and recommend setting `busy_timeout` to 5000: https://litestream.io/tips/#busy-timeout
Maybe background tasks could be monitored by the main thread?
### SQLite database not written to; strange temp file in memory
While poking around with the `sqlite3` command, I noticed that my ktistec.db file had no table rows after 2023-02-20, despite the fact that Ktistec was happily up and running.

With `lsof`, I could see that ktistec had a SQLite temp file open (`/var/tmp/etilqs_c2fd80c78073d148`), but that file had been deleted.
```
$ sudo lsof -p 2958312
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
server 2958312 root cwd DIR 0,760 4096 26872327 /app
server 2958312 root rtd DIR 0,760 4096 27003226 /
server 2958312 root txt REG 0,760 54304568 26872377 /bin/server
server 2958312 root mem-r REG 253,0 22549632 /db/ktistec.db-shm (stat: No such file or directory)
server 2958312 root mem REG 253,0 26872377 /bin/server (stat: No such file or directory)
server 2958312 root 0u CHR 1,3 0t0 6 /dev/null
server 2958312 root 1w FIFO 0,13 0t0 460372260 pipe
server 2958312 root 2w FIFO 0,13 0t0 460372261 pipe
server 2958312 root 3r FIFO 0,13 0t0 460372287 pipe
server 2958312 root 4w FIFO 0,13 0t0 460372287 pipe
server 2958312 root 5ur REG 253,0 398688256 22549396 /db/ktistec.db
server 2958312 root 6u REG 253,0 584405552 22549585 /db/ktistec.db-wal
server 2958312 root 7ur REG 253,0 1146880 22549632 /db/ktistec.db-shm
server 2958312 root 8u a_inode 0,14 0 13450 [eventpoll]
server 2958312 root 9r FIFO 0,13 0t0 460371731 pipe
server 2958312 root 10w FIFO 0,13 0t0 460371731 pipe
server 2958312 root 11u sock 0,8 0t0 460371732 protocol: TCP
server 2958312 root 13u REG 0,760 172200 27003212 /var/tmp/etilqs_c2fd80c78073d148 (deleted)
server 2958312 root 15u sock 0,8 0t0 615526522 protocol: TCP
server 2958312 root 16u sock 0,8 0t0 615525821 protocol: TCP
server 2958312 root 17u sock 0,8 0t0 872047571 protocol: TCP
```
I saved a copy of that deleted file using this trick, but I can't tell what kind of data it is.
I was hoping that killing ktistec would cause the data to get written back to disk, but either that was misguided or I did it incorrectly. Now my Ktistec is up and running again (and... seems to be actually persisting data to disk?), but is missing all data between 2023-02-20 and today's restart.
I've got the etilqs file, and inspecting it with `less` shows a lot of random binary interspersed with some recognizable strings: `Task::Receive`, `Task::RefreshActor`, some recent dates (almost all from the last 2 days), and URLs of other people's posts (mostly people I follow, but not entirely). The `file` command has no idea what kind of file this is.
I don't know what I expect to come of writing this issue. Maybe a cautionary tale for anyone else looking to use litestream with ktistec? I suspect the core problems probably stem from some misconfiguration on my end, though.
@EvanKrall how often do you restart your instance? can you tell if you restarted on 2-20 or if it was running? i'm going to see if i can replicate this locally.
I'm pretty sure Feb 20 was the last restart.
@EvanKrall any chance you have any logs from that date, or any dates, that you'd be willing to share?
as a first step, i am going to upgrade the database support required to v0.20.0 which allows pragmas via the connection string.
419add7 adds support for pragmas including busy-timeout
you should be able to start the server with something like:
```shell
KTISTEC_DB=/db/ktistec.db\?busy_timeout=5000 ... /bin/server ...
```
when the backup starts, activity in the server will block until the backup is complete.
note that main includes a few other commits that i haven't tested extensively in production.
more detail on other pragmas here.
Thanks! I suspect the pragmas will help avoid this situation in the future. Unfortunately, I don't have any logs from that day because I did the restart by deleting the pod, and I've already closed the terminal which had some of the logs. (Which was definitely the wrong way to do it - I should've sent a signal to the ktistec server instead.)
One of these days, I'll set up some sort of log archiving system.