toddsundsted/ktistec

Issues using Ktistec with Litestream

EvanKrall opened this issue · 8 comments

A few months back, I set up Litestream to back up the database for my ktistec instance. Everything appeared to work, but recently I became suspicious that nobody ever seemed to interact with my posts. Poking around, I discovered a few bizarre things:

Background workers failed to start because database is locked

Indeed, my posts weren't being sent to other servers. I think this is because the task workers (including the transfer task, which posts to all my followers' inboxes, if I understand correctly) didn't spawn correctly on startup, due to Unhandled exception in spawn: database is locked (SQLite3::Exception).

(I'm not totally sure that's what happened, but when I tried to spin up a second copy of the server, it would print out that error.)
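If you want to reproduce that, roughly what I did was start a second server process against the same database file while the first instance was still running (other env vars and flags elided here; adjust for your own setup):

KTISTEC_DB=/db/ktistec.db ... /bin/server ...

The second process prints the database is locked exception while spawning its workers.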

The Litestream docs say this type of issue can happen because Litestream needs to lock the database for a few seconds at a time, and they recommend setting busy_timeout to 5000: https://litestream.io/tips/#busy-timeout

Maybe background tasks could be monitored by the main thread?

SQLite database not written to; strange temp file in memory

While poking around with the sqlite3 command, I noticed that my ktistec.db file had no rows newer than 2023-02-20, despite the fact that Ktistec was happily up and running.
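Something along these lines is how you can check for the newest row (the objects table and created_at column are guesses at the schema from memory, so adjust as needed):

$ sqlite3 /db/ktistec.db 'SELECT MAX(created_at) FROM objects;'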

With lsof, I could see that ktistec had a sqlite temp file open (/var/tmp/etilqs_c2fd80c78073d148), but that file had been deleted.

$ sudo lsof -p 2958312
COMMAND     PID USER   FD      TYPE DEVICE  SIZE/OFF      NODE NAME
server  2958312 root  cwd       DIR  0,760      4096  26872327 /app
server  2958312 root  rtd       DIR  0,760      4096  27003226 /
server  2958312 root  txt       REG  0,760  54304568  26872377 /bin/server
server  2958312 root  mem-r     REG  253,0            22549632 /db/ktistec.db-shm (stat: No such file or directory)
server  2958312 root  mem       REG  253,0            26872377 /bin/server (stat: No such file or directory)
server  2958312 root    0u      CHR    1,3       0t0         6 /dev/null
server  2958312 root    1w     FIFO   0,13       0t0 460372260 pipe
server  2958312 root    2w     FIFO   0,13       0t0 460372261 pipe
server  2958312 root    3r     FIFO   0,13       0t0 460372287 pipe
server  2958312 root    4w     FIFO   0,13       0t0 460372287 pipe
server  2958312 root    5ur     REG  253,0 398688256  22549396 /db/ktistec.db
server  2958312 root    6u      REG  253,0 584405552  22549585 /db/ktistec.db-wal
server  2958312 root    7ur     REG  253,0   1146880  22549632 /db/ktistec.db-shm
server  2958312 root    8u  a_inode   0,14         0     13450 [eventpoll]
server  2958312 root    9r     FIFO   0,13       0t0 460371731 pipe
server  2958312 root   10w     FIFO   0,13       0t0 460371731 pipe
server  2958312 root   11u     sock    0,8       0t0 460371732 protocol: TCP
server  2958312 root   13u      REG  0,760    172200  27003212 /var/tmp/etilqs_c2fd80c78073d148 (deleted)
server  2958312 root   15u     sock    0,8       0t0 615526522 protocol: TCP
server  2958312 root   16u     sock    0,8       0t0 615525821 protocol: TCP
server  2958312 root   17u     sock    0,8       0t0 872047571 protocol: TCP

I saved a copy of that deleted file using this trick, but I can't tell what kind of data it is.
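For reference, the trick amounts to copying the still-open file descriptor out of /proc; with the PID and fd number from the lsof output above it was roughly:

$ sudo cp /proc/2958312/fd/13 /root/etilqs_saved

(/root/etilqs_saved is just an example destination path.)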

I was hoping that killing ktistec would cause the data to get written back to disk, but either that was misguided or I did it incorrectly. Now my Ktistec is up and running again (and... seems to be actually persisting data to disk?), but is missing all data between 2023-02-20 and today's restart.

I've got the etilqs file, and inspecting it with less shows a lot of random binary data interspersed with some recognizable strings: Task::Receive, Task::RefreshActor, some recent dates (almost all from the last 2 days), and URLs of other people's posts (mostly from people I follow, but not entirely). The file command has no idea what kind of file this is.

I don't know what I expect to come of writing this issue. Maybe a cautionary tale for anyone else looking to use litestream with ktistec? I suspect the core problems probably stem from some misconfiguration on my end, though.

@EvanKrall how often do you restart your instance? can you tell if you restarted on 2-20 or if it was still running? i'm going to see if i can replicate this locally.

I'm pretty sure Feb 20 was the last restart.

@EvanKrall any chance you have any logs from that date, or any dates, that you'd be willing to share?

as a first step, i am going to upgrade the required database library to v0.20.0, which allows pragmas via the connection string.

419add7 adds support for pragmas, including busy_timeout.

you should be able to start the server with something like:

KTISTEC_DB=/db/ktistec.db\?busy_timeout=5000 ... /bin/server ...

when the backup starts, activity in the server will block until the backup is complete.

note that main includes a few other commits that i haven't tested extensively in production.

more detail on other pragmas here.

Thanks! I suspect the pragmas will help avoid this situation in the future. Unfortunately, I don't have any logs from that day because I did the restart by deleting the pod, and I've already closed the terminal which had some of the logs. (Which was definitely the wrong way to do it - I should've sent a signal to the ktistec server instead.)

One of these days, I'll set up some sort of log archiving system.