benbjohnson/litestream

Is -timestamp broken when restoring?

xrd opened this issue · 1 comments

xrd commented

I've been reviewing the code and think that either my expectations are incorrect about files on s3 created by litestream, or there is a bug in the way that restore works with timestamps.

I expected that if I run litestream, that it is storing checkpoints of the database (https://litestream.io/how-it-works/). If I understand this correctly, I should see a bunch of checkpoints/generations inside s3. I don't see these on s3, so perhaps I have misconfigured litestream. To be clear, I do see litestream publishing a db and WAL into a generations / UUID directory, but only a single file in both.

A note on the system: we had a bug in the production system, so during an event I rebooted the litestream process (and subprocess for the sqlite tool: pocketbase). It used the same database file, but I wonder if this would cause litestream to lose track of the checkpoints/generations and this might be why my timestamp restoration process fails.

At a minimum, the error message (no matching backups found) seems off, because there are backups, just perhaps not at this timestamp (restore works with the same config file but no -timestamp switch).

I have this structure on my production system.

$ find pb/.data.db-litestream/
pb/.data.db-litestream/
pb/.data.db-litestream/generation
pb/.data.db-litestream/generations
pb/.data.db-litestream/generations/b5f5868887f877ae
pb/.data.db-litestream/generations/b5f5868887f877ae/wal
pb/.data.db-litestream/generations/b5f5868887f877ae/wal/00000643.wal
pb/.data.db-litestream/generations/b5f5868887f877ae/wal/00000644.wal

Does this indicate I have only two checkpoints? And, since there are tens of thousands of records spanning an event that was three or four hours, this indicates there was an issue with litestream or my configuration of litestream? Specifically, should there be many more WAL files?

I also see this in the database itself, there are WAL files there:

pb/data.db
pb/logs.db
pb/data.db-shm
pb/data.db-wal
pb/logs.db-wal
pb/storage

Discussed in #563

Originally posted by xrd February 1, 2024
Is this a bug or am I misunderstanding the -timestamp switch?

I have a database stored on s3 (actually minio, but it works).

Restore works and I see items in the database on 1-29-2024 and 1-31-2024.

But, when I re-run the restore command adding the -timestamp 2024-01-30T18:00:00Z switch, litestream complains with no matching backups found.

(Also, just an aside, but if I look at the files stored inside s3 (minio), I only see a 10MB file and then a 119 byte WAL file. I'm wondering if I am not looking at all the files, because how can that expand to 369MB when the db file is restored?)

chrisdawson@nixos:~/litestream-tmp$ cat event.yml 
dbs:
  - path: blah.db
    replicas:
      - url: s3://xxx/xxx
        endpoint: https://examplecom
        access-key-id: xxx
        secret-access-key: xxx

chrisdawson@nixos:~/litestream-tmp$ ./litestream restore -config event.yml ./blah.db
time=2024-02-01T12:28:58.343-05:00 level=INFO msg="restoring snapshot" db=/home/chrisdawson/litestream-tmp/blah.db replica=s3 generation=b5f5868887f877ae index=1594 path=/home/chrisdawson/litestream-tmp/blah.db.tmp
time=2024-02-01T12:29:00.183-05:00 level=INFO msg="restoring wal files" db=/home/chrisdawson/litestream-tmp/blah.db replica=s3 generation=b5f5868887f877ae index_min=1594 index_max=1594
time=2024-02-01T12:29:00.266-05:00 level=INFO msg="downloaded wal" db=/home/chrisdawson/litestream-tmp/blah.db replica=s3 generation=b5f5868887f877ae index=1594 elapsed=83.115207ms
time=2024-02-01T12:29:00.278-05:00 level=INFO msg="applied wal" db=/home/chrisdawson/litestream-tmp/blah.db replica=s3 generation=b5f5868887f877ae index=1594 elapsed=11.732843ms
time=2024-02-01T12:29:00.278-05:00 level=INFO msg="renaming database from temporary location" db=/home/chrisdawson/litestream-tmp/blah.db replica=s3
chrisdawson@nixos:~/litestream-tmp$ ls -lah blah.db 
-rw------- 1 chrisdawson users 369M Feb  1 12:29 blah.db
chrisdawson@nixos:~/litestream-tmp$ rlwrap sqlite3 ./blah.db 'select created from logs order by created desc limit 1'
2024-01-31 16:25:36.484Z
chrisdawson@nixos:~/litestream-tmp$ rlwrap sqlite3 ./blah.db 'select created from logs order by created asc limit 1'
2024-01-29 00:11:06.279Z
chrisdawson@nixos:~/litestream-tmp$ mv blah.db old.db
chrisdawson@nixos:~/litestream-tmp$ ./litestream restore -timestamp 2024-01-30T18:00:00Z -config event.yml ./blah.db
time=2024-02-01T12:29:44.983-05:00 level=ERROR msg="failed to run" error="no matching backups found"

wallflower-wal
wallflower

xrd commented

I rebooted the litestream process and now see a bunch of files inside the WAL directory. I assume this means litestream was not running correctly (though to be honest this is confusing because I didn't reconfigure it before rebooting). This is probably why I got the message "no matching backups" because there were not WAL files to recover from at that date.