Restore fails when there are many segments in a wal
Codemonkey1973 opened this issue
I've recently started using litestream (0.3.13) to replicate a database to AWS S3. The database sees a lot of writes (incoming telemetry data from a few thousand remote devices) but relatively few reads (a website that lets users view their device data), plus a few periodic housekeeping jobs.
I've been trying to restore a copy of the database to a development machine, and it failed almost every time, which is a bit annoying to say the least! I've been poking at it for a few days now and I think I've figured out what's causing my problems.
Each time I try a restore, it downloads the snapshot (taken every 12 hours, about 48GB) without any problem. It then starts downloading the WALs (about 1000 of them when I tried this morning), gets to a specific index and stops, eventually exiting with a connection closed error.
Initially I thought it was just a flaky connection or something, so I ran the restore command with -parallelism 1 so I could better see what was going on.
Each time, it got to one specific index (0x000108f5 / 67829) and then it would eventually quit with a connection closed error.
I figured that adding some retries might help, so I cloned the source, poked at it a bit, found where the WALs are downloaded, and made a simple change to retry a few times if a WAL download fails.
My modification is in replica.go, in the Restore() function: I changed the code after line 1166 to:
startTime := time.Now()

// Retry the WAL download a few times instead of failing on the first error.
// (err is declared outside the loop so the check below sees the last result.)
var err error
for x := 0; x < 50; x++ {
	if err = r.downloadWAL(ctx, opt.Generation, index, walSegmentMap[index], tmpPath); err == nil {
		break
	}
	r.Logger().Info(fmt.Sprintf("download of wal file at index %d failed at attempt %d", index, x))
}
if err != nil {
	err = fmt.Errorf("cannot download wal %s/%08x: %w", opt.Generation, index, err)
}
This resulted in the following in the log:
time=2024-05-21T09:47:27.790+01:00 level=INFO msg="applied wal" db=/database/telemetry.db replica=s3 generation=3b1e1e70353523b5 index=67827 elapsed=305.002486ms
time=2024-05-21T09:47:30.099+01:00 level=INFO msg="downloaded wal" db=/database/telemetry.db replica=s3 generation=3b1e1e70353523b5 index=67828 elapsed=2.613678192s
time=2024-05-21T09:47:30.423+01:00 level=INFO msg="applied wal" db=/database/telemetry.db replica=s3 generation=3b1e1e70353523b5 index=67828 elapsed=323.935628ms
time=2024-05-21T09:49:35.256+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 0" db=/database/telemetry.db replica=s3
time=2024-05-21T09:51:53.301+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 1" db=/database/telemetry.db replica=s3
time=2024-05-21T09:54:06.738+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 2" db=/database/telemetry.db replica=s3
time=2024-05-21T09:56:20.262+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 3" db=/database/telemetry.db replica=s3
time=2024-05-21T09:58:34.568+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 4" db=/database/telemetry.db replica=s3
time=2024-05-21T10:00:53.648+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 5" db=/database/telemetry.db replica=s3
time=2024-05-21T10:03:14.053+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 6" db=/database/telemetry.db replica=s3
time=2024-05-21T10:05:27.657+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 7" db=/database/telemetry.db replica=s3
time=2024-05-21T10:07:47.624+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 8" db=/database/telemetry.db replica=s3
time=2024-05-21T10:09:58.974+01:00 level=INFO msg="download of wal file at index 67829 failed at attempt 9" db=/database/telemetry.db replica=s3
time=2024-05-21T10:09:58.978+01:00 level=INFO msg="downloaded wal" db=/database/telemetry.db replica=s3 generation=3b1e1e70353523b5 index=67829 elapsed=22m28.878682695s
time=2024-05-21T10:09:59.553+01:00 level=INFO msg="applied wal" db=/database/telemetry.db replica=s3 generation=3b1e1e70353523b5 index=67829 elapsed=569.055294ms
time=2024-05-21T10:09:59.817+01:00 level=INFO msg="downloaded wal" db=/database/telemetry.db replica=s3 generation=3b1e1e70353523b5 index=67830 elapsed=839.020938ms
time=2024-05-21T10:10:00.129+01:00 level=INFO msg="applied wal" db=/database/telemetry.db replica=s3 generation=3b1e1e70353523b5 index=67830 elapsed=311.867642ms
Eventually the restore completed and appears to have been successful. Note that it took over 22 minutes to download that one WAL, much longer than the 7 minutes it took to download the initial 48GB snapshot.
However, while it was doing the retries I added, I had a look in the AWS S3 console to see if there was a problem with that specific WAL index. While most of the WALs consist of just a few tens of individual segments, this particular one consisted of over 1000.
I suspect this WAL is created at a time when a long-running job is holding a lock on the database, so the data in the WAL can't be checkpointed back into the main database file for several minutes, while a lot of new data is still coming in.
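In case it helps anyone confirm (or refute) a similar theory, this is a minimal sketch of what I'd run against the live database to watch checkpoint progress while the housekeeping job is active; the path and the go-sqlite3 driver are just my assumptions, nothing litestream-specific:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver; any SQLite driver would do
)

func main() {
	// Assumed path of the live database on the production host.
	db, err := sql.Open("sqlite3", "/database/telemetry.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// PRAGMA wal_checkpoint(PASSIVE) attempts a non-blocking checkpoint and reports:
	//   busy         -> 1 if the checkpoint could not run at all
	//   logFrames    -> total frames currently in the -wal file
	//   checkpointed -> frames already copied back into the main db file
	var busy, logFrames, checkpointed int
	if err := db.QueryRow(`PRAGMA wal_checkpoint(PASSIVE)`).
		Scan(&busy, &logFrames, &checkpointed); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("busy=%d wal_frames=%d checkpointed=%d\n", busy, logFrames, checkpointed)
}

If wal_frames keeps climbing while checkpointed stalls for the duration of the job, that would match the theory: the WAL can't be reset, so that one index just keeps accumulating segments.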
Basically, I think the restore is giving up because of how long it takes to download all of that WAL's segments. It would be nice if whatever timeout is being hit could be increased or made configurable.
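For what it's worth, the kind of knob I have in mind would look something like the sketch below. retryWAL, opt.WALRetries and opt.WALRetryInterval are purely hypothetical names to illustrate the idea (the retry budget coming from the restore configuration rather than being hard-coded); the call shape just mirrors the Restore() code quoted above.

// Hypothetical helper; needs "context" and "time" imported in replica.go.
func retryWAL(ctx context.Context, attempts int, wait time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		// Wait before the next attempt, but stop if the restore is cancelled.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
	}
	return err
}

// Call site inside Restore(), replacing the bare downloadWAL call:
//
//	err := retryWAL(ctx, opt.WALRetries, opt.WALRetryInterval, func() error {
//		return r.downloadWAL(ctx, opt.Generation, index, walSegmentMap[index], tmpPath)
//	})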