Handle gap in remote xlog sequence
Vampouille opened this issue · 2 comments
Sometimes pghoard fails to upload some wal segments, then restarts and uploads new wal segments. In this case there is a gap in the wal segment sequence on remote storage.
I have implemented some new metrics to detect this case:
- total wal segments: total number of wal segments on remote storage
- continuous wal segments: number of wal segments without any gap, counting backwards from the last uploaded one
- valid basebackups: number of basebackups with no wal segment gap between them and the last uploaded segment
- missing wal segments: wal segments missing on remote storage between basebackups
- missing wal segments at end: wal segments missing near the master position; should normally not go above 1 (replication lag for remote storage)
In the following example:
- total wal segments: 74
- continuous wal segments: 47
- valid basebackups: 2
- missing wal segments: 6
- missing wal segments at end: 2
Legend:
#: basebackup
-: wal segment on remote storage
x: missing wal segment on remote storage
|: current master position
0: wal segment in progress in pghoard (received from pg, being compressed, encrypted, uploaded)
                                        valid basebackup  valid basebackup
                                                ↓                 ↓
#----------#--------------#---xxxxxx------------#-----------------#------------------00|
↑                             ↑↑↑↑↑↑                                                 ↑↑
other valid wal segments      ||||||------------ continuous wal segments ------------||
                              ||||||                                                 ||
                       missing wal segments                 missing wal segments at end
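As a rough illustration (not pghoard code), the metrics above could be derived from a list of WAL file names already fetched from remote storage. The sketch below assumes the default 16 MB segment size (0x100 segments per logical xlog file), PostgreSQL >= 9.3 naming and a single timeline, and leaves out "missing wal segments at end" because that also needs the current master position from the server:

```python
# Rough sketch, not pghoard code: derive the proposed metrics from WAL file
# names already listed from remote storage.  Assumes the default 16 MB segment
# size (0x100 segments per logical xlog file), PostgreSQL >= 9.3 naming and a
# single timeline.

def wal_name_to_segno(name):
    """Convert a 24-hex-char WAL file name into an absolute segment number."""
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return log * 0x100 + seg


def compute_metrics(wal_names, basebackup_start_wals):
    """wal_names: WAL files found on remote storage.
    basebackup_start_wals: the start WAL file name of each remote basebackup."""
    if not wal_names:
        return {"total_wal_segments": 0, "continuous_wal_segments": 0,
                "valid_basebackups": 0, "missing_wal_segments": 0}

    segnos = sorted(wal_name_to_segno(name) for name in wal_names)
    total = len(segnos)

    # Walk backwards from the newest uploaded segment until the first gap.
    continuous = 1
    for older, newer in zip(reversed(segnos[:-1]), reversed(segnos[1:])):
        if newer - older != 1:
            break
        continuous += 1

    # A basebackup is "valid" if every WAL segment from its start up to the
    # newest uploaded segment is present, i.e. it starts inside the
    # continuous tail of the sequence.
    first_continuous = segnos[-1] - continuous + 1
    valid_basebackups = sum(
        1 for bb in basebackup_start_wals
        if wal_name_to_segno(bb) >= first_continuous
    )

    # Any hole in the uploaded range counts as a missing segment.  The
    # "missing wal segments at end" metric additionally needs the current
    # master WAL position from PostgreSQL, so it is left out here.
    missing = (segnos[-1] - segnos[0] + 1) - total

    return {
        "total_wal_segments": total,
        "continuous_wal_segments": continuous,
        "valid_basebackups": valid_basebackups,
        "missing_wal_segments": missing,
    }
```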
To implement those metrics, pghoard needs to list the files uploaded to remote storage and keep this list up to date.
Scanning remote storage (only on startup) can also improve cleanup: when there is a gap in the wal sequence, pghoard stops deleting useless wal segments. With the list of uploaded files we could delete all wal segments older than the first basebackup even when there is a gap.
https://github.com/aiven/pghoard/blob/master/pghoard/pghoard.py#L255-L258
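For the cleanup part, a purely illustrative approach once the remote listing is available: drop every WAL segment older than the start segment of the oldest remaining basebackup, regardless of gaps. `list_remote_wal_names` and `delete_remote_wal` below are placeholders for whatever the storage transfer layer actually provides, and `wal_name_to_segno` is the helper from the earlier sketch:

```python
# Illustrative only: with a full remote listing, cleanup no longer has to walk
# the WAL sequence segment by segment, so gaps do not stop it.
def cleanup_old_wal(list_remote_wal_names, delete_remote_wal,
                    oldest_basebackup_start_wal):
    # Everything strictly older than the oldest basebackup's start segment is
    # useless for recovery and can be removed, gap or no gap.
    oldest_needed = wal_name_to_segno(oldest_basebackup_start_wal)
    for name in list_remote_wal_names():
        if wal_name_to_segno(name) < oldest_needed:
            delete_remote_wal(name)
```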
Do you have any details of the cases where this has happened? We're uploading a very large number of WAL segments each day and recovering them frequently as well, and haven't really run into this.
Also, if we actually proceed with some kind of bookkeeping, it needs to be opt-in, since large deployments can upload upwards of 100,000 WAL segments each day and iterating through those could be prohibitively slow.
It happens when:
- the replication slot is not used
- the server has high traffic and generates many wal segments
- the server is configured to keep only a few wal segments
- pghoard crashes because of an OOM kill, a full disk, ...
In this case pghoard cannot recover the missing wal segments because the server has already deleted them.
About performance: the scan is done only at startup. I tested with around 80,000 wal segments on remote storage; it blocked pghoard's startup for about 30 seconds, so with higher traffic it can take much longer.
I think it's important to know exactly what is on remote storage. If pghoard runs for weeks without restarts or errors, it should be fine. I'm not saying pghoard is unstable, but in environments like docker/kubernetes pghoard can be killed by the orchestrator and may not be restarted, and logs can also disappear. In that case, how do you check whether wal segments are missing? You can check manually with a script (see the sketch below) or test a restoration. So I propose adding these new metrics.
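Such a manual check could look roughly like this, reusing `compute_metrics` from the sketch above; the two `list_remote_*` callables are again placeholders for listing the backup site with the storage backend's own tooling:

```python
# Hypothetical manual check, reusing compute_metrics() from the earlier sketch.
# The list_remote_* callables are placeholders for the storage backend's own
# listing mechanism (cloud CLI, transfer layer, ...).
def manual_check(list_remote_wal_names, list_remote_basebackup_start_wals):
    metrics = compute_metrics(list_remote_wal_names(),
                              list_remote_basebackup_start_wals())
    print(metrics)
    if metrics["missing_wal_segments"] or not metrics["valid_basebackups"]:
        raise SystemExit("WAL gap detected on remote storage")
```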