Aiven-Open/pghoard

Replace json state file as a source of truth

rdunklau opened this issue · 0 comments

Hello,

The JSON state file is used as the source of truth for the last_flushed_lsn when using the walreceiver.
This is prone to errors, as the json file is persisted only locally, and asynchronously.
This means that if we were to stop the walreceiver process on a machine and start it up elsewhere, we would lose WAL in between.
I think we should use the object storage as the source of truth instead.
In that particular case, we could probably get away with:

  • reading the value from the json_state_file if it exists. It might be stale, in which case we would simply re-archive WALs that we already archived. That's not ideal, but it should be fine unless those already-archived WAL files have been discarded from the server
  • when we don't have a last_flushed_lsn from the json_state_file, list xlogs from the object storage, find the latest one and compute the LSN associated with the one that would follow. It's not clear to me what the implications of a timeline switch are at this point.
    A problem with this is that listing the full xlog directory on the object storage might be expensive. Having a concept similar to pgbackrest's manifest could help with that: persisting a file on the object storage acting as a global metadata for the whole backup site.
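To make the fallback concrete, here is a rough sketch of how the resume LSN could be derived from the archived segment names. It is not pghoard code: `next_lsn_after` and `resume_lsn` are hypothetical names, the 16 MiB segment size is the PostgreSQL default and is assumed here, and picking the lexicographically greatest name (which prefers the highest timeline) is only one possible answer to the timeline question raised above.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumption: default PostgreSQL WAL segment size (16 MiB).
WAL_SEG_SIZE = 16 * 1024 * 1024
# Complete WAL segments are named with 24 uppercase hex digits:
# 8 for the timeline, 8 for the high 32 bits, 8 for the segment index.
WAL_NAME_RE = re.compile(r"^[0-9A-F]{24}$")


def next_lsn_after(wal_name: str, wal_seg_size: int = WAL_SEG_SIZE) -> Tuple[int, str]:
    """Return (timeline, start LSN of the segment following wal_name)."""
    timeline = int(wal_name[0:8], 16)
    log = int(wal_name[8:16], 16)
    seg = int(wal_name[16:24], 16)
    segs_per_log = 0x100000000 // wal_seg_size
    segno = log * segs_per_log + seg
    lsn = (segno + 1) * wal_seg_size
    return timeline, f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def resume_lsn(state_lsn: Optional[str], archived_names: Iterable[str]) -> Optional[str]:
    """Prefer the local state file value, else derive it from the archive listing."""
    if state_lsn is not None:
        # Possibly stale: worst case we re-archive segments we already have.
        return state_lsn
    # Skip .partial, .history and any other non-segment objects.
    wals = [name for name in archived_names if WAL_NAME_RE.match(name)]
    if not wals:
        return None
    # Fixed-width uppercase hex sorts lexicographically in numeric order.
    # Assumption: the highest timeline present is the one to follow.
    _timeline, lsn = next_lsn_after(max(wals))
    return lsn
```

With a manifest on the object storage, `archived_names` could come from that single file instead of a full listing of the xlog directory.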

Another approach would be to mandate the use of a replication slot, and require permission to a "maintenance" db to fetch the restart_lsn of the replication slot (as it's not possible to get it from a replication connection, although a patch has been proposed for that). This is quite invasive as it would require giving "regular" connection permission to the backup user.
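For comparison, the slot-based approach would boil down to something like the following over a regular (non-replication) connection. This is only a sketch: `fetch_restart_lsn` is a hypothetical helper, and it assumes a DB-API connection using the `%s` parameter style (as psycopg2 does). The `pg_replication_slots` view and its `restart_lsn` column are standard PostgreSQL.

```python
from typing import Any, Optional


def fetch_restart_lsn(conn: Any, slot_name: str) -> Optional[str]:
    """Return the slot's restart_lsn as text, or None if it is unavailable."""
    cur = conn.cursor()
    try:
        cur.execute(
            "SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = %s",
            (slot_name,),
        )
        row = cur.fetchone()
    finally:
        cur.close()
    # No row: the slot does not exist. NULL: the slot has not reserved WAL yet.
    if row is None or row[0] is None:
        return None
    return str(row[0])
```

Returning None in both the "no such slot" and "NULL restart_lsn" cases leaves it to the caller to fall back to another source.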

I would personally be more inclined to implement the first solution, but I'm curious to have other opinions on the subject.