"pg_probackup-15 archive-get" ignores archived wals while recovering to latest state
gsl23 opened this issue · 2 comments
Hi guys ! I'm testing wal archive recovery and cant to understand why archive-get just lose some last wals :
postgresql.auto.conf:
Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
# recovery settings added by pg_probackup restore of backup SE7QA1 at '2024-05-29 15:19:25+05'
## recovery settings
recovery_target_timeline = 'current'
recovery_target_action = 'pause'
restore_command = '"/usr/bin/pg_probackup-15" archive-get -B "/backup/postgres" --instance "pgw-db-p02" --wal-file-path=%p --wal-file-name=%f --remote-host=bkp_host'
wals in pg_probackup archive :
-rw-------. 1 postgres postgres 95587 May 29 10:41 0000000A0000000000000031.gz
-rw-------. 1 postgres postgres 95648 May 29 11:11 0000000A0000000000000032.gz
-rw-------. 1 postgres postgres 96702 May 29 11:42 0000000A0000000000000033.gz
-rw-------. 1 postgres postgres 106265 May 29 12:12 0000000A0000000000000034.gz
-rw-------. 1 postgres postgres 95951 May 29 12:41 0000000A0000000000000035.gz
-rw-------. 1 postgres postgres 79297 May 29 12:44 0000000A0000000000000036.gz
-rw-------. 1 postgres postgres 16777216 May 29 12:44 0000000A0000000000000037.partial
-rw-------. 1 postgres postgres 378 May 28 19:11 0000000A.history
-rw-------. 1 postgres postgres 128597 May 29 13:14 0000000B0000000000000037.gz
-rw-------. 1 postgres postgres 79741 May 29 13:18 0000000B0000000000000038.gz
-rw-------. 1 postgres postgres 83013 May 29 13:25 0000000B0000000000000039.gz
-rw-------. 1 postgres postgres 98973 May 29 13:55 0000000B000000000000003A.gz
-rw-------. 1 postgres postgres 100455 May 29 14:25 0000000B000000000000003B.gz
-rw-------. 1 postgres postgres 101462 May 29 14:55 0000000B000000000000003C.gz
-rw-------. 1 postgres postgres 117154 May 29 15:25 0000000B000000000000003D.gz
-rw-------. 1 postgres postgres 422 May 29 12:44 0000000B.history
postgresql.log while recovery :
2024-05-29 15:24:58 +05 [400240]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A0000000000000037, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400240]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 145ms
2024-05-29 15:24:59.040 +05 [400156] LOG: restored log file "0000000A0000000000000037" from archive
2024-05-29 15:24:59.052 +05 [400156] LOG: invalid resource manager ID 114 at 0/370000A0
2024-05-29 15:24:59.052 +05 [400156] LOG: redo done at 0/37000028 system usage: CPU: user: 0.02 s, system: 0.17 s, elapsed: 5.91 s
2024-05-29 15:24:59.052 +05 [400156] LOG: last completed transaction was at log time 2024-05-29 12:44:00.017459+05
2024-05-29 15:24:59 +05 [400247]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A0000000000000037, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400247]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 140ms
2024-05-29 15:24:59.199 +05 [400156] LOG: restored log file "0000000A0000000000000037" from archive
2024-05-29 15:24:59 +05 [400249]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000B.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400249]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 144ms
2024-05-29 15:24:59.369 +05 [400156] LOG: restored log file "0000000B.history" from archive
2024-05-29 15:24:59 +05 [400251]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000C.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400251]: [1-1]: ERROR: pg_probackup archive-get failed to deliver WAL file: 0000000C.history, time elapsed: 186ms
2024-05-29 15:24:59.566 +05 [400156] LOG: selected new timeline ID: 12
2024-05-29 15:24:59 +05 [400254]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400254]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 139ms
2024-05-29 15:24:59.784 +05 [400156] LOG: restored log file "0000000A.history" from archive
2024-05-29 15:24:59.793 +05 [400156] LOG: archive recovery complete
Why it is ended on 0000000A0000000000000037 last completed transaction was at log time 2024-05-29 12:44:00.017459+05 ??
Where is
-rw-------. 1 postgres postgres 79741 May 29 13:18 0000000B0000000000000038.gz
-rw-------. 1 postgres postgres 83013 May 29 13:25 0000000B0000000000000039.gz
etc.... ???
At 2024-05-29 12:44 standby instance have been promoted to primary :
2024-05-29 12:44:17.079 +05 [33061] LOG: replication terminated by primary server
2024-05-29 12:44:17.079 +05 [33061] DETAIL: End of WAL reached on timeline 10 at 0/370000A0.
2024-05-29 12:44:17.079 +05 [33061] FATAL: could not send end-of-streaming message to primary: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
2024-05-29 12:44:17.087 +05 [33048] LOG: invalid resource manager ID 114 at 0/370000A0
2024-05-29 12:44:17.109 +05 [350162] FATAL: could not connect to the primary server: connection to server at "192.168.130.23", port 5432 failed: FATAL: the
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
2024-05-29 12:44:17.112 +05 [33048] LOG: waiting for WAL to become available at 0/370000B8
2024-05-29 12:44:17 +05 [350158]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000A0000000000000036, threads: 1/1, batch: 1/1, compression: zlib
2024-05-29 12:44:17 +05 [350158]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 66ms
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
2024-05-29 12:44:17.887 +05 [33048] LOG: received promote request
2024-05-29 12:44:17.887 +05 [33048] LOG: redo done at 0/37000028 system usage: CPU: user: 0.75 s, system: 0.83 s, elapsed: 63143.22 s
2024-05-29 12:44:17.887 +05 [33048] LOG: last completed transaction was at log time 2024-05-29 12:44:00.017459+05
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
2024-05-29 12:44:17.898 +05 [33048] LOG: selected new timeline ID: 11
cp: cannot stat '/pgsql/data/../wal_archive/0000000A.history': No such file or directory
2024-05-29 12:44:18.037 +05 [33048] LOG: archive recovery complete
2024-05-29 12:44:18.068 +05 [33046] LOG: checkpoint starting: force
2024-05-29 12:44:18.075 +05 [33042] LOG: database system is ready to accept connections
2024-05-29 12:44:18.082 +05 [350180] LOG: pg_cron scheduler started
2024-05-29 12:44:18 +05 [350176]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000B.history, threads: 1/1, batch: 1/1, compression: zlib
2024-05-29 12:44:18 +05 [350176]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 6ms
2024-05-29 12:44:18 +05 [350187]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000A0000000000000037.partial, threads: 1/1, batch: 1/1, compression: zl
2024-05-29 12:44:18 +05 [350187]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 78ms
2024-05-29 12:44:19.011 +05 [33046] LOG: checkpoint complete: wrote 34 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.897 s, sync=0.011
2024-05-29 12:44:19.011 +05 [33046] LOG: checkpoint starting: immediate force wait
2024-05-29 12:44:19.060 +05 [33046] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.009 s, sync=0.001 s
2024-05-29 12:45:00.007 +05 [350180] LOG: cron job 1 starting:
Do we have any chances to apply later wals after this occasion ?
Looks like it is pg_probackup archive-get problem.
I copied all needed wals manually to server local dir and setted up restore by cp :
# restore_command = '"/usr/bin/pg_probackup-15" archive-get -B "/backup/postgres" --instance "pgw-db-p02" --wal-file-path=%p --wal-file-name=%f'
restore_command = 'cp /pgsql/data/../wal_archive2/%f %p'
And now its recovered forward after 0000000B0000000000000037 without any problems:
2024-05-29 17:32:07.935 +05 [439475] LOG: restored log file "0000000A0000000000000032" from archive
2024-05-29 17:32:08.025 +05 [439475] LOG: restored log file "0000000A0000000000000033" from archive
2024-05-29 17:32:08.115 +05 [439475] LOG: restored log file "0000000A0000000000000034" from archive
2024-05-29 17:32:08.193 +05 [439475] LOG: restored log file "0000000A0000000000000035" from archive
2024-05-29 17:32:08.271 +05 [439475] LOG: restored log file "0000000A0000000000000036" from archive
2024-05-29 17:32:08.355 +05 [439475] LOG: restored log file "0000000B0000000000000037" from archive
2024-05-29 17:32:08.436 +05 [439475] LOG: restored log file "0000000B0000000000000038" from archive
2024-05-29 17:32:08.528 +05 [439475] LOG: restored log file "0000000B0000000000000039" from archive
2024-05-29 17:32:08.606 +05 [439475] LOG: restored log file "0000000B000000000000003A" from archive
2024-05-29 17:32:08.693 +05 [439475] LOG: restored log file "0000000B000000000000003B" from archive
2024-05-29 17:32:08.770 +05 [439475] LOG: restored log file "0000000B000000000000003C" from archive
2024-05-29 17:32:08.865 +05 [439475] LOG: restored log file "0000000B000000000000003D" from archive
2024-05-29 17:32:08.946 +05 [439475] LOG: restored log file "0000000B000000000000003E" from archive
2024-05-29 17:32:09.011 +05 [439475] LOG: recovery stopping before commit of transaction 8004, time 2024-05-29 15:46:00.001085+05
2024-05-29 17:32:09.011 +05 [439475] LOG: pausing at the end of recovery
2024-05-29 17:32:09.011 +05 [439475] HINT: Execute pg_wal_replay_resume() to promote.
2024-05-29 17:33:27.613 +05 [439475] LOG: redo done at 0/3E00FB88 system usage: CPU: user: 0.02 s, system: 0.20 s, elapsed: 81.25 s
2024-05-29 17:33:27.613 +05 [439475] LOG: last completed transaction was at log time 2024-05-29 15:45:00.024303+05
So, my test case :
- Set up standby and pg_probackup with wal archiving
- Take full backup from standby
- Promote standby
- Try to restore by using full backup(before promotion) and apply wals to timeline after promotion.