"Waiting for file matching" error in PostgreSQL 9.4
Yesterday I upgraded Postgres to version 9.4, and last night the cron job (a shell script that makes a base backup) failed. The part of the script that calls omnipitr looks like this:
$pitr_bin_path/omnipitr-backup-slave -D $pg_data_dir \
--call-master -d template1 -h $master_host -U $master_user -P $master_port \
-s gzip=$xlog_dir/realtime \
-dl gzip=$xlog_dir/$dt_today \
--log $pitr_dir/log/base-backup.log \
--pid-file $pitr_dir/omnipitr-backup-slave.pid \
--temp-dir $pitr_dir/tmp \
--psql-path $pg_bin_path/psql \
--pgcontroldata-path $pg_bin_path/pg_controldata \
-v
and I found this in the log file:
2015-01-21 02:07:23.105315 +0800 : 27703 : omnipitr-backup-slave : LOG : tar: Removing leading '/' from member names
2015-01-21 02:07:23.113388 +0800 : 27703 : omnipitr-backup-slave : LOG : ==============================================
2015-01-21 02:07:23.113714 +0800 : 27703 : omnipitr-backup-slave : LOG : Timer [Compressing $PGDATA] took: 2368.344s
2015-01-21 02:20:03.020751 +0800 : 27703 : omnipitr-backup-slave : LOG : Timer [SELECT pg_stop_backup()] took: 759.907s
2015-01-21 02:20:03.021425 +0800 : 27703 : omnipitr-backup-slave : LOG : pg_stop_backup() returned 3A/41EF2790.
2015-01-21 02:20:03.021873 +0800 : 27703 : omnipitr-backup-slave : LOG : Timer [Making data archive] took: 4800.603s
2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG : Waiting for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG : in directory xlog_archive/realtime
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : Waited 1 hour for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : , but it did not appear. Something is wrong. No sense in waiting longer.
But the file in the directory is:
-rw------- 1 postgres postgres 277 Jan 21 02:07 000000010000002800000067.000369A8.backup.gz
When I execute zcat 000000010000002800000067.000369A8.backup.gz, I get:
START WAL LOCATION: 28/CE0369A8 (file 000000010000002800000067)
STOP WAL LOCATION: 3A/41EF2790 (file 000000010000003A00000020)
CHECKPOINT LOCATION: 29/8FF4858
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2015-01-21 01:00:41 CST
LABEL: omnipitr_slave_backup_with_master_callback
STOP TIME: 2015-01-21 02:07:23 CST
The file omnipitr is waiting for doesn't match the backup filename. Is this a bug in omnipitr with PostgreSQL 9.4?
That doesn't look good, but it's hard for me to tell what happened without much more information: what was the content of the .../realtime directory after some time (all files), and what was the current xlog position on the master when the backup failed. I'll try to replicate it, but it could be difficult.
thx.
The files in the xlog_archive/realtime directory are the full xlogs that the master sent. The script runs every day at 1:00 to make a base backup of the current day into $dt_today on a slave, then moves yesterday's full xlogs to xlog_archive/$yesterday, compresses the $yesterday directory, and rsyncs it to the remote backup server.
Isn't the xlog position when the backup failed "STOP WAL LOCATION: 3A/41EF2790 (file 000000010000003A00000020)"? I think omnipitr executed 'select pg_stop_backup()' successfully and got back the xlog location 3A/41EF2790, but then it couldn't find the file 0000000100000028000000CE.000369A8.backup.gz - the file in realtime is 000000010000002800000067.000369A8.backup.gz. It waited for 1 hour, still couldn't find the file, and exited. The file 000000010000002800000067.000369A8.backup.gz is the compressed copy of 000000010000002800000067.000369A8.backup from $PGDATA/pg_xlog on the master.
The other files in realtime all look like 000000010000004E0000000E.gz, i.e. gzipped WAL files.
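To see the mismatch directly, here is a minimal sketch (not omnipitr code; the directory and regex are taken from the log above) that checks which archived .backup.gz files match the pattern omnipitr is waiting for:
#!/usr/bin/perl
# Minimal sketch: list the .backup.gz files in the archive directory and
# check each one against the regex omnipitr reported in its log.
use strict;
use warnings;

my $dir = 'xlog_archive/realtime';    # directory from the log above
my $re  = qr/\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z/;

opendir my $dh, $dir or die "Cannot open $dir: $!";
for my $file ( sort readdir $dh ) {
    next unless $file =~ m/\.backup\.gz\z/;
    printf "%-50s %s\n", $file, $file =~ $re ? 'matches' : 'does not match';
}
closedir $dh;
Running this against the directory listed above would print "000000010000002800000067.000369A8.backup.gz does not match", which is exactly the situation in the log.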
I know what these files are. But again, to be able to tell anything I would need more information - what I specified earlier.
@aven92 I'm not sure it's related, but aren't you missing the "--removal-pause-trigger (-p)" option in the omnipitr-backup-slave call?
@depesz Sorry, do you mean I should provide all the WAL files and a base backup of the database so that you can replicate the problem? I can't provide that, because it's my company's production database.
@denishpatel It doesn't work either; I get the same error:
2015-01-22 12:29:55.739447 +0800 : 14230 : omnipitr-backup-slave : LOG : tar: Removing leading `/' from member names
2015-01-22 12:29:55.750336 +0800 : 14230 : omnipitr-backup-slave : LOG : ==============================================
2015-01-22 12:29:55.750760 +0800 : 14230 : omnipitr-backup-slave : LOG : Timer [Compressing $PGDATA] took: 2173.040s
2015-01-22 13:21:52.785756 +0800 : 14230 : omnipitr-backup-slave : LOG : Timer [SELECT pg_stop_backup()] took: 3117.035s
2015-01-22 13:21:52.798475 +0800 : 14230 : omnipitr-backup-slave : LOG : pg_stop_backup() returned EF/813538B0.
2015-01-22 13:21:52.798804 +0800 : 14230 : omnipitr-backup-slave : LOG : Timer [Making data archive] took: 6424.735s
2015-01-22 13:22:02.936749 +0800 : 14230 : omnipitr-backup-slave : LOG : Waiting for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}000000DD000000BC\.002BCCC0\.backup\.gz\z)/;
2015-01-22 13:22:02.936749 +0800 : 14230 : omnipitr-backup-slave : LOG : in directory xlog_archive/realtime
2015-01-22 14:22:36.805504 +0800 : 14230 : omnipitr-backup-slave : FATAL : Waited 1 hour for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}000000DD000000BC\.002BCCC0\.backup\.gz\z)/;
2015-01-22 14:22:36.805504 +0800 : 14230 : omnipitr-backup-slave : FATAL : , but it did not appear. Something is wrong. No sense in waiting longer.
And the files in realtime don't match:
ls -l xlog_archive/realtime/*back*
-rw------- 1 postgres postgres 273 Jan 22 03:12 xlog_archive/realtime/000000010000009A00000001.0010C8F8.backup.gz
-rw------- 1 postgres postgres 271 Jan 22 11:32 xlog_archive/realtime/00000001000000D500000001.00000EF0.backup.gz
-rw------- 1 postgres postgres 275 Jan 22 12:29 xlog_archive/realtime/00000001000000DD0000005E.002BCCC0.backup.gz
List of files. All of them. And xlog location before, and after, trying to make a backup.
The xlog location before the backup is DD/BC2BCCC0 (file 00000001000000DD0000005E).
The xlog location after the backup is EF/813538B0 (file 00000001000000EF00000040).
Checkpoint location: DD/E28D7518
The xlog files on the slave:
-rw------- 1 postgres postgres 5607283 Jan 22 00:00 000000010000009600000073.gz
-rw------- 1 postgres postgres 5117680 Jan 22 00:00 000000010000009600000074.gz
-rw------- 1 postgres postgres 5643909 Jan 22 00:00 000000010000009600000075.gz
-rw------- 1 postgres postgres 5973144 Jan 22 00:00 000000010000009600000076.gz
...
-rw------- 1 postgres postgres 6141202 Jan 22 16:08 00000001000000FB00000027.gz
-rw------- 1 postgres postgres 5959337 Jan 22 16:08 00000001000000FB00000028.gz
-rw------- 1 postgres postgres 6033669 Jan 22 16:08 00000001000000FB00000029.gz
-rw------- 1 postgres postgres 6497975 Jan 22 16:08 00000001000000FB0000002A.gz
-rw------- 1 postgres postgres 273 Jan 22 03:12 000000010000009A00000001.0010C8F8.backup.gz
-rw------- 1 postgres postgres 271 Jan 22 11:32 00000001000000D500000001.00000EF0.backup.gz
-rw------- 1 postgres postgres 275 Jan 22 12:29 00000001000000DD0000005E.002BCCC0.backup.gz
Is it 100% reproducible in your case? What is the load on the servers (both master and slave)?
@depesz Yes, I have tried it many times; the error reproduces every time.
The top output on the master is:
top - 14:51:39 up 72 days, 19:22, 2 users, load average: 0.87, 1.11, 1.05
Tasks: 622 total, 7 running, 615 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.8%us, 3.9%sy, 0.2%ni, 84.7%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 132119276k total, 111022860k used, 21096416k free, 420608k buffers
Swap: 50331640k total, 0k used, 50331640k free, 100481520k cached
The top output on the slave is:
top - 14:50:44 up 633 days, 0 min, 6 users, load average: 1.30, 1.25, 1.33
Tasks: 574 total, 3 running, 571 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.4%sy, 0.1%ni, 99.0%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 66059200k total, 65535232k used, 523968k free, 10452k buffers
Swap: 50331640k total, 678940k used, 49652700k free, 63185592k cached
Sorry, I finally found out what's wrong.
We changed the WAL segment size to 32MB when we configured (built) Postgres, which is why the log looked like this:
START WAL LOCATION: 28/CE0369A8 (file 000000010000002800000067)
2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG : Waiting for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG : in directory xlog_archive/realtime
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : Waited 1 hour for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : , but it did not appear. Something is wrong. No sense in waiting longer.
The backup start location is 28/CE0369A8. With 16MB WAL segments the file name would be 0000000100000028000000CE, but here it is 000000010000002800000067.
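To make the arithmetic explicit (a quick standalone check, not omnipitr code): the segment number is the byte offset divided by the segment size, so offset 0xCE0369A8 lands in segment 0xCE with 16MB segments but in segment 0x67 with 32MB segments:
#!/usr/bin/perl
# Quick check of the segment number for START WAL LOCATION 28/CE0369A8.
use strict;
use warnings;

my $offset = 0xCE0369A8;
printf "16MB segments: segment %02X\n", int( $offset / ( 16 * 1024 * 1024 ) );   # prints CE
printf "32MB segments: segment %02X\n", int( $offset / ( 32 * 1024 * 1024 ) );   # prints 67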
In the file omnipitr/lib/OmniPITR/Program/Backup/Slave.pm, omnipitr calculates it like this:
my ( $part_1, $part_2, $min_xlog ) = ( $1, $2, $3 );
$part_2 =~ s/(.{1,6})\z//;
my $part_3 = $1;
$self->{ 'meta' }->{ 'xlog-min' } = $min_xlog;
my $expected_filename_suffix = sprintf '%08s%08s.%08s.backup', $part_1, $part_2, $part_3;
So the result is 0000000100000028000000CE, which then causes the "Waiting for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/" wait to time out.
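For reference, running that calculation standalone with the values from the failing backup (28/CE0369A8) shows where the CE comes from. This is just a sketch reproducing the steps above, not omnipitr itself:
#!/usr/bin/perl
# Standalone reproduction of the expected-filename calculation for
# START WAL LOCATION: 28/CE0369A8.
use strict;
use warnings;

my ( $part_1, $part_2 ) = ( '28', 'CE0369A8' );   # series and offset from the backup label
$part_2 =~ s/(.{1,6})\z//;                        # removes '0369A8', leaving 'CE'
my $part_3 = $1;                                  # '0369A8'
my $expected_filename_suffix = sprintf '%08s%08s.%08s.backup', $part_1, $part_2, $part_3;
print "$expected_filename_suffix\n";              # 00000028000000CE.000369A8.backup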
If I add the line $part_2 = sprintf '%X', hex("$part_2")/2; before $expected_filename_suffix is set, it works well.
Can you fix it? Thank you!
The relevant part of the pg_controldata output is:
Database block size: 32768
Blocks per segment of large relation: 65536
WAL block size: 65536
Bytes per WAL segment: 33554432
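One way to avoid hardcoding the division by 2 would be to derive it from the segment size that pg_controldata already reports (omnipitr is already given --pgcontroldata-path). This is only a rough sketch of the idea, not part of the patch below; the data directory path is a placeholder:
#!/usr/bin/perl
# Rough sketch: read "Bytes per WAL segment" from pg_controldata and work out
# how many default-sized (16MB) units fit in one segment (2 for a 32MB build).
use strict;
use warnings;

my $pgdata = '/path/to/pgdata';                     # placeholder path
my ($line) = grep { /^Bytes per WAL segment:/ } `pg_controldata $pgdata`;
die "Could not read segment size from pg_controldata\n" unless defined $line;
( my $bytes = $line ) =~ s/\D//g;                   # e.g. 33554432
my $divisor = $bytes / ( 16 * 1024 * 1024 );        # 2 when segments are 32MB
print "divisor: $divisor\n";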
The diff of my change to omnipitr/lib/OmniPITR/Program/Backup/Slave.pm is:
*** Slave.pm 2015-01-30 12:10:33.000000000 +0800
--- Slave.pm.new 2015-01-30 12:09:21.427151573 +0800
***************
*** 373,378 ****
--- 373,379 ----
my ( $part_1, $part_2, $min_xlog ) = ( $1, $2, $3 );
$part_2 =~ s/(.{1,6})\z//;
+ $part_2 = sprintf '%X', hex("$part_2")/2;
my $part_3 = $1;
$self->{ 'meta' }->{ 'xlog-min' } = $min_xlog;
***************
*** 455,460 ****
--- 456,462 ----
my ( $series, $offset ) = split m{/}, $location;
$offset =~ s/.{0,6}$//;
+ $offset = sprintf '%X', hex("$offset")/2;
my $location_filename = sprintf '%08s%08s%08s', $timeline, $series, $offset;
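As a quick check of the arithmetic (outside OmniPITR), the patched calculation turns the example start location into the file that actually exists in the archive:
#!/usr/bin/perl
# Check that dividing the segment part by 2 maps 28/CE0369A8 to the
# archived file 000000010000002800000067.000369A8.backup.gz.
use strict;
use warnings;

my ( $part_1, $part_2 ) = ( '28', 'CE0369A8' );
$part_2 =~ s/(.{1,6})\z//;                      # 'CE'
my $part_3 = $1;                                # '0369A8'
$part_2 = sprintf '%X', hex($part_2) / 2;       # 'CE' -> '67'
printf "%08s%08s%08s.%08s.backup.gz\n", '1', $part_1, $part_2, $part_3;
# prints 000000010000002800000067.000369A8.backup.gz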
Sorry, but I don't think I'll "fix" it. The simple reason is that I'm very unsure about the potential ramifications of changing the xlog segment size. While you definitely had your reasons to pick such a setting, I don't think it is generally applicable.
And since you have to support your own fork/configuration/build of PostgreSQL anyway, I think it's only logical to modify OmniPITR to work with such modified PostgreSQL on your end.
Given that the problem happens only in custom, modified builds of PostgreSQL, I don't think it's a good idea to patch OmniPITR to handle it.
Making a fork with this patch applied seems simple enough.