omniti-labs/omnipitr

Waiting for file matching error in postgresql 9.4

Closed this issue · 13 comments

Yesterday I upgraded Postgres to version 9.4, and last night the cron job (a shell script that makes a base backup) failed. The part of the script that uses omnipitr looks like this:

 $pitr_bin_path/omnipitr-backup-slave -D $pg_data_dir \
        --call-master -d template1 -h $master_host -U $master_user -P $master_port \
        -s gzip=$xlog_dir/realtime \
        -dl gzip=$xlog_dir/$dt_today \
        --log $pitr_dir/log/base-backup.log \
        --pid-file $pitr_dir/omnipitr-backup-slave.pid \
        --temp-dir $pitr_dir/tmp \
        --psql-path $pg_bin_path/psql \
        --pgcontroldata-path $pg_bin_path/pg_controldata \
        -v

I found the following in the log file:

2015-01-21 02:07:23.105315 +0800 : 27703 : omnipitr-backup-slave : LOG : tar: Removing leading '/' from member names
2015-01-21 02:07:23.113388 +0800 : 27703 : omnipitr-backup-slave : LOG : ==============================================
2015-01-21 02:07:23.113714 +0800 : 27703 : omnipitr-backup-slave : LOG : Timer [Compressing $PGDATA] took: 2368.344s
2015-01-21 02:20:03.020751 +0800 : 27703 : omnipitr-backup-slave : LOG : Timer [SELECT pg_stop_backup()] took: 759.907s
2015-01-21 02:20:03.021425 +0800 : 27703 : omnipitr-backup-slave : LOG : pg_stop_backup() returned 3A/41EF2790.
2015-01-21 02:20:03.021873 +0800 : 27703 : omnipitr-backup-slave : LOG : Timer [Making data archive] took: 4800.603s
2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG : Waiting for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG :  in directory xlog_archive/realtime
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : Waited 1 hour for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : , but it did not appear. Something is wrong. No sense in waiting longer.

But the file actually present in the directory is:

    -rw------- 1 postgres postgres 277 Jan 21 02:07 000000010000002800000067.000369A8.backup.gz

When I execute zcat 000000010000002800000067.000369A8.backup.gz, I see:

    START WAL LOCATION: 28/CE0369A8 (file 000000010000002800000067)
    STOP WAL LOCATION: 3A/41EF2790 (file 000000010000003A00000020)
    CHECKPOINT LOCATION: 29/8FF4858
    BACKUP METHOD: pg_start_backup
    BACKUP FROM: master
    START TIME: 2015-01-21 01:00:41 CST
    LABEL: omnipitr_slave_backup_with_master_callback
    STOP TIME: 2015-01-21 02:07:23 CST

The file omnipitr is waiting for doesn't match the backup filename. Is this a bug in omnipitr with PostgreSQL 9.4?

That doesn't look good, but it's hard for me to tell what happened without much more information - what was the content of the .../realtime directory after some time (all files), and what was the current xlog position on the master when the backup failed. I'll try to replicate it, but it could be difficult.

Thanks.
The files in the directory xlog_archive/realtime are the full xlogs that the master sent. The script runs every day at 1:00 to make a base backup for the current day into $dt_today on the slave, then moves yesterday's full xlogs into xlog_archive/$yesterday, compresses the $yesterday directory, and rsyncs it to the remote backup server.

Isn't the xlog position when the backup failed "STOP WAL LOCATION: 3A/41EF2790 (file 000000010000003A00000020)"? I think omnipitr executed 'select pg_stop_backup()' successfully and got back the xlog location 3A/41EF2790, but it couldn't find the file 0000000100000028000000CE.000369A8.backup.gz - the file actually in realtime is 000000010000002800000067.000369A8.backup.gz. After waiting for 1 hour it still couldn't find the file, so it exited. The file 000000010000002800000067.000369A8.backup.gz is the compressed copy of 000000010000002800000067.000369A8.backup from $PGDATA/pg_xlog on the master.

The files in realtime all look like this: 000000010000004E0000000E.gz, i.e. gzipped WAL files.

I know what these files are. But again - to be able to tell anything I would need more information - what I specified earlier.

@aven92 I'm not sure it's related, but aren't you missing the "--removal-pause-trigger (-p)" option in the omnipitr-backup-slave call?

@depesz Sorry, do you mean that I should provide all the WAL files and a base backup of the database to you so that you can replicate the problem? I can't provide that, because it's my company's production database.

@denishpatel That doesn't help either - the same error:

2015-01-22 12:29:55.739447 +0800 : 14230 : omnipitr-backup-slave : LOG : tar: Removing leading `/' from member names
2015-01-22 12:29:55.750336 +0800 : 14230 : omnipitr-backup-slave : LOG : ==============================================
2015-01-22 12:29:55.750760 +0800 : 14230 : omnipitr-backup-slave : LOG : Timer [Compressing $PGDATA] took: 2173.040s
2015-01-22 13:21:52.785756 +0800 : 14230 : omnipitr-backup-slave : LOG : Timer [SELECT pg_stop_backup()] took: 3117.035s
2015-01-22 13:21:52.798475 +0800 : 14230 : omnipitr-backup-slave : LOG : pg_stop_backup() returned EF/813538B0.
2015-01-22 13:21:52.798804 +0800 : 14230 : omnipitr-backup-slave : LOG : Timer [Making data archive] took: 6424.735s
2015-01-22 13:22:02.936749 +0800 : 14230 : omnipitr-backup-slave : LOG : Waiting for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}000000DD000000BC\.002BCCC0\.backup\.gz\z)/;
2015-01-22 13:22:02.936749 +0800 : 14230 : omnipitr-backup-slave : LOG :  in directory xlog_archive/realtime
2015-01-22 14:22:36.805504 +0800 : 14230 : omnipitr-backup-slave : FATAL : Waited 1 hour for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}000000DD000000BC\.002BCCC0\.backup\.gz\z)/;
2015-01-22 14:22:36.805504 +0800 : 14230 : omnipitr-backup-slave : FATAL : , but it did not appear. Something is wrong. No sense in waiting longer.

and the files in realtime don't match:

ls -l xlog_archive/realtime/*back*
-rw------- 1 postgres postgres 273 Jan 22 03:12 xlog_archive/realtime/000000010000009A00000001.0010C8F8.backup.gz
-rw------- 1 postgres postgres 271 Jan 22 11:32 xlog_archive/realtime/00000001000000D500000001.00000EF0.backup.gz
-rw------- 1 postgres postgres 275 Jan 22 12:29 xlog_archive/realtime/00000001000000DD0000005E.002BCCC0.backup.gz

List of files. All of them. And xlog location before, and after, trying to make a backup.

The xlog location before the backup was DD/BC2BCCC0 (file 00000001000000DD0000005E).
The xlog location after the backup was EF/813538B0 (file 00000001000000EF00000040).
Checkpoint location: DD/E28D7518

The xlog files on the slave:

-rw------- 1 postgres postgres  5607283 Jan 22 00:00 000000010000009600000073.gz
-rw------- 1 postgres postgres  5117680 Jan 22 00:00 000000010000009600000074.gz
-rw------- 1 postgres postgres  5643909 Jan 22 00:00 000000010000009600000075.gz
-rw------- 1 postgres postgres  5973144 Jan 22 00:00 000000010000009600000076.gz
...
-rw------- 1 postgres postgres  6141202 Jan 22 16:08 00000001000000FB00000027.gz
-rw------- 1 postgres postgres  5959337 Jan 22 16:08 00000001000000FB00000028.gz
-rw------- 1 postgres postgres  6033669 Jan 22 16:08 00000001000000FB00000029.gz
-rw------- 1 postgres postgres  6497975 Jan 22 16:08 00000001000000FB0000002A.gz

-rw------- 1 postgres postgres 273 Jan 22 03:12 000000010000009A00000001.0010C8F8.backup.gz
-rw------- 1 postgres postgres 271 Jan 22 11:32 00000001000000D500000001.00000EF0.backup.gz
-rw------- 1 postgres postgres 275 Jan 22 12:29 00000001000000DD0000005E.002BCCC0.backup.gz

Is it 100% reproducible in your case? How is the load on the servers (both master and slave)?

@depesz Yes, I've tried it many times and the error reproduces every time.
The top output on the master is:

top - 14:51:39 up 72 days, 19:22,  2 users,  load average: 0.87, 1.11, 1.05
Tasks: 622 total,   7 running, 615 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.8%us,  3.9%sy,  0.2%ni, 84.7%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  132119276k total, 111022860k used, 21096416k free,   420608k buffers
Swap: 50331640k total,        0k used, 50331640k free, 100481520k cached

The top output on the slave is:

top - 14:50:44 up 633 days, 0 min,  6 users,  load average: 1.30, 1.25, 1.33
Tasks: 574 total,   3 running, 571 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.4%sy,  0.1%ni, 99.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  66059200k total, 65535232k used,   523968k free,    10452k buffers
Swap: 50331640k total,   678940k used, 49652700k free, 63185592k cached

Sorry, I finally found out what's wrong with it.

We changed the WAL segment size to 32MB when we built and configured Postgres, so we see log entries like this:

START WAL LOCATION: 28/CE0369A8 (file 000000010000002800000067)

2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG : Waiting for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 02:20:13.229817 +0800 : 27703 : omnipitr-backup-slave : LOG :  in directory xlog_archive/realtime
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : Waited 1 hour for file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE\.000369A8\.backup\.gz\z)/;
2015-01-21 03:20:59.127915 +0800 : 27703 : omnipitr-backup-slave : FATAL : , but it did not appear. Something is wrong. No sense in waiting longer.

The start backup location is 28/CE0369A8. With the default 16MB WAL segment size the file name would be 0000000100000028000000CE, but with 32MB segments it is actually 000000010000002800000067.
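
For illustration, here is a minimal Perl sketch (not OmniPITR code) of the naming arithmetic: the last 8 hex digits of a 9.4 WAL file name are the low 32 bits of the LSN divided by the segment size, so the same LSN maps to different file names for 16MB and 32MB segments.

    # Sketch only: derive the WAL segment file name for LSN 28/CE0369A8 on
    # timeline 1, once with the default 16MB segment size and once with 32MB.
    use strict;
    use warnings;

    my $timeline = 1;
    my ( $hi, $lo ) = ( 0x28, 0xCE0369A8 );    # LSN 28/CE0369A8

    for my $seg_mb ( 16, 32 ) {
        my $seg_no = int( $lo / ( $seg_mb * 1024 * 1024 ) );
        printf "%2d MB segments: %08X%08X%08X\n", $seg_mb, $timeline, $hi, $seg_no;
    }

    # Prints:
    # 16 MB segments: 0000000100000028000000CE
    # 32 MB segments: 000000010000002800000067

The 32MB result matches the .backup.gz name the master actually produced, while omnipitr waits for the 16MB-derived name.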

In the file omnipitr/lib/OmniPITR/Program/Backup/Slave.pm, omnipitr calculates it as:

my ( $part_1, $part_2, $min_xlog ) = ( $1, $2, $3 );
$part_2 =~ s/(.{1,6})\z//;
my $part_3 = $1;
$self->{ 'meta' }->{ 'xlog-min' } = $min_xlog;
my $expected_filename_suffix = sprintf '%08s%08s.%08s.backup', $part_1, $part_2, $part_3;

So the result is 0000000100000028000000CE, which then causes the timeout while waiting for a file matching $VAR1 = qr/(?-xism:\A[0-9A-F]{8}00000028000000CE.000369A8.backup.gz\z)/.

I added the line $part_2 = sprintf '%X', hex("$part_2")/2; before $expected_filename_suffix is set, and it works well.

Can you fix it? Thank you!

The relevant part of the pg_controldata output is:

Database block size:                  32768
Blocks per segment of large relation: 65536
WAL block size:                       65536
Bytes per WAL segment:                33554432
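
For reference, the segment size can be picked up programmatically from pg_controldata output rather than assumed; a small sketch (the pg_controldata and data-directory paths below are placeholders, not taken from this setup):

    # Sketch only: parse "Bytes per WAL segment" out of pg_controldata output.
    # The two paths below are hypothetical examples.
    use strict;
    use warnings;

    my $pg_controldata = '/usr/pgsql-9.4/bin/pg_controldata';
    my $pg_data_dir    = '/var/lib/pgsql/9.4/data';

    my $wal_segment_bytes;
    for my $line ( `$pg_controldata $pg_data_dir` ) {
        $wal_segment_bytes = $1 if $line =~ /^Bytes per WAL segment:\s+(\d+)/;
    }

    printf "WAL segment size: %d bytes (%d MB)\n",
        $wal_segment_bytes, $wal_segment_bytes / ( 1024 * 1024 );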

The diff of my change to omnipitr/lib/OmniPITR/Program/Backup/Slave.pm is:

*** Slave.pm    2015-01-30 12:10:33.000000000 +0800
--- Slave.pm.new        2015-01-30 12:09:21.427151573 +0800
***************
*** 373,378 ****
--- 373,379 ----

      my ( $part_1, $part_2, $min_xlog ) = ( $1, $2, $3 );
      $part_2 =~ s/(.{1,6})\z//;
+     $part_2 = sprintf '%X', hex("$part_2")/2;
      my $part_3 = $1;

      $self->{ 'meta' }->{ 'xlog-min' } = $min_xlog;
***************
*** 455,460 ****
--- 456,462 ----
      my ( $series, $offset ) = split m{/}, $location;

      $offset =~ s/.{0,6}$//;
+     $offset = sprintf '%X', hex("$offset")/2;

      my $location_filename = sprintf '%08s%08s%08s', $timeline, $series, $offset;
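
For what it's worth, a more general form of the same patch (only a sketch, untested against OmniPITR internals, and assuming the segment size is a power-of-two multiple of the 16MB default) would scale by the segment size reported by pg_controldata instead of hard-coding the division by 2, e.g. with $wal_segment_bytes obtained as in the sketch above:

    # Hypothetical generalisation: $part_2 has already had its last 6 hex
    # digits chopped off, i.e. it is offset / 16MB; dividing by
    # ($wal_segment_bytes / 16MB) turns that into the real segment number.
    my $scale = $wal_segment_bytes / ( 16 * 1024 * 1024 );
    $part_2 = sprintf '%X', int( hex( $part_2 ) / $scale );

The same scaling would apply in the second hunk, where $offset is trimmed before $location_filename is built.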

Sorry, but I don't think I'll "fix" it. The simple reason is that I'm very unsure about the potential ramifications of changing the xlog size. While you definitely had your reasons to pick such a setting, I don't think these are generally applicable.

And since you have to support your own fork/configuration/build of PostgreSQL anyway, I think it's only logical to modify OmniPITR to work with such a modified PostgreSQL on your end.

Given that the problem happens only with custom, modified builds of PostgreSQL, I don't think it's a good idea to patch OmniPITR to handle it.
Making a fork with this patch applied seems simple enough.