omniti-labs/omnipitr

Error: Cannot send segment

Closed this issue · 9 comments

I'm trying to get omnipitr-archive working to backup WAL to remote server, but keep getting a "Cannot send segment" error in the log. However, both stdout and stderr are empty, which is making it really hard to troubleshoot this problem. I'm using rsync over ssh and ssh connectivity is setup correctly. Directory permissions appear to be correct. Also tried it with a --dst-local path and it was giving similar errors. Simple archive_command from posgtresql docs works just fine. Any help would be appreciated. Thanks!

This is my archive_command

archive_command = ‘/home/postgres/omnipitr/bin/omnipitr-archive -v -dr rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/ -l /var/log/omnipitr/archive-^Y-^m-^d.log “%p”’

This is the error in the log file

2014-09-30 20:34:06.995645 -0500 : 3371 : omnipitr-archive : LOG : Called with parameters: -v -dr rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/ -l /var/log/omnipitr/archive-^Y-^m-^d.log pg_xlog/000000010000029400000033
2014-09-30 20:34:07.008578 -0500 : 3371 : omnipitr-archive : LOG : Sending ./pg_xlog/000000010000029400000033 to rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029400000033 ended in 0.003249s
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR : Cannot send segment ./pg_xlog/000000010000029400000033 to rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029400000033 : $VAR1 = {
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'is_backup' => undef,
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'stderr' => '',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'destination_file_path' => 'rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029400000033',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'status' => '2560',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'local_file' => './pg_xlog/000000010000029400000033',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'destination_type' => 'remote',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'ended' => '1412127247.00817',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'pid' => 3372,
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'dst_path' => 'rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'started' => '1412127247.00492',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'stdout' => '',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :           'command' => [
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :                          'rsync',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :                          './pg_xlog/000000010000029400000033',
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :                          'rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029400000033'
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :                        ]
2014-09-30 20:34:07.012623 -0500 : 3371 : omnipitr-archive : ERROR :         };
2014-09-30 20:34:07.019662 -0500 : 3371 : omnipitr-archive : FATAL : There are fatal errors. Dying.

So, it looks that the rsync didn't print anything (which is kinda strange), but luckily, we have "status".

Parsing it is unfortunately not so simple, so I'll skip explanation for now - long story short, it tells us that rsync exited with status code = 10.

In man rsync, I can see that exit value 10 means: "Error in socket I/O".

That's not much, but it's a start.

Try, as your postgres user, cd $PGDATA (i.e. change working directory to "data_directory" of Pg. And then run:

rsync -vv ./pg_xlog/000000010000029400000033 rsync://pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029400000033

That should provide more information.

Also - sorry for lag, was rather busy(ish) lately.

One thing that struck me - you're using rsync:// prefix - are you sure you have rsync daemon on the backup server? As opposed to rsync over ssh? For rsync over ssh, the correct format is host:/path. omnipitr-archive docs say:

Destination url/location should be in a format that is usable by rsync program.

and in man rsync it's probably not clear enough, but the general thing is that rsync:// is only when using plain rsync (i.e. not over ssh) - which is faster, and less cpu intensive, but usually also less secure. And definitely less common.

Thanks for getting back to me. You were right that rysnc:// is not the right syntax for rsync over ssh (gave me an error when I tried the rsync command you listed). So I tried the following command as postgres from the $PGDATA directory and it worked no problem

bash-4.1$ rsync -vv ./pg_xlog/000000010000029F00000059 pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029F00000059
opening connection using: ssh -l pgbackup 192.168.11.108 rsync --server -vve.Ls . /home/pgbackup/wal_archive/000000010000029F00000059 
delta-transmission enabled
000000010000029F00000059
total: matches=0  hash_hits=0  false_alarms=0 data=16777216

sent 16779349 bytes  received 31 bytes  11186253.33 bytes/sec
total size is 16777216  speedup is 1.00

Since, that worked I updated my archive command to the one below and tried again

archive_command = '/home/postgres/omnipitr/bin/omnipitr-archive -D /home/postgres/pgsql/data/ -v -dr pgbackup@192.168.11.108:/home/pgbackup/wal_archive/ -l /var/log/omnipitr/archive-^Y-^m-^d.log "%p"'

However, it still didn't work. Continued to get similar errors as before

2014-10-02 17:30:47.409032 -0500 : 17955 : omnipitr-archive : LOG : Called with parameters: -D /home/postgres/pgsql/data/ -v -dr pgbackup@192.168.11.108:/home/pgbackup/wal_archive/ -l /var/log/omnipitr/archive-^Y-^m-^d.log pg_xlog/000000010000029F00000059
2014-10-02 17:30:47.418312 -0500 : 17955 : omnipitr-archive : LOG : Sending /home/postgres/pgsql/data/pg_xlog/000000010000029F00000059 to pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029F00000059 ended in 0.003873s
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR : Cannot send segment /home/postgres/pgsql/data/pg_xlog/000000010000029F00000059 to pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029F00000059 : $VAR1 = {
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'is_backup' => undef,
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'stderr' => '',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'destination_file_path' => 'pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029F00000059',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'status' => '3584',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'local_file' => '/home/postgres/pgsql/data/pg_xlog/000000010000029F00000059',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'destination_type' => 'remote',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'ended' => '1412289047.41795',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'pid' => 17956,
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'dst_path' => 'pgbackup@192.168.11.108:/home/pgbackup/wal_archive/',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'started' => '1412289047.41407',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'stdout' => '',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :           'command' => [
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :                          'rsync',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :                          '/home/postgres/pgsql/data/pg_xlog/000000010000029F00000059',
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :                          'pgbackup@192.168.11.108:/home/pgbackup/wal_archive/000000010000029F00000059'
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :                        ]
2014-10-02 17:30:47.424686 -0500 : 17955 : omnipitr-archive : ERROR :         };
2014-10-02 17:30:47.431759 -0500 : 17955 : omnipitr-archive : FATAL : There are fatal errors. Dying.

What's interesting is that local destinations don't work either. For e.g. this archive command does not work

archive_command = '/home/postgres/omnipitr/bin/omnipitr-archive -D /home/postgres/pgsql/data/ -dl /backup/wal_archive/ -l /var/log/omnipitr/archive-^Y-^m-^d.log "%p"'

Gives essentially the same error as the remote destination

2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR : Cannot send segment /home/postgres/pgsql/data/pg_xlog/000000010000029F00000059 to /backup/wal_archive/000000010000029F00000059 : $VAR1 = {
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'is_backup' => undef,
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'stderr' => '',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'destination_file_path' => '/backup/wal_archive/000000010000029F00000059',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'status' => '3072',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'local_file' => '/home/postgres/pgsql/data/pg_xlog/000000010000029F00000059',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'destination_type' => 'local',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'ended' => '1412289980.0927',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'pid' => 18266,
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'dst_path' => '/backup/wal_archive/',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'started' => '1412289979.9891',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'stdout' => '',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :           'command' => [
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :                          'rsync',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :                          '/home/postgres/pgsql/data/pg_xlog/000000010000029F00000059',
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :                          '/backup/wal_archive/000000010000029F00000059'
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :                        ]
2014-10-02 17:46:20.092917 -0500 : 18265 : omnipitr-archive : ERROR :         };
2014-10-02 17:46:20.097237 -0500 : 18265 : omnipitr-archive : FATAL : There are fatal errors. Dying.

status 3584 means error core 14 (Error in IPC code), and status 3072 is error code 12 - Error in rsync protocol data stream.

I have no idea what's going on - never seen anything like this. It is clearly problem with rsync, but why - no idea.

It can be probably debugged using strace, but that's rather not simple tool.

Maybe you're running different rsync? I.e. in your shell, you have PATH set so that you use different rsync than what omnipitr-archive, called by pg, runs? Or perhaps it's some wrapper on rsync? Or some old version? Check all of this (version, path to binary, strace of postgres archiving process with fork following) and then we can talk further. Long story short is that I don't see here any problem with omnipitr itself, just something weird happening with rsync.

Thanks for your help. After debugging with strace I was able to narrow down the problem to an selinux issue with rsync. This article helped me resolve the problem http://danwalsh.livejournal.com/61646.html.

@khan8037 How were you able to resolve this? I'm running into this issue with a clean install of CentOS6.5. Checked that blog, but never really configured SELinux in this manner before.

I ended up changing the selinux label of /usr/bin/rsync to bin_t. I also changed the selinux boolean postgresql_can_rsync to true. The exact commands I used were

# semanage fcontext -a -t bin_t "/usr/bin/rsync"
# restorecon -R -v /usr/bin/rsync
# setsebool -P postgresql_can_rsync on

Thank you so much!
How in the world anyone without some pretty advanced sysadmin knowledge would ever figure this out from a base install is beyond me.

Yeah - what Keith said. Thanks for the info. not sure how common it is, but I would be 100% at loss if it would ever hit me.