EnterpriseDB/repmgr

"Unknown server" on initial setup

Closed this issue · 5 comments

lukos commented

I initially followed a (slightly out-of-date) post on Medium to set up repmgr, then went through the repmgr quick start to make sure I had followed all the instructions correctly, which I think I have. However, after running the standby clone and starting my standby PostgreSQL server, it logs the following in the postgres log, and I can't log in to the server with psql (it says that the database is starting up, but it never finishes):

ERROR: Unknown server 'postgres1'
ERROR: Remote 'barman get-wal' command has failed!
2022-10-20 16:00:39.535 UTC [14631] LOG:  started streaming WAL from primary at 1/B7000000 on timeline 1
2022-10-20 16:00:39.535 UTC [14631] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000001000000B7 has already been removed

I don't know whether the WAL error is directly related to the "Unknown server" error, so for now I am assuming that it is. The question is: why doesn't postgres like the server postgres1, which is defined in /etc/hosts and for which ping and nslookup both work fine? (By the way, it is also in the hosts file of the barman server.)

I have barman set up on another server, so I performed the clone via barman with sudo -u postgres repmgr -h postgres1 -U repmgr -d repmgr -p 5432 -F standby clone, and that all seemed to be fine. I also performed the dry run, and there were no errors or warnings that looked out of place.

Is it possible that the error is misleading and there is actually an auth issue or something? I haven't got as far as registering the standby with repmgr (which also didn't work previously), so I deleted the data directory and ran the clone again, but I am still stuck here.

Thanks

martinmarques commented

barman get-wal is executed on the barman server. Make sure that the postgres1 host is resolvable on the barman server.

lukos commented

Thanks @martinmarques. I've thought of something I might have confused. Although the server itself is called postgres1, its configuration on barman is currently named pg. I suspect that somewhere in my configuration I should be setting the host as pg and not postgres1.

To be honest, I should probably give the barman configuration the same name as the server so that I don't get confused.

Will try Monday.
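
For anyone hitting the same "Unknown server" error: the name that reaches barman get-wal is Barman's own server name, not necessarily the hostname. In repmgr.conf that name is set with the barman_server parameter. A minimal sketch for reading it back, assuming a simple `barman_server='...'` line with no trailing comment (the helper name `repmgr_barman_server` is made up for illustration):

```shell
# Hypothetical helper: print the barman_server value from a repmgr.conf-style file.
# Assumes a simple line such as:  barman_server='postgres1'  (no trailing comment).
repmgr_barman_server() {
  sed -n "s/^[[:space:]]*barman_server[[:space:]]*=[[:space:]]*//p" "$1" |
    tr -d "'\"" | head -n 1
}
```

Calling it with the path to repmgr.conf on the standby and comparing the result with the server name that barman check accepts on the Barman host should show whether the two agree.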

lukos commented

OK, so I've changed barman to use the same name as the server (postgres1), but now, after cloning the backups/logs to the replica, I get the following errors when I try to start postgres.

Note that 10.240.0.6 is the IP address of the replica server.

Barman1 seems to be running OK. barman check postgres1 is all green.

2022-10-24 10:11:33.977 UTC [4053] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-10-24 10:11:33.984 UTC [4053] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-24 10:11:33.992 UTC [4054] LOG:  database system was interrupted; last known up at 2022-10-24 09:06:23 UTC
2022-10-24 10:11:33.992 UTC [4054] LOG:  creating missing WAL directory "pg_wal/archive_status"
ERROR: WAL file '00000002.history' not found in server 'postgres1' (SSH host: 10.240.0.6)
ERROR: Remote 'barman get-wal' command has failed!
2022-10-24 10:11:34.775 UTC [4054] LOG:  entering standby mode
.2022-10-24 10:11:35.336 UTC [4054] LOG:  restored log file "00000001000000020000000E" from archive
2022-10-24 10:11:35.485 UTC [4054] FATAL:  recovery aborted because of insufficient parameter settings
2022-10-24 10:11:35.485 UTC [4054] DETAIL:  max_connections = 100 is a lower setting than on the primary server, where its value was 200.
2022-10-24 10:11:35.485 UTC [4054] HINT:  You can restart the server after making the necessary configuration changes.
2022-10-24 10:11:35.487 UTC [4053] LOG:  startup process (PID 4054) exited with exit code 1
2022-10-24 10:11:35.487 UTC [4053] LOG:  aborting startup due to startup process failure
2022-10-24 10:11:35.488 UTC [4053] LOG:  database system is shut down

lukos commented

I went back to this after a few days. The WAL errors had disappeared, perhaps because the logs had rolled since last time; I'm not sure. Anyway, reading the FATAL/DETAIL/HINT lines by themselves made the issue more obvious: max_connections on the standby (100) was lower than on the primary (200). I set max_connections to 200 and started postgres, and the log quickly said "ready to accept read-only connections".
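
A small sketch of that filtering step, in case it helps someone else: it drops the barman ERROR lines and the routine LOG lines so that only the FATAL/DETAIL/HINT group remains. The function name `pg_log_fatal_context` is just an illustration, and the log is read from stdin because the log file location differs between installations.

```shell
# Print only the FATAL, DETAIL and HINT lines of a PostgreSQL log read from stdin.
pg_log_fatal_context() {
  grep -E '(FATAL|DETAIL|HINT):'
}
```

Feeding it the startup log above would leave just the three lines about max_connections.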

I then ran sudo -u postgres repmgr standby -F register and it displayed the following messages (amongst others):

WARNING: node "postgres2" not found in "pg_stat_replication"
WARNING: local node not attached to primary node 1

It did then say it was successful, though, and when I select from repmgr.nodes on the primary server, it correctly shows:

 node_id | upstream_node_id | node_name |  type
---------+------------------+-----------+---------
       1 |                  | postgres1 | primary
       2 |                1 | postgres2 | standby
(2 rows)

So I am guessing it is all working. I still need to test that and work out how to have a DNS-based IP address so I can fail over without too much drama. My overall experience was that some more helpful messages or hints would be good, unless you think I screwed it up and it should have just worked!
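
One way to test whether the standby is really attached is to look at pg_stat_replication on the primary, which is the view the warning above mentions. This is only a sketch: it assumes the standby connects with application_name equal to its repmgr node name (postgres2 here), that psql is run on the primary as a user allowed to read the view, and the function name `standby_replication_state` is made up.

```shell
# Hypothetical helper, to be run on the primary.
# Prints the replication state for the given application_name,
# or nothing if no such standby is currently connected.
standby_replication_state() {
  psql -AtX -c "SELECT state FROM pg_stat_replication WHERE application_name = '$1'"
}
```

An empty result for `standby_replication_state postgres2` would be consistent with the "not found in pg_stat_replication" warning.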

lukos commented

I should close this now. There are a few things that do not work out of the box with repmgr. Also, I hadn't realised that if you set up repmgr on a newly created (i.e. test) cluster, you need to force the logs to roll and run a backup before you can create the replica.
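
For reference, a rough sketch of those two steps as they could be run on the Barman host. The wrapper function `barman_first_backup` is hypothetical; the two barman subcommands are Barman's own, and postgres1 is the server name from this thread.

```shell
# Hypothetical wrapper: force a WAL switch so Barman has archived WAL,
# then take a base backup of the given Barman server.
barman_first_backup() {
  server="$1"
  barman switch-wal --force --archive "$server" &&
    barman backup "$server"
}
```

It would be called as `barman_first_backup postgres1` by the user that normally runs barman, before running the standby clone.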

I am now much further on but feel like some of the errors could provide more hints!