Postgres 13 + repmgr 5.2.1 at standby clone stuck in pg_basebackup

Question

Postgres 13 + repmgr 5.2.1 at standby clone stuck in pg_basebackup

Closed this issue 3 years ago · 5 comments

Hi!

I have no idea what's wrong.

I execute sudo -u postgres repmgr -h pgsql1 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone -F --log-level DEBUG --verbose

and then repmgr ist stuck with following output:

NOTICE: using provided configuration file "/etc/repmgr.conf"
NOTICE: destination directory "/data/postgresql/13/main" provided
DEBUG: check_source_server()
INFO: connecting to source node
DETAIL: connection string is: host=pgsql1 user=repmgr dbname=repmgr
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_cluster_size():
SELECT pg_catalog.pg_size_pretty(pg_catalog.sum(pg_catalog.pg_database_size(oid))::bigint)       FROM pg_catalog.pg_database
DETAIL: current installation size is 31 MB
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_all_node_records():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached     FROM repmgr.nodes n ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: 1 node records returned by source node
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=pgsql1 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: get_primary_node_id():
SELECT node_id                   FROM repmgr.nodes     WHERE type = 'primary'    AND active IS TRUE
DEBUG: upstream_node_id determined as 1
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
DEBUG: upstream_user is "repmgr"
DEBUG: get_node_record_by_name():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_name = 'pgsql2'
DEBUG: get_node_record_by_name(): no record found for node "pgsql2"
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'wal_level' AND setting = 'replica'
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'archive_mode' AND setting != 'off'
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'archive_command' AND setting != ''
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'hot_standby' AND setting = 'on'
DEBUG: get_pg_setting():
  SELECT name, setting   FROM pg_catalog.pg_settings WHERE name = 'max_wal_senders'
DEBUG: get_pg_setting_int(): returned value is "10"
NOTICE: checking for available walsenders on the source node (2 required)
DEBUG: get_node_replication_stats():
 SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers,         current_setting('max_replication_slots')::INT AS max_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots,
(SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical')  AS active_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots,         pg_catalog.pg_is_in_recovery() AS in_recovery
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: sufficient replication connections could be made to the source server (2 required)
DEBUG: get_pg_setting():
  SELECT name, setting   FROM pg_catalog.pg_settings WHERE name = 'data_checksums'
DEBUG: get_pg_setting(): returned value is "FALSE"
DEBUG: get_pg_setting():
  SELECT name, setting   FROM pg_catalog.pg_settings WHERE name = 'wal_log_hints'
DEBUG: get_pg_setting(): returned value is "FALSE"
WARNING: data checksums are not enabled and "wal_log_hints" is "off"
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
DEBUG: parsing upstream conninfo string "host=pgsql1 user=repmgr dbname=repmgr connect_timeout=2"
WARNING: directory "/data/postgresql/13/main" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/data/postgresql/13/main"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  pg_basebackup -l "repmgr base backup"  -D /data/postgresql/13/main -h pgsql1 -p 5432 -U repmgr -X stream

Any idea what's wrong?

Answer 1 · 2021-06-15T07:15:08.000Z

I have no idea what's wrong.

I execute sudo -u postgres repmgr -h pgsql1 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone -F --log-level DEBUG --verbose

and then repmgr ist stuck with following output:

(...)
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing: pg_basebackup -l "repmgr base backup" -D /data/postgresql/13/main -h pgsql1 -p 5432 -U repmgr -X stream

Any idea what's wrong?

Quite possibly pg_basebackup is waiting for a checkpoint to occur. See:

https://www.postgresql.org/docs/current/app-pgbasebackup.html#id-1.9.4.9.8

As you are on PostgreSQL 13, SELECT * FROM pg_stat_progress_basebackup;
might give some clues as to what's happening.

See also:

Regards

Ian Barwick

Answer 2 · 2021-06-15T07:19:50.000Z

Thanks @ibarwick!

I get the following output from stat progress:

-[ RECORD 1 ]--------+-------------------------
pid                  | 16028
phase                | streaming database files
backup_total         | 33048064
backup_streamed      | 391168
tablespaces_total    | 1
tablespaces_streamed | 0

pg_basebackup: Basissicherung eingeleitet, warte auf Abschluss des Checkpoints
pg_basebackup: Checkpoint abgeschlossen
pg_basebackup: Write-Ahead-Log-Startpunkt: 0/26000028 auf Zeitleiste 1
pg_basebackup: Hintergrund-WAL-Receiver wird gestartet
pg_basebackup: temporärer Replikations-Slot »pg_basebackup_16107« wurde erzeugt
    0/32273 kB (0%), 0/1 Tablespace (.../postgresql/13/main/backup_label)

postgres=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 16175
usesysid         | 16384
usename          | repmgr
application_name | pg_basebackup
client_addr      | 10.10.10.42
client_hostname  |
client_port      | 43218
backend_start    | 2021-06-15 09:25:08.613556+02
backend_xmin     |
state            | backup
sent_lsn         |
write_lsn        |
flush_lsn        |
replay_lsn       |
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       |
-[ RECORD 2 ]----+------------------------------
pid              | 16178
usesysid         | 16384
usename          | repmgr
application_name | pg_basebackup
client_addr      | 10.10.10.42
client_hostname  |
client_port      | 43220
backend_start    | 2021-06-15 09:25:08.78911+02
backend_xmin     |
state            | streaming
sent_lsn         | 0/280000D8
write_lsn        | 0/28000000
flush_lsn        | 0/28000000
replay_lsn       |
write_lag        | 00:00:00.002866
flush_lag        | 00:00:00.002866
replay_lag       | 00:00:00.002866
sync_priority    | 1
sync_state       | sync
reply_time       | 2021-06-15 09:25:08.854049+02

It's stuck in this step! My database is newly created so there is nearly no data to transfer...

Any ideas?

Best Ragards!
David

Answer 3 · 2021-06-15T07:28:13.000Z

From that, pg_basebackup has transferred ca. 400Kb of 33Mb, maybe there is some kind of connectivity issue between primary and standby? Is the backup_streamed number increasing? If not, it might be worth interrupting the process and running the pg_basebackup command manually with the --progress option, e.g.:

pg_basebackup -l "repmgr base backup"  -D /data/postgresql/13/main -h pgsql1 -p 5432 -U repmgr -X stream --progress

to see if that provides any clues (this is all outside repmgr's control). Delete the contents of /data/postgresql/13/main before doing that.

Answer 4 · 2021-06-15T07:31:54.000Z

Hi,

i can not find any connectivity issues. Ping works in both direction without problem! Also psql client works fine.

Output above is from doing the command manually...

I really have no idea whats going wrong :-(

Answer 5 · 2021-06-15T07:38:30.000Z

Hi @ibarwick!

Sorry for wasting your time! Actually i found the issue. MTU Size on the interfaces was to high. Seems some packets got damaged!

Strange that there is no logging or anything like this from postgres side..

Thanks for your help!

Best Regards from austria!

David