The cluster stops functioning when, after a failed unit has been replaced, another unit fails
Steps to reproduce
- prepare a MAAS provider
- deploy a 3-node cluster
juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
- take down one unit by simulating a hardware failure
- remove the failed unit from the model
- add another unit to have 3 nodes again
- take down another unit by simulating a hardware failure
Expected behavior
The cluster keeps working, since two of the three nodes are still alive and raft quorum (2 of 3) is satisfied.
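The quorum expectation above follows from the usual raft majority rule. A minimal sketch (the function name is illustrative, not from the charm or pySyncObj):

```python
def raft_quorum(members: int) -> int:
    """Minimum number of live nodes needed for a raft majority."""
    return members // 2 + 1

# With a 3-node cluster, two live nodes are enough to keep a majority,
# so losing a single node should not stop the cluster.
assert raft_quorum(3) == 2
```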
Actual behavior
The cluster is not operational.
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-06 06:12:20,353 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:12:25,425 - INFO - waiting on raft
2024-08-06 06:12:30,425 - INFO - waiting on raft
2024-08-06 06:12:35,426 - INFO - waiting on raft
2024-08-06 06:12:40,426 - INFO - waiting on raft
^C
Aborted!
The charm reports postgresql/0 as the primary, but that is no longer true: no postgresql process is running on postgresql/0.
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 06:17:39Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2/3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/3 unknown lost 3 192.168.151.115 5432/tcp agent lost, see 'juju show-status-log postgresql/3'
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
3 down 192.168.151.115 machine-4 ubuntu@22.04 default Deployed
$ juju run postgresql/leader get-primary
Running operation 7 with 1 task
- task 8 on unit-postgresql-0
Waiting for task 8...
primary: postgresql/0
$ juju ssh postgresql/0 -- pgrep -af postgres
7638 /snap/charmed-postgresql/115/usr/bin/prometheus-postgres-exporter
36668 python3 /snap/charmed-postgresql/115/usr/bin/patroni /var/snap/charmed-postgresql/115/etc/patroni/patroni.yaml
36920 /usr/bin/python3 src/cluster_topology_observer.py http://192.168.151.112:8008 True /usr/bin/juju-exec postgresql/0 /var/lib/juju/agents/unit-postgresql-0/charm
37263 snap restart charmed-postgresql.patroni
37272 systemctl stop snap.charmed-postgresql.patroni.service
Connection to 192.168.151.112 closed.
^^^ no postgresql process.
initial status
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:40:11Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/2 active idle 2 192.168.151.114 5432/tcp
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
2 started 192.168.151.114 machine-4 ubuntu@22.04 default Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0 | 192.168.151.112 | Leader | running | 1 | |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming | 1 | 0 |
| + postgresql-2 | 192.168.151.114 | Replica | streaming | 1 | 0 |
+----------------+-----------------+--------------+-----------+----+-----------+
after taking down postgresql-2 (non leader)
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:47:59Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2/3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/2 unknown lost 2 192.168.151.114 5432/tcp agent lost, see 'juju show-status-log postgresql/2'
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
2 down 192.168.151.114 machine-4 ubuntu@22.04 default Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0 | 192.168.151.112 | Leader | running | 1 | |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming | 1 | 0 |
+----------------+-----------------+--------------+-----------+----+-----------+
-> expected status
cleaning up postgresql/2 from the model
remove-machine --force was used instead of remove-unit, since the machine/unit agent no longer responds after the hardware failure.
$ juju remove-machine --force 2
WARNING This command will perform the following actions:
will remove machine 2
- will remove unit postgresql/2
- will remove storage pgdata/2
Continue [y/N]? y
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:49:43Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
adding another unit (postgresql/3) to restore the 3-node cluster
$ juju add-unit postgresql
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:58:54Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/3 active idle 3 192.168.151.115 5432/tcp
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
3 started 192.168.151.115 machine-4 ubuntu@22.04 default Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0 | 192.168.151.112 | Leader | running | 1 | |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming | 1 | 0 |
| + postgresql-3 | 192.168.151.115 | Replica | streaming | 1 | 0 |
+----------------+-----------------+--------------+-----------+----+-----------+
-> expected status
taking down postgresql/3
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 06:18:48Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2/3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/3 unknown lost 3 192.168.151.115 5432/tcp agent lost, see 'juju show-status-log postgresql/3'
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
3 down 192.168.151.115 machine-4 ubuntu@22.04 default Deployed
The cluster should still work at this point, since two live nodes of the three-node cluster remain. However, no Patroni operation is possible anymore.
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
2024-08-06 06:24:29,290 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:24:34,357 - INFO - waiting on raft
2024-08-06 06:24:39,358 - INFO - waiting on raft
2024-08-06 06:24:44,358 - INFO - waiting on raft
2024-08-06 06:24:49,359 - INFO - waiting on raft
2024-08-06 06:24:54,359 - INFO - waiting on raft
2024-08-06 06:24:59,359 - INFO - waiting on raft
2024-08-06 06:25:04,360 - INFO - waiting on raft
^C
Aborted!
[/var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml]
raft:
data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
self_addr: '192.168.151.113:2222'
partner_addrs:
- 192.168.151.115:2222
- 192.168.151.112:2222
The raft configuration in patroni.yaml looks correct, though: self_addr plus the two partner_addrs cover all three units.
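As a quick sanity check (a hypothetical sketch, not part of the charm), the addresses from the raft section above can be collected into a set; it does cover all three units, supporting the observation that the configuration itself looks fine:

```python
# Raft section of patroni.yaml on postgresql/1, copied from the report.
raft = {
    "self_addr": "192.168.151.113:2222",
    "partner_addrs": ["192.168.151.115:2222", "192.168.151.112:2222"],
}

# The node itself plus its partners should enumerate every cluster member.
members = {raft["self_addr"], *raft["partner_addrs"]}
assert len(members) == 3  # all three units present, majority of 2 reachable
```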
Versions
Operating system: jammy
Juju CLI: 3.5.3-genericlinux-amd64
Juju agent: 3.5.3
Charm revision: 14/stable 429
LXD: N/A
Log output
Juju debug log:
postgresql_replacing_failed_nodes_debug.log
Additional context
This involves the same pySyncObj raft library as described in #571 (comment)
Duplicate of #418, we are trying to fix this in https://warthogs.atlassian.net/browse/DPE-3684