The cluster stops functioning when, after a failed unit has been replaced, another unit fails
Steps to reproduce
- prepare a MAAS provider
- deploy a 3-node cluster
juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
- take down one unit by simulating a hardware failure
- remove the failed unit from the model
- add another unit to have 3 nodes again
- take down another unit by simulating a hardware failure
Expected behavior
The cluster keeps working, since two of the three nodes are still alive and raft quorum (2 of 3) is satisfied.
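The quorum expectation above follows from the usual raft majority rule. A minimal sketch (the function name is illustrative, not from the charm or pySyncObj):

```python
def raft_quorum(members: int) -> int:
    """Minimum number of live nodes needed for a raft majority."""
    return members // 2 + 1

# With a 3-node cluster, two live nodes are enough to keep a majority,
# so losing a single node should not stop the cluster.
assert raft_quorum(3) == 2
```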
Actual behavior
The cluster is not operational.
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-06 06:12:20,353 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:12:25,425 - INFO - waiting on raft
2024-08-06 06:12:30,425 - INFO - waiting on raft
2024-08-06 06:12:35,426 - INFO - waiting on raft
2024-08-06 06:12:40,426 - INFO - waiting on raft
^C
Aborted!
The charm reports postgresql/0 as the primary, but that is no longer true: no postgresql process is running on postgresql/0.
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 06:17:39Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2/3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/3 unknown lost 3 192.168.151.115 5432/tcp agent lost, see 'juju show-status-log postgresql/3'
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
3 down 192.168.151.115 machine-4 ubuntu@22.04 default Deployed
$ juju run postgresql/leader get-primary
Running operation 7 with 1 task
- task 8 on unit-postgresql-0
Waiting for task 8...
primary: postgresql/0
$ juju ssh postgresql/0 -- pgrep -af postgres
7638 /snap/charmed-postgresql/115/usr/bin/prometheus-postgres-exporter
36668 python3 /snap/charmed-postgresql/115/usr/bin/patroni /var/snap/charmed-postgresql/115/etc/patroni/patroni.yaml
36920 /usr/bin/python3 src/cluster_topology_observer.py http://192.168.151.112:8008 True /usr/bin/juju-exec postgresql/0 /var/lib/juju/agents/unit-postgresql-0/charm
37263 snap restart charmed-postgresql.patroni
37272 systemctl stop snap.charmed-postgresql.patroni.service
Connection to 192.168.151.112 closed.
^^^ no postgresql process.
initial status
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:40:11Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/2 active idle 2 192.168.151.114 5432/tcp
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
2 started 192.168.151.114 machine-4 ubuntu@22.04 default Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0 | 192.168.151.112 | Leader | running | 1 | |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming | 1 | 0 |
| + postgresql-2 | 192.168.151.114 | Replica | streaming | 1 | 0 |
+----------------+-----------------+--------------+-----------+----+-----------+
after taking down postgresql-2 (non leader)
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:47:59Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2/3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/2 unknown lost 2 192.168.151.114 5432/tcp agent lost, see 'juju show-status-log postgresql/2'
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
2 down 192.168.151.114 machine-4 ubuntu@22.04 default Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0 | 192.168.151.112 | Leader | running | 1 | |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming | 1 | 0 |
+----------------+-----------------+--------------+-----------+----+-----------+
-> expected status
cleaning up postgresql/2 from the model
remove-machine --force was used instead of remove-unit, since the machine/unit agent no longer responds after the hardware failure.
$ juju remove-machine --force 2
WARNING This command will perform the following actions:
will remove machine 2
- will remove unit postgresql/2
- will remove storage pgdata/2
Continue [y/N]? y
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:49:43Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
adding another unit (postgresql/3) to restore the 3-node cluster
$ juju add-unit postgresql
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 05:58:54Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/3 active idle 3 192.168.151.115 5432/tcp
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
3 started 192.168.151.115 machine-4 ubuntu@22.04 default Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0 | 192.168.151.112 | Leader | running | 1 | |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming | 1 | 0 |
| + postgresql-3 | 192.168.151.115 | Replica | streaming | 1 | 0 |
+----------------+-----------------+--------------+-----------+----+-----------+
-> expected status
taking down postgresql/3
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
postgres maas-controller maas/default 3.5.3 unsupported 06:18:48Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.11 active 2/3 postgresql 14/stable 429 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0* active idle 0 192.168.151.112 5432/tcp Primary
postgresql/1 active idle 1 192.168.151.113 5432/tcp
postgresql/3 unknown lost 3 192.168.151.115 5432/tcp agent lost, see 'juju show-status-log postgresql/3'
Machine State Address Inst id Base AZ Message
0 started 192.168.151.112 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.113 machine-3 ubuntu@22.04 default Deployed
3 down 192.168.151.115 machine-4 ubuntu@22.04 default Deployed
The cluster should still work at this point, since two live nodes of the three-node cluster remain. However, no Patroni operation is possible anymore.
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
2024-08-06 06:24:29,290 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:24:34,357 - INFO - waiting on raft
2024-08-06 06:24:39,358 - INFO - waiting on raft
2024-08-06 06:24:44,358 - INFO - waiting on raft
2024-08-06 06:24:49,359 - INFO - waiting on raft
2024-08-06 06:24:54,359 - INFO - waiting on raft
2024-08-06 06:24:59,359 - INFO - waiting on raft
2024-08-06 06:25:04,360 - INFO - waiting on raft
^C
Aborted!
[/var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml]
raft:
data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
self_addr: '192.168.151.113:2222'
partner_addrs:
- 192.168.151.115:2222
- 192.168.151.112:2222
The raft configuration in patroni.yaml looks correct, though: self_addr plus the two partner_addrs cover all three units.
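As a quick sanity check (a hypothetical sketch, not part of the charm), the addresses from the raft section above can be collected into a set; it does cover all three units, supporting the observation that the configuration itself looks fine:

```python
# Raft section of patroni.yaml on postgresql/1, copied from the report.
raft = {
    "self_addr": "192.168.151.113:2222",
    "partner_addrs": ["192.168.151.115:2222", "192.168.151.112:2222"],
}

# The node itself plus its partners should enumerate every cluster member.
members = {raft["self_addr"], *raft["partner_addrs"]}
assert len(members) == 3  # all three units present, majority of 2 reachable
```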
Versions
Operating system: jammy
Juju CLI: 3.5.3-genericlinux-amd64
Juju agent: 3.5.3
Charm revision: 14/stable 429
LXD: N/A
Log output
Juju debug log:
postgresql_replacing_failed_nodes_debug.log
Additional context
This involves the same pySyncObj raft library as described in #571 (comment)
Duplicate of #418, we are trying to fix this in https://warthogs.atlassian.net/browse/DPE-3684