postgresml/pgcat

Automatic discovery of Primary and Replica nodes

Opened this issue ยท 12 comments

Is your feature request related to a problem? Please describe.
When a primary node failover on PostgreSQL cluster specifically Patroni in my case, pgcat needs to be configured again to correctly route queries.

[pools.simple_db]
pool_mode = "transaction"
default_role = "any"
query_parser_enabled = true
primary_reads_enabled = false
sharding_function = "pg_bigint_hash"

[pools.simple_db.users.0]
username = "dba"
password = "password"
pool_size = 100
statement_timeout = 0

[pools.simple_db.shards.0]
servers = [
    [ "10.85.246.169", 5432, "replica"],
    [ "10.85.241.153", 5432, "primary"], # this node failover and another node becomes primary
    [ "10.85.245.26", 5432, "replica"]
]
database = "simple_db"

When this happens, it bans that node correctly, but it can not detect another node that becomes primary on its own. As a result of this, write queries get errors.

[2022-11-10T12:59:00.125226Z ERROR pgcat::pool] Banning instance Address { id: 1, host: "10.85.241.153", port: 5432, shard: 0, database: "simple_db", role: Primary, replica_number: 1, address_index: 1, username: "dba", pool_name: "simple_db" }, error: TimedOut
[2022-11-10T12:59:00.125272Z ERROR pgcat::pool] Banning Address { id: 1, host: "10.85.241.153", port: 5432, shard: 0, database: "simple_db", role: Primary, replica_number: 1, address_index: 1, username: "dba", pool_name: "simple_db" }
[2022-11-10T12:59:00.125340Z ERROR pgcat::client] Could not get connection from pool: { pool_name: "simple_db", username: "dba", shard: 0, role: "Some(Primary)", error: "AllServersDown" }

Describe the solution you'd like
I think this behaviour can be improved by checking nodes with a query like this to see if they are primary or replica.

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

If this query returns true pgcat can mark that node replica, if it's false then it knows that's a primary node.

Describe alternatives you've considered
Another alternative can be creating a test table when pgcat starts, and it can run INSERT/DELETE/UPDATE on that table to see if it runs successfully.

What do you think? Is this a viable solution to this problem?

levkk commented

I agree, I was thinking of implementing the same thing. Another approach could be to patch Patroni to update PgCat's config when something changes in the cluster.

Generally I think config-driven routing is better than probing nodes for information because configs are changed and verified by some system who's entire job is to make sure configs are correct. Meanwhile, someone can just remove standby.signal by accident from a replica and promote it, and we'll have a split-brain situation or a config validation failure because PgCat does not accept multiple primaries in one shard.

I think overall though, in most situations, this feature is likely to do more good than harm, so we should implement it but put it behind a config, e.g. automatic_role_discovery = true and set it to false by default for the time being until we understand all corner cases better.

hi,
mark "primary" is write-able with read
, and 'replica' for read only ?

hi, mark "primary" is write-able with read , and 'replica' for read only ?

Yes, primary is a leader node with both reads and writes. Replica is for read only.

Patroni definitely does a lot of great stuff; however, would it be within scope to just replace Patroni with something like this? The primary mechanism with pgcat seems to be a bit more narrow in scope. It doesn't appear as though the primary can be dynamically changed in this system, instead such would require a manual โ€” or at least an externally driven โ€” reconfig.

Thoughts?

levkk commented

At the moment, that's true, but we can make it dynamic. Patroni does more things to the actual DBs, like reconfiguration after failover, so pgcat can't replace Patroni yet, and might be out of scope at the moment.

There is benefit to having a unified tool that handles Postgres production deployments as a whole, in the future.

Got the same issue with pg-auto-failover
select pg_is_in_recovery(); => t on primary
select pg_is_in_recovery(); => f on secondaries

rim99 commented

Patroni uses confd for proxy config patching, https://github.com/zalando/patroni/tree/master/extras#confd
I think pgcat can maintain a set of configs of its own

Got the same issue with pg-auto-failover
select pg_is_in_recovery(); => t on primary
select pg_is_in_recovery(); => f on secondaries

I'm also using pg-auto-failover, so it would be great to make it easier to mark the new primary automatically, and then at least offer some notification capability.

pg_autofailover here too

Hi @levkk
Consider I have postgres running on A machine and I have B which is a replica. Now I want to do failover from A to B without downtime or data loss, is it possible to use pgCat or other tools?
Thanks

can someone found any solution for that issue? I'm getting same issue.

This would be a great feature to have. We are not using Patroni but this feature would be perfect for our use case.