postgresml/pgcat

Custom health checks to automatically ban delayed replicas

Closed this issue · 2 comments

We use pgcat to front a set of read replicas. If one has significant replication delay, it leads to degraded user experience as queries hitting it have stale data. While we can manually ban and unban servers in this state, it would be nice if pgcat handled it for us.

I could see adding a specific control for maximum allowed replication delay, however that seems overly-specific. A custom health check query and health check period would be more general. This query would be expected to return a single boolean as its result.

This would use the existing ban mechanism, and so the ban list is cleared as usual if all replicas are failing the custom health check, ie, are overly behind the primary.

Would the maintainers accept a patch adding such a feature? If so, which of the following approaches is most palatable? (Config names can be bikeshedded independently! 🙂)

  1. add a healthcheck_query that allows specifying a custom query to use instead of ;
  2. custom_healthceck_query, custom_healthcheck_timeout and possibly other custom_ equivalents of the healtcheck_* options; existing ; query healthchecks are left unchanged
  3. maximum_replication_delay, replication_delay_check_interval to handle just this specific use case

My personal leaning is towards (2) for generality, and since the ; query is useful as an especially cheap way to determine if the socket is still connected and can run more frequently than the custom query.

Hey,

This is an interesting idea and I did encounter use cases where I needed this. However, I feel like it would be somewhat difficult to make this generic to cover all use cases and will add a quite a bit of complexity to PgCat.

I am wondering if you could instead have a side process live within PgCat's container that reads the config file and as such can talk to all the instances that PgCat talks to and can use the BAN/UNBAN commands to control routing decision.

That side process could be written in any language and can be customized in whichever ways you want. For example, you can make complex decisions like checking if only one replica is lagging or more than one and make decisions based on that (e.g instead of banning all replicas that are lagging, ban the worst two).

@drdrsh thanks for the quick response. A side process calling BAN/UNBAN was what I meant by "manually" in the initial message. I wanted to get a sense for whether you thought this was a general enough issue to make sense to address inside pgcat.

It sounds like that's not the case, so I'm going to close this :-)