Custom health checks to automatically ban delayed replicas
Closed this issue · 2 comments
We use pgcat to front a set of read replicas. If one has significant replication delay, it leads to degraded user experience as queries hitting it have stale data. While we can manually ban and unban servers in this state, it would be nice if pgcat handled it for us.
I could see adding a specific control for maximum allowed replication delay, however that seems overly-specific. A custom health check query and health check period would be more general. This query would be expected to return a single boolean
as its result.
This would use the existing ban mechanism, and so the ban list is cleared as usual if all replicas are failing the custom health check, ie, are overly behind the primary.
Would the maintainers accept a patch adding such a feature? If so, which of the following approaches is most palatable? (Config names can be bikeshedded independently! 🙂)
- add a
healthcheck_query
that allows specifying a custom query to use instead of;
custom_healthceck_query
,custom_healthcheck_timeout
and possibly othercustom_
equivalents of thehealtcheck_*
options; existing;
query healthchecks are left unchangedmaximum_replication_delay
,replication_delay_check_interval
to handle just this specific use case
My personal leaning is towards (2) for generality, and since the ;
query is useful as an especially cheap way to determine if the socket is still connected and can run more frequently than the custom query.
Hey,
This is an interesting idea and I did encounter use cases where I needed this. However, I feel like it would be somewhat difficult to make this generic to cover all use cases and will add a quite a bit of complexity to PgCat.
I am wondering if you could instead have a side process live within PgCat's container that reads the config file and as such can talk to all the instances that PgCat talks to and can use the BAN/UNBAN
commands to control routing decision.
That side process could be written in any language and can be customized in whichever ways you want. For example, you can make complex decisions like checking if only one replica is lagging or more than one and make decisions based on that (e.g instead of banning all replicas that are lagging, ban the worst two).
@drdrsh thanks for the quick response. A side process calling BAN
/UNBAN
was what I meant by "manually" in the initial message. I wanted to get a sense for whether you thought this was a general enough issue to make sense to address inside pgcat.
It sounds like that's not the case, so I'm going to close this :-)