sysown/proxysql

Shunned writer node not becoming online

Samuele-Z opened this issue · 8 comments

I configured a MariaDB/Galera cluster with ProxySQL 2.4.8 on RockyLinux 9.3 following these two tutorials:
https://proxysql.com/blog/effortlessly-scaling-out-galera-cluster-with-proxysql/ (only the part "Configuring the Galera Cluster") and
https://severalnines.com/blog/how-run-and-configure-proxysql-20-mysql-galera-cluster-docker/ which is similar and more complete.

I need only 1 writer node at a time; the other 2 are only for backup.

It works well:

Admin> select hostgroup,srv_host,status,ConnUsed,MaxConnUsed,Queries,Latency_us from stats.stats_mysql_connection_pool;
+-----------+---------------+---------+----------+-------------+---------+------------+
| hostgroup | srv_host      | status  | ConnUsed | MaxConnUsed | Queries | Latency_us |
+-----------+---------------+---------+----------+-------------+---------+------------+
| 0         | 10.200.20.106 | ONLINE  | 1        | 1           | 8       | 155        |
| 0         | 10.200.20.104 | SHUNNED | 0        | 0           | 0       | 88         |
| 0         | 10.200.20.105 | SHUNNED | 0        | 1           | 5       | 208        |
| 1         | 10.200.20.106 | ONLINE  | 0        | 0           | 0       | 155        |
| 1         | 10.200.20.104 | ONLINE  | 0        | 1           | 13      | 88         |
| 1         | 10.200.20.105 | ONLINE  | 0        | 0           | 0       | 208        |
| 2         | 10.200.20.104 | ONLINE  | 0        | 0           | 0       | 88         |
| 2         | 10.200.20.105 | ONLINE  | 0        | 0           | 0       | 208        |
+-----------+---------------+---------+----------+-------------+---------+------------+

until I shut down the active node. It is correctly marked SHUNNED:

Admin> select hostgroup,srv_host,status,ConnUsed,MaxConnUsed,Queries,Latency_us from stats.stats_mysql_connection_pool;
+-----------+---------------+---------+----------+-------------+---------+------------+
| hostgroup | srv_host      | status  | ConnUsed | MaxConnUsed | Queries | Latency_us |
+-----------+---------------+---------+----------+-------------+---------+------------+
| 0         | 10.200.20.106 | SHUNNED | 0        | 1           | 8       | 213        |
| 0         | 10.200.20.104 | SHUNNED | 0        | 0           | 0       | 145        |
| 0         | 10.200.20.105 | SHUNNED | 0        | 1           | 5       | 211        |
| 1         | 10.200.20.106 | SHUNNED | 0        | 0           | 0       | 213        |
| 1         | 10.200.20.104 | ONLINE  | 0        | 1           | 13      | 145        |
| 1         | 10.200.20.105 | ONLINE  | 0        | 0           | 0       | 211        |
| 2         | 10.200.20.104 | ONLINE  | 0        | 0           | 0       | 145        |
| 2         | 10.200.20.105 | ONLINE  | 0        | 0           | 0       | 211        |
+-----------+---------------+---------+----------+-------------+---------+------------+

but none of the other nodes becomes ONLINE.
The situation remains the same (for hours) until I run this query:

Admin> SELECT * FROM runtime_mysql_servers;
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 0            | 10.200.20.104 | 3307 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 0            | 10.200.20.105 | 3307 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 0            | 10.200.20.106 | 3307 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.104 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.105 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.106 | 3307 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | 10.200.20.104 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | 10.200.20.105 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

The first time, the output is the same as before, but then the node becomes ONLINE; if I run the same query again immediately, I can see it:

Admin> SELECT * FROM runtime_mysql_servers;
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 0            | 10.200.20.104 | 3307 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 0            | 10.200.20.105 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 0            | 10.200.20.106 | 3307 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.104 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.105 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.106 | 3307 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | 10.200.20.104 | 3307 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

It seems that executing this SELECT triggers a status update inside ProxySQL.
I tried this several times, with the same results.

At the moment the workaround I found is to schedule a cron job that executes this query, but I would like a cleaner fix.
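
For reference, the same workaround could live inside ProxySQL itself instead of cron: the admin interface has a scheduler table that runs an external script at a fixed interval. This is only a sketch; the script path is hypothetical and would simply issue the SELECT against the admin port.

-- Hypothetical script that runs "SELECT * FROM runtime_mysql_servers;"
-- against the admin interface every 10 seconds; the filename is an assumption.
INSERT INTO scheduler (active, interval_ms, filename)
VALUES (1, 10000, '/usr/local/bin/poke_runtime_servers.sh');
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;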

Thanks.

Hi @Samuele-Z.

This is expected behavior.
ProxySQL will not bring the server ONLINE if it has no reason to use that server.
In other words, if you use ProxySQL, it will bring the server back ONLINE; if you don't use ProxySQL, the server will stay SHUNNED.

Wait a moment.
I have a client application running queries against the database server through ProxySQL. These queries return errors until I connect to the ProxySQL admin console and execute the SELECT on runtime_mysql_servers. If I run any other query, nothing happens; only querying that specific table, runtime_mysql_servers, triggers the status update.

Can you please provide a reproducible test case?
What you describe shouldn't happen.
A client using ProxySQL will bring the status back to ONLINE.

And as usual: the full error log is very important. Thanks.

I have a Galera cluster of 3 nodes. On each node I installed ProxySQL in a cluster configuration, following the tutorials in my first post. I want only one node active at a time, handling all writes and all reads; there is no need to split queries.
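
For reference, a minimal sketch of that single-writer setup as the tutorials describe it, using the hostgroup IDs that appear later in this thread (0 = writer, 2 = backup writers, 1 = readers, 4 = offline); max_writers=1 keeps a single active writer at a time:

INSERT INTO mysql_galera_hostgroups
    (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup,
     offline_hostgroup, active, max_writers, writer_is_also_reader,
     max_transactions_behind)
VALUES (0, 2, 1, 4, 1, 1, 1, 100);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;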

I attach the full log here (I omitted only some repeated checks from the load balancer). Start reading at 12:21 (the earlier lines are there if you need them).

Log.txt

Hi @Samuele-Z,

Could you please share the following configuration?

SELECT * FROM global_variables WHERE variable_name LIKE '%monitor_%' AND variable_name NOT IN ('mysql-monitor_username', 'mysql-monitor_password');
SELECT * FROM mysql_servers;
SELECT * FROM mysql_galera_hostgroups;

This should give us further info for the investigation.

Thanks, Javier.

Hi, here are the results:

Admin> SELECT * FROM global_variables WHERE variable_name LIKE '%monitor_%' AND variable_name NOT IN ('mysql-monitor_username', 'mysql-monitor_password');
+----------------------------------------------------------------------+----------------+
| variable_name                                                        | variable_value |
+----------------------------------------------------------------------+----------------+
| mysql-monitor_enabled                                                | true           |
| mysql-monitor_connect_timeout                                        | 600            |
| mysql-monitor_ping_max_failures                                      | 3              |
| mysql-monitor_ping_timeout                                           | 1000           |
| mysql-monitor_read_only_max_timeout_count                            | 3              |
| mysql-monitor_replication_lag_group_by_host                          | false          |
| mysql-monitor_replication_lag_interval                               | 10000          |
| mysql-monitor_replication_lag_timeout                                | 1000           |
| mysql-monitor_replication_lag_count                                  | 1              |
| mysql-monitor_groupreplication_healthcheck_interval                  | 5000           |
| mysql-monitor_groupreplication_healthcheck_timeout                   | 800            |
| mysql-monitor_groupreplication_healthcheck_max_timeout_count         | 3              |
| mysql-monitor_groupreplication_max_transactions_behind_count         | 3              |
| mysql-monitor_groupreplication_max_transactions_behind_for_read_only | 1              |
| mysql-monitor_galera_healthcheck_interval                            | 5000           |
| mysql-monitor_galera_healthcheck_timeout                             | 800            |
| mysql-monitor_galera_healthcheck_max_timeout_count                   | 3              |
| mysql-monitor_replication_lag_use_percona_heartbeat                  |                |
| mysql-monitor_query_interval                                         | 60000          |
| mysql-monitor_query_timeout                                          | 100            |
| mysql-monitor_slave_lag_when_null                                    | 60             |
| mysql-monitor_threads_min                                            | 8              |
| mysql-monitor_threads_max                                            | 128            |
| mysql-monitor_threads_queue_maxsize                                  | 128            |
| mysql-monitor_local_dns_cache_ttl                                    | 300000         |
| mysql-monitor_local_dns_cache_refresh_interval                       | 60000          |
| mysql-monitor_local_dns_resolver_queue_maxsize                       | 128            |
| mysql-monitor_wait_timeout                                           | true           |
| mysql-monitor_writer_is_also_reader                                  | true           |
| mysql-monitor_history                                                | 600000         |
| mysql-monitor_connect_interval                                       | 2000           |
| mysql-monitor_ping_interval                                          | 2000           |
| mysql-monitor_read_only_interval                                     | 2000           |
| mysql-monitor_read_only_timeout                                      | 500            |
+----------------------------------------------------------------------+----------------+
34 rows in set (0.002 sec)

Admin> SELECT * FROM mysql_servers;
+--------------+---------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 0            | 10.200.20.104 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 0            | 10.200.20.105 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 0            | 10.200.20.106 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.104 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.105 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | 10.200.20.106 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | 10.200.20.104 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | 10.200.20.105 | 3307 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+---------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
8 rows in set (0.001 sec)

Admin> SELECT * FROM mysql_galera_hostgroups;
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
| writer_hostgroup | backup_writer_hostgroup | reader_hostgroup | offline_hostgroup | active | max_writers | writer_is_also_reader | max_transactions_behind | comment |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
| 0                | 2                       | 1                | 4                 | 1      | 1           | 1                     | 100                     | NULL    |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
1 row in set (0.001 sec)

Thanks.

Hi @Samuele-Z,

this is an issue with your configuration; let me elaborate. ProxySQL only performs monitoring actions on servers that are responsive to ping; once a server stops responding, no further monitoring actions take place on it:

2024-05-08 12:35:15 MySQL_Monitor.cpp:2890:monitor_ping(): [ERROR] Server 10.200.20.196:3307 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.

This is determined by your monitoring ping configuration:

SET mysql-monitor_ping_max_failures=3;
SET mysql-monitor_ping_timeout=1000;
SET mysql-monitor_ping_interval=2000;

The monitoring of the cluster is controlled by the following configuration variables:

SET mysql-monitor_galera_healthcheck_interval=5000;
SET mysql-monitor_galera_healthcheck_timeout=800;
SET mysql-monitor_galera_healthcheck_max_timeout_count=3;

Seeing the variables side by side, and with the previous context, the problem is more obvious: your ping window is too short compared with your Galera healthcheck window. This means that before Monitor can take any action on the server through the Galera check, monitoring itself gets disabled for that server. You may find that the server does sometimes get moved to ONLINE; that happens when a Galera monitoring check coincides exactly with the server shutdown, so an error other than a timeout is detected. To fix the issue, you just need to ensure that galera_healthcheck_interval * galera_healthcheck_max_timeout_count is smaller than ping_max_failures * ping_interval. This guarantees that a monitoring action is performed on the server before monitoring gets disabled for it.
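
Concretely, with the values above the Galera window is 5000 ms * 3 = 15000 ms while the ping window is only 2000 ms * 3 = 6000 ms, so ping disables monitoring first. A minimal sketch of a fix, keeping every other variable at the values shown above, is to shrink the Galera healthcheck interval:

SET mysql-monitor_galera_healthcheck_interval=1500;
-- 1500 ms * 3 timeouts = 4500 ms < 6000 ms ping budget, so the Galera check
-- can act before the server is shunned and its monitoring disabled.
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;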

SELECT * FROM runtime_mysql_servers

I also wanted to clarify the effect you see when you select from runtime_mysql_servers. It is just a side effect of the table rebuild. runtime_mysql_servers isn't constantly rebuilt; it is built on demand, whenever required by monitoring actions or by user interaction. Temporary server statuses like SHUNNED are not propagated to this table until it is rebuilt by one of those actions; this is also expected behavior. The Galera implementation in the monitoring module uses this table to perform server placement, and with your configuration the table never changed, because no monitoring actions were being performed. So Monitor isn't aware of the server status until your query rebuilds the table; the very next monitoring action then detects the missing (SHUNNED) server and replaces the writer.
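
If you want to observe this from the admin interface, Monitor keeps per-check history in the monitor schema; a quick way to verify whether Galera health checks are still being recorded for a host is something like:

-- Most recent Galera health checks; a persistent gap in time_start_us for a
-- host indicates its checks have stopped.
SELECT hostname, port, time_start_us, error
FROM monitor.mysql_server_galera_log
ORDER BY time_start_us DESC
LIMIT 10;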

This interaction is likely to change in the future, as it already has for other cluster-monitoring solutions.

Hope this helps to clarify the issue. Thanks, Javier.

Hi @JavierJF, thanks for your reply. I changed the value of mysql-monitor_galera_healthcheck_interval to 1500 and now it works.