sysown/proxysql

"max_lag_ms" of mysql_aws_aurora_hostgroups doesn't work

xykhappy opened this issue · 3 comments

As described in the document, ProxySQL should be able to change the status of backend servers automatically based on the detected replica lag.

max_lag_ms : replicas that have a replication greater or equal than max_lag_ms milliseconds are automatically disabled from the cluster until their lag returns below the configured threshold.

However it seems like the status change only happens once when mysql_aws_aurora_hostgroups got updated. The status never gets adjusted again regardless of the newly detected lag.

After looking into the codes of ProxySQL, I think there should be a code bug in MySQL_Monitor::evaluate_aws_aurora_results of MySQL_Monitor.cpp as shown below:

At line 6370
unsigned int current_lag_ms = estimate_lag(hse->server_id, lasts_ase, ase_idx, add_lag_ms, min_lag_ms, lag_num_checks);
hse->estimated_lag_ms = current_lag_ms;
if (current_lag_ms > max_latency_ms) {
	enable = false;
}
......
				unsigned int prev_lag_ms = estimate_lag(hse->server_id, lasts_ase, ase_idx, add_lag_ms, min_lag_ms, lag_num_checks);
				if (prev_lag_ms > max_latency_ms) {
					prev_enabled = false;
				}
				if (prev_enabled == enable) {
					// the previous status should be the same
					// do not run any action
					run_action = false;
				}

void MySQL_Monitor::evaluate_aws_aurora_results(unsigned int wHG, unsigned int rHG, AWS_Aurora_status_entry **lasts_ase, unsigned int ase_idx, unsigned int max_latency_ms, unsigned int add_lag_ms, unsigned int min_lag_ms, unsigned int lag_num_checks) {

According to the code pieces, the "ase_idx" used for "estimate_lag" is not changed between the two calls, which results "current_lag_ms" and "prev_lag_ms" the same all the time. So it always falls into "prev_enabled == enable" true and takes no action.

After comparing to previous version, I tried to change the second "ase_idx" to "prev_ase_idx" and tested with the new build. As expected, everything worked well!

Please help check if the conclusion above is correct or not. This is really important to our project.

BTW, the explanation of "max_lag_ms" needs to be corrected as:
max_lag_ms : replicas that have a replication greater or equal than max_lag_ms milliseconds are automatically disabled from the cluster until their lag returns below or equal to the configured threshold.

if (current_lag_ms > max_latency_ms) {

Thanks,
Yuankai

Hello @xykhappy,

first, thanks for the detailed report. This was indeed an issue, and was fixed by this commit as part of this PR. That mentioned PR has been recently merged as part of this bigger PR. So this issue is already fixed in the current development branch, and of course will be part of the next release.

Thank you, Javier.

@JavierJF Great to know!
And may I know what's the rough schedule for next release? I have checked the release history and the cadence was like once per 1~2 months. However it hasn't been updated since last August somehow.

@JavierJF And do you think if it's safe to cherry-pick this commit to v2.5.5 and build a local binary for production usage?