planetscale/vitess-operator

Should VTOrc step away while the operator is rolling out a vttablet config change across many tablets?


We run a big Vitess cluster (vitess_version: v13.0.0) with the operator (operator: v2.6.0), with semi_sync enabled and with NO --restore_from_backup flag.

When we make a config change to vttablet, the operator updates the Kubernetes objects (including the Pods, of course), which triggers a massive wave of Pod restarts. In this situation we notice that many shards end up in a bad state, which shows up on the vtgate dashboard, including:

Symptom 1: Broken shards shown on vtgate dashboard

  • All red with no PRIMARY
  • All red REPLICAs with 1 green PRIMARY
  • More than 1 green PRIMARY

Symptom 2: Last_IO_Error or Last_SQL_Errno

There is a high chance of seeing MySQL complain (on our current cluster with semi_sync enabled):
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID sets and the missing purged transactions are too long to print in this message. For more information, please see the master's error log or the manual for GTID_SUBTRACT.'

My bold guess is that when all vttablets are killed within a very short time, VTOrc fails to query and insert records into the _vt.reparent_journal table, which leads the vttablets to configure MySQL replication from a stale binlog position?
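To sanity-check that guess, this is a rough diagnostic sketch (not operator code) that dumps the most recent _vt.reparent_journal entries on the tablet we suspect is the stale primary. The DSN/socket path is a placeholder for however your mysqld is reachable, the ordering assumes the time_created_ns column from the v13 schema, and rows are printed generically since column names vary slightly between Vitess versions:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// DSN is a placeholder: point it at the suspected stale primary's mysqld.
	db, err := sql.Open("mysql", "vt_dba@unix(/vt/socket/mysql.sock)/_vt")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Dump the latest reparent journal rows to see whether VTOrc's reparent landed here.
	rows, err := db.Query("SELECT * FROM _vt.reparent_journal ORDER BY time_created_ns DESC LIMIT 5")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	vals := make([]any, len(cols))
	ptrs := make([]any, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	for rows.Next() {
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		// Print column=value pairs, rendering []byte columns as text.
		for i, c := range cols {
			if b, ok := vals[i].([]byte); ok {
				fmt.Printf("%s=%s ", c, b)
			} else {
				fmt.Printf("%s=%v ", c, vals[i])
			}
		}
		fmt.Println()
	}
}
```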

In that sense, the Last_SQL_Errno we saw on the semi_sync-disabled cluster we were running before also makes sense:
Last_SQL_Errno: 1062 Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '2ee01d56-d6c6-11ec-8ba0-4ec0a33868ce:13558540' at master log , end_log_pos 6220261. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

Symptom 3: Chaos replications on Vtorc web page

  • It is easy to find orphaned clusters on VTOrc's clusters page.
  • No replication relationships are shown on VTOrc's web page.
  • Chained replication topology rather than a star topology, with all replication streams broken (no IO thread running).

We can reproduce this easily. It looks like VTOrc should step away while the vttablets are in the middle of a config upgrade.

Some great hints from @GuptaManan100, quoted here:

What is breaking the flow is that VTOrc assumes that it is the sole actor after acquiring a shard lock, but the operator isn’t acquiring it before restarting the vttablets.
There are 2 possible fixes here:
  • Vtop acquires the lock before any vttablet restarts and releases it only after the restart is successful; in this case, VTOrc won't take any actions before it sees all the changes.
  • On the VTOrc roadmap there is a task to add an API to disableGlobalRecoveries in VTOrc temporarily, and turn it back on later. That can be employed in this case, either by vtop doing it, or manually before rolling any upgrades.
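To make the first option concrete, here is a rough sketch of what operator-side locking could look like, assuming the operator can reach the Vitess topology server with the topo client from vitess.io/vitess. The topo address, keyspace/shard names, and the restartTablets callback are placeholders, not existing operator code:

```go
package main

import (
	"context"
	"log"
	"time"

	"vitess.io/vitess/go/vt/topo"
)

// rollShard takes the same shard lock VTOrc takes before a recovery, so VTOrc
// will not act on the shard while the vttablet Pods are being restarted.
func rollShard(ctx context.Context, ts *topo.Server, keyspace, shard string, restartTablets func(context.Context) error) error {
	lockCtx, unlock, err := ts.LockShard(ctx, keyspace, shard, "vitess-operator rolling vttablet config change")
	if err != nil {
		return err
	}
	// The lock is released only after the restart finishes (successfully or not).
	defer unlock(&err)

	err = restartTablets(lockCtx)
	return err
}

func main() {
	// Example wiring; the topo implementation, address, and root are placeholders.
	ts, err := topo.OpenServer("etcd2", "etcd:2379", "/vitess/global")
	if err != nil {
		log.Fatal(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	err = rollShard(ctx, ts, "commerce", "-", func(ctx context.Context) error {
		// ... delete/recreate the vttablet Pods of this shard here ...
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```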

I'm also wondering: what if the operator had an option to kill Pods sequentially, with a configurable delay in minutes, instead of all at once? That would help in this case while letting us keep VTOrc running alongside.
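A minimal sketch of that idea using client-go is below. The namespace, label selector, and delay are placeholders, not an existing operator option:

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	const namespace = "vitess"
	const waitBetweenPods = 3 * time.Minute // the "configurable minutes" from the proposal

	// List the vttablet Pods that need the new config (selector is a placeholder).
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "planetscale.com/component=vttablet",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Delete one Pod at a time and wait before touching the next, so VTOrc
	// (and replication) only ever has to cope with one missing tablet.
	for _, pod := range pods.Items {
		log.Printf("deleting %s", pod.Name)
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			log.Fatal(err)
		}
		time.Sleep(waitBetweenPods)
	}
}
```

Waiting for the replaced Pod to report Ready before moving on would probably be better than a fixed sleep, but even a fixed delay would keep VTOrc from seeing the whole shard disappear at once.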