codership/mysql-wsrep

Spurious bf aborts under ps protocol

sciascid opened this issue · 1 comment

Running the galera suites with the --ps-protocol option may cause some tests to fail with unexpected deadlock errors, which cannot happen when the same tests run with the normal protocol.
These deadlock errors occur because COM_STMT_PREPARE processing may open the statement's target tables, which makes the prepare stage vulnerable to bf aborts triggered by concurrent DDLs.
In addition, no wsrep sync wait is performed before COM_STMT_PREPARE commands are processed.
Any test that executes statements concurrently with DDLs, and relies on sync wait to keep those statements from failing, is potentially affected by this issue.

A deterministic test has been devised:

--source include/galera_cluster.inc
--source include/have_debug_sync.inc

if (`SELECT $PS_PROTOCOL = 0`)
{
  --skip Test requires: ps-protocol enabled
}

CREATE TABLE t1 (f1 INTEGER PRIMARY KEY, f2 CHAR(6)) ENGINE=InnoDB;

--connection node_1
SET GLOBAL DEBUG = "+d,sync.wsrep_apply_cb";

--connection node_2
OPTIMIZE TABLE t1;

--connection node_1
SET DEBUG_SYNC = "now WAIT_FOR sync.wsrep_apply_cb_reached";

SET DEBUG_SYNC = "stmt_prepare_before_mdl_release SIGNAL signal.wsrep_apply_cb WAIT_FOR bf_abort";
UPDATE t1 SET f2 = 2 WHERE f1 = 1;

The test requires a new debug sync point:

diff --git a/sql/sql_prepare.cc b/sql/sql_prepare.cc
index 569499bbc44..c5d4a9ccf3e 100644
--- a/sql/sql_prepare.cc
+++ b/sql/sql_prepare.cc
@@ -128,6 +128,7 @@ When one supplies long data for a placeholder:
 #include <limits>
 using std::max;
 using std::min;
+#include "debug_sync.h"
 
 /**
   A result class used to send cursor rows using the binary protocol.
@@ -3498,6 +3499,7 @@ bool Prepared_statement::prepare(const char *packet, uint packet_len)
   /* No need to commit statement transaction, it's not started. */
   DBUG_ASSERT(thd->transaction.stmt.is_empty());
 
+  DEBUG_SYNC(thd, "stmt_prepare_before_mdl_release");
   close_thread_tables(thd);
   thd->mdl_context.rollback_to_savepoint(mdl_savepoint);

I see two potential solutions:

  • Add a sync wait before COM_STMT_PREPARE
  • Make DDLs wait until conflicting COM_STMT_PREPARE commands have finished executing
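
The first option can be illustrated with a small standalone toy model. This is only a sketch and not server code: ApplyTracker, mark_applied, sync_wait and prepare_with_sync_wait are hypothetical names invented for this illustration, and the seqno stands in for the position of the last replicated write set (for example a DDL) that the node is expected to have applied. The point it shows is that the prepare step blocks until the node has caught up and only then opens tables, so a DDL that is already replicated but not yet applied can no longer collide with the prepare stage.

```cpp
// Toy model of "sync wait before COM_STMT_PREPARE".
// Assumption: all names below are hypothetical and exist only for illustration.
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Tracks the highest write-set sequence number applied on this node.
class ApplyTracker {
public:
  // Called by the applier after a replicated write set has been applied.
  void mark_applied(std::uint64_t seqno) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (seqno > applied_) applied_ = seqno;
    }
    cond_.notify_all();
  }

  // Blocks the caller until everything up to `seqno` has been applied.
  void sync_wait(std::uint64_t seqno) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [&] { return applied_ >= seqno; });
  }

  // Returns the highest applied sequence number seen so far.
  std::uint64_t applied() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return applied_;
  }

private:
  mutable std::mutex mutex_;
  std::condition_variable cond_;
  std::uint64_t applied_ = 0;
};

// Models the prepare step with a sync wait in front of it.
// Returns the applied seqno observed at the point where the real
// prepare code would open the statement's target tables.
std::uint64_t prepare_with_sync_wait(ApplyTracker &tracker,
                                     std::uint64_t cluster_seqno) {
  tracker.sync_wait(cluster_seqno);  // catch up before touching any table
  return tracker.applied();          // tables would be opened here
}
```

In this model a prepare that targets seqno N can only proceed once the applier has called mark_applied with N or higher, which is the ordering the test above relies on for ordinary statements.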

This issue no longer reproduces. Sync wait before COM_STMT_PREPARE was added.