abg/dbsake

sandbox initialization is racy

Closed this issue · 4 comments

abg commented

To grant a new database user, the sandbox command effectively does the following (sketched as a script after this list):

  • create a named temporary file on disk
  • generate an appropriate GRANT DCL and write to this temp file
  • start the sandbox via sandbox.sh with --init-file=${temporary_filename}
  • stop the sandbox and wait for sandbox.sh stop to finish
  • remove the temporary file
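A minimal sketch of that sequence, assuming the sandbox lives in $SANDBOX_DIR and that sandbox.sh passes extra options through to mysqld; the user name, password and paths below are placeholders, not dbsake's actual code:

    #!/bin/bash
    # Sketch of the init-file based grant flow; paths and credentials are
    # placeholders, not dbsake's actual implementation.
    SANDBOX_DIR=${SANDBOX_DIR:-$HOME/sandboxes/mysql}

    # 1. create a named temporary file on disk
    init_file=$(mktemp /tmp/sandbox-init.XXXXXX)

    # 2. generate the GRANT DCL and write it to the temp file
    cat > "$init_file" <<'EOF'
    CREATE USER 'sandbox'@'localhost' IDENTIFIED BY 'changeme';
    GRANT ALL PRIVILEGES ON *.* TO 'sandbox'@'localhost' WITH GRANT OPTION;
    EOF

    # 3. start the sandbox so mysqld replays the init file
    "$SANDBOX_DIR/sandbox.sh" start --init-file="$init_file"

    # 4. stop the sandbox and wait for the stop action to finish
    "$SANDBOX_DIR/sandbox.sh" stop

    # 5. remove the temporary file
    rm -f "$init_file"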

The problem here exposes a serious issue with the sandbox.sh start and stop actions. The "start" action waits for the unix socket to appear and then exits successfully, assuming MySQL has started. The "stop" action looks for the MySQL pid-file, finds the pid, sends it a SIGTERM and waits for the associated process to go away. Similarly, the "status" action looks for the pid-file, finds the pid and verifies the associated process is still running (kill -0).

So there is a discrepancy - "start" looks at the socket, but "stop"/"status" look at the pid-file. This causes problems because the socket shows up early, during network initialization, while --init-file processing doesn't happen until some time later. So the start + stop actions done by dbsake here are broken - sandbox.sh isn't waiting long enough and dbsake is potentially removing the init-file before MySQL can read it.

Effectively, the start action needs to be fixed here. Waiting only for the MySQL unix socket to appear is broken, and it breaks --init-file processing along with it.
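For illustration, the socket-based readiness check being criticized amounts to something like the following paraphrase (not the literal sandbox.sh code; the socket path and timeout are placeholders):

    # Paraphrase of the socket-based "start" check; $socket is the sandbox's
    # mysql.sock path.  The socket appears during network initialization,
    # before --init-file has been processed -- hence the race.
    wait_for_socket() {
        local socket=$1 timeout=${2:-30}
        for ((i = 0; i < timeout; i++)); do
            if [ -S "$socket" ]; then
                return 0    # reported as "started", but possibly too early
            fi
            sleep 1
        done
        return 1
    }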

abg commented

FWIW, this may be an issue only on MariaDB/Galera (and possibly PXC). I don't recall ever seeing this in another situation, despite setting up several dozen sandboxes.

Regardless, I think the logic is probably flawed and a better approach is required. And the Galera server case is still important.

I don't know of any filesystem-level checks to ensure that MySQL is online aside from waiting until "ready for connections" is printed to the log file.

Would polling mysqladmin ping until it reports that MySQL is online be a logical choice?
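For example, something along these lines (a sketch only; the socket path and timeout are placeholders):

    # Sketch: poll mysqladmin ping against the sandbox socket until the server
    # answers or the timeout expires.
    wait_for_ping() {
        local socket=$1 timeout=${2:-60}
        for ((i = 0; i < timeout; i++)); do
            if mysqladmin --socket="$socket" ping >/dev/null 2>&1; then
                return 0
            fi
            sleep 1
        done
        return 1
    }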

abg commented

Yeah, the RHEL MySQL initscript uses "mysqladmin ping" and sits in a loop retrying periodically up to some timeout value. The mysql.com packaging uses a different approach and waits for the pid-file (which is written by mysqld itself) to appear. I think this probably solves it as well and would be very similar to the current dbsake sandbox socket check logic.

I think Galera is a bit strange since there is basically a two-phase startup process handled by mysqld_safe. I gleaned the following understanding from reading MariaDB-Galera-Cluster 10.0.17's mysqld_safe (roughly paraphrased as a script after this list):

  • Unconditionally, mysqld_safe runs $mysqld with the --wsrep_recover option and a --log-error=$(mktemp $DATADIR/wsrep_recovery.XXXXXX) path
  • The $log_error location is grepped for "WSREP: Recovered position" and the actual position coordinates are extracted
  • mysqld is restarted with --wsrep_start_position and MySQL then comes online
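Roughly paraphrased as shell (simplified from the description above, not a verbatim excerpt of mysqld_safe):

    # Simplified paraphrase of MariaDB-Galera mysqld_safe's two-phase startup.
    # $mysqld and $DATADIR are resolved by mysqld_safe itself.

    # Phase 1: recover the wsrep position, logging to a temp file in the datadir
    wsrep_log=$(mktemp "$DATADIR/wsrep_recovery.XXXXXX")
    "$mysqld" --wsrep_recover --log-error="$wsrep_log"

    # Extract the coordinates from a line like
    #   "WSREP: Recovered position: <uuid>:<seqno>"
    start_position=$(sed -n 's/.*WSREP: Recovered position:[[:space:]]*//p' "$wsrep_log" | tail -n 1)

    # Phase 2: the real server run, seeded with the recovered position
    "$mysqld" --wsrep_start_position="$start_position"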

So mysqld is run twice. This is done in MariaDB 10.0, at least, whether or not wsrep-provider is actually configured. I suspect the logic is very similar in PXC's implementation.

Observing strace output shows that the server goes through network initialization on both mysqld runs during startup. However, the pid-file only seems to be created during the normal (sans --wsrep-recover) run.

So I think a very trivial improvement is to just swap the $socket logic for $pid_file with a few minor tweaks - this follows the process used in Oracle/MySQL's ./support-files/mysql.server.sh, roughly sketched below. mysqladmin ping may be necessary (or at least, more robust) but I was hoping to avoid it, as it seems trickier to get right.
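A rough sketch of that pid-file based check, in the style of mysql.server.sh (variable names and timeout are placeholders, not the final dbsake implementation):

    # Sketch: wait for the pid-file that mysqld itself writes, instead of the
    # unix socket, which the discussion above argues is a better readiness
    # signal for --init-file processing.
    wait_for_pid_file() {
        local pid_file=$1 timeout=${2:-60}
        for ((i = 0; i < timeout; i++)); do
            if [ -s "$pid_file" ]; then
                return 0
            fi
            sleep 1
        done
        return 1
    }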

abg commented

Resolved in 2.1.1