MariaDB crashing in submit13 after stop/start
Impact of the bug
MariaDB in submit13
Describe the bug
As reported over the past few days, and again in today's WMCore meeting, the FNAL agents were having issues and progressing slowly through the workload.
Based on that, and on reports of "lock wait timeout" in a few components, I decided to:
- stop wmagent components
- stop wmagent container
- stop mariadb container (the docker stop command actually timed out, hence the container was forcibly killed)
- start mariadb container (after which mariadb-admin failed to connect to localhost)
and everything else went downhill!
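For the record, that sequence roughly corresponds to the following shell commands (a sketch; the container names and the exact manage invocation are assumptions, not copied from the host):
docker exec wmagent manage stop-agent   # hypothetical invocation of the manage script
docker stop wmagent
docker stop mariadb   # docker stop sends SIGTERM, waits 10s by default, then SIGKILLs - hence the forced kill
docker start mariadb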
After a few attempts, I managed to start MariaDB from inside the container, and I can access the database prompt and run queries.
However, after starting WMAgent, I see that every single component crashed while communicating with the RDBMS.
How to reproduce it
Not sure
Expected behavior
The database should be fully operational, keeping up with the same load we have always run.
Additional context and error message
The full error log is at /data/srv/mariadb/current/logs/error.log.
Inside this log, we can see the moment the service was stopped (everything looks normal to me):
2024-10-07 17:50:36 0 [Note] /usr/sbin/mariadbd (initiated by: xxx[xxx] @ localhost []): Normal shutdown
2024-10-07 17:50:36 0 [Note] InnoDB: FTS optimize thread exiting.
2024-10-07 17:50:36 0 [Note] InnoDB: Starting shutdown...
2024-10-07 17:50:36 0 [Note] InnoDB: Dumping buffer pool(s) to /data/srv/mariadb/10.6.5/install/database/ib_buffer_pool
2024-10-07 17:50:36 0 [Note] InnoDB: Restricted to 129552 pages due to innodb_buf_pool_dump_pct=25
2024-10-07 17:50:36 0 [Note] InnoDB: Buffer pool(s) dump completed at 241007 17:50:36
Followed by a restart of the mariadb container:
241007 17:52:26 mysqld_safe Starting mariadbd daemon with databases from /data/srv/mariadb/10.6.5/install/database
2024-10-07 17:52:27 0 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used; This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=cmsgwms-submit13.fnal.gov' or '--log-bin=mariadbd-bin' to avoid this problem.
2024-10-07 17:52:27 0 [Note] /usr/sbin/mariadbd (server 10.6.5-MariaDB-1:10.6.5+maria~focal-log) starting as process 346 ...
2024-10-07 17:52:27 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-10-07 17:52:27 0 [Note] InnoDB: Number of pools: 1
2024-10-07 17:52:27 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-10-07 17:52:27 0 [Note] InnoDB: Using Linux native AIO
2024-10-07 17:52:27 0 [Note] InnoDB: Initializing buffer pool, total size = 8589934592, chunk size = 134217728
2024-10-07 17:52:27 0 [Note] InnoDB: Completed initialization of buffer pool
2024-10-07 17:52:27 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=290549958163,290550247415
2024-10-07 17:52:38 0 [Note] InnoDB: Starting final batch to recover 82737 pages from redo log.
2024-10-07 17:52:42 0 [Note] InnoDB: To recover: 82385 pages from log
2024-10-07 17:52:57 0 [Note] InnoDB: To recover: 81951 pages from log
2024-10-07 17:53:12 0 [Note] InnoDB: To recover: 81533 pages from log
2024-10-07 17:53:27 0 [Note] InnoDB: To recover: 81378 pages from log
241007 17:53:35 mysqld_safe Starting mariadbd daemon with databases from /data/srv/mariadb/10.6.5/install/database
2024-10-07 17:53:35 0 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used; This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=cmsgwms-submit13.fnal.gov' or '--log-bin=mariadbd-bin' to avoid this problem.
2024-10-07 17:53:35 0 [Note] /usr/sbin/mariadbd (server 10.6.5-MariaDB-1:10.6.5+maria~focal-log) starting as process 708 ...
2024-10-07 17:53:35 0 [ERROR] mariadbd: Can't lock aria control file '/data/srv/mariadb/10.6.5/install/database/aria_log_control' for exclusive use, error: 11. Will retry for 30 seconds
2024-10-07 17:53:42 0 [Note] InnoDB: To recover: 64520 pages from log
2024-10-07 17:53:57 0 [Note] InnoDB: To recover: 54171 pages from log
2024-10-07 17:54:05 0 [ERROR] mariadbd: Got error 'Could not get an exclusive lock; file is probably in use by another process' when trying to use aria control file '/data/srv/mariadb/10.6.5/install/database/aria_log_control'
2024-10-07 17:54:05 0 [ERROR] Plugin 'Aria' init function returned error.
2024-10-07 17:54:05 0 [ERROR] Plugin 'Aria' registration as a STORAGE ENGINE failed.
2024-10-07 17:54:05 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
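Error 11 on the aria control file is EAGAIN, i.e. the lock was still held, almost certainly by the first mariadbd (process 346) still busy with crash recovery when the second one (process 708) started. A quick way to confirm that two servers are competing for the same datadir (a sketch, assuming standard tools are available in the container):
pgrep -af mariadbd                                                    # list all running mariadbd processes with their command lines
fuser -v /data/srv/mariadb/10.6.5/install/database/aria_log_control   # show which process holds the lock file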
While I was repeatedly trying to stop/start the service (without pausing to look into the logs to see what was going on), we see this in the same log:
2024-10-07 17:57:13 0 [Note] InnoDB: Starting final batch to recover 82737 pages from redo log.
2024-10-07 17:57:25 0 [Note] InnoDB: To recover: 81785 pages from log
2024-10-07 17:57:27 0 [Note] InnoDB: To recover: 82402 pages from log
2024-10-07 17:57:40 0 [Note] InnoDB: To recover: 81512 pages from log
2024-10-07 17:57:42 0 [Note] InnoDB: To recover: 82102 pages from log
2024-10-07 17:57:55 0 [Note] InnoDB: To recover: 81373 pages from log
2024-10-07 17:57:57 0 [Note] InnoDB: To recover: 81676 pages from log
2024-10-07 17:58:00 0 [ERROR] InnoDB: Page [page id: space=0, page number=7197] log sequence number 292068864554 is in the future! Current system log sequence number 292068367911.
2024-10-07 17:58:00 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
My suspicion is that I was trying to stop/start MariaDB with the manage script, and even though the database was failing to start up:
241007 17:52:26 mysqld_safe Starting mariadbd daemon with databases from /data/srv/mariadb/10.6.5/install/database
mariadb-admin: connect to server at 'localhost' failed
error: 'Can't connect to local server through socket '/var/run/mysqld/mariadb.sock' (2)'
Check that mariadbd is running and that the socket: '/var/run/mysqld/mariadb.sock' exists!
and consequently never properly stopping, at some point I ended up running 2 MariaDB processes, which corrupted the database. I have this hypothesis because of these 2 lines in the original log shared in the description above:
241007 17:52:26 mysqld_safe Starting mariadbd...
241007 17:53:35 mysqld_safe Starting mariadbd...
This MariaDB JIRA ticket:
https://jira.mariadb.org/browse/MDEV-28495
reports a very similar problem (if not the same). According to the bug-fix linked in the last comment of that ticket, the issue was only resolved in MariaDB 10.6.15, while we run 10.6.5(!!!)
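For reference, the running server version can be double-checked from the client (using the same socket path as elsewhere in this issue):
mariadb --socket=/var/run/mysqld/mariadb.sock -e "SELECT VERSION();"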
I will soon look into the recovery documentation and see if we can recover this data.
I've been trying to get a dump of the database, but it keeps crashing at the same table:
(MariaDB-10.6.5) [xxx@submit13:current]$ mariadb-dump --socket=/var/run/mysqld/mariadb.sock wmagent --single-transaction --verbose > /data/admin/wmagent/mariadb_bkp2/backup-file-2.sql
...
-- Retrieving table structure for table wmbs_sub_files_acquired...
-- Sending SELECT query...
-- Retrieving rows...
mariadb-dump: Error 2013: Lost connection to server during query when dumping table `wmbs_sub_files_acquired` at row: 244986
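One way to get the rest of the database past a corrupt table is to exclude it from the full dump and deal with it separately; a sketch using the standard --ignore-table option (the output file name is made up):
mariadb-dump --socket=/var/run/mysqld/mariadb.sock wmagent --single-transaction \
  --ignore-table=wmagent.wmbs_sub_files_acquired > backup-minus-acquired.sql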
The dump has been attempted both in normal mode and in recovery mode 1 (enabled by updating the startup command in the manage script), as confirmed by:
2024-10-08 10:42:55 0 [Note] InnoDB: !!! innodb_force_recovery is set to 1 !!!
According to this documentation:
https://mariadb.com/kb/en/innodb-recovery-modes/
I could try recovery modes up to 3. After that, the recovery is more "aggressive"...
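For the record, a minimal sketch of how the recovery mode is raised, either in the server config or on the command line (the config file path is an assumption; on submit13 it was done by editing the startup command in the manage script):
# my.cnf (or equivalent), [mysqld] section:
[mysqld]
innodb_force_recovery = 1   # modes 1-3 are relatively safe; 4-6 can permanently lose data
# or directly on the command line:
# mariadbd --innodb-force-recovery=1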
The MariaDB database check did not manage to complete either, as the service crashed after scanning databases/tables for an hour or so:
(MariaDB-10.6.5) [xxx@submit13:current]$ mariadb-check --verbose --socket=/var/run/mysqld/mariadb.sock --all-databases
Processing databases
information_schema
mysql
mysql.column_stats OK
...
mysql.transaction_registry OK
performance_schema
sys
sys.sys_config OK
wmagent
wmagent.bl_runjob OK
...
wmagent.wmbs_frl_workunit_assoc OK
mariadb-check: Got error: 2013: Lost connection to server during query when executing 'CHECK TABLE ... '
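To narrow the crash down to a specific table, tables can also be checked one at a time, e.g.:
mariadb --socket=/var/run/mysqld/mariadb.sock -e "CHECK TABLE wmagent.wmbs_job EXTENDED;"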
The error log says:
2024-10-08 14:46:21 6 [ERROR] InnoDB: In pages [page id: space=27, page number=294730] and [page id: space=27, page number=23462] of index `wmbs_job_unique` of table `wmagent`.`wmbs_job`, index tree level 1
InnoDB: broken FIL_PAGE_NEXT or FIL_PAGE_PREV links
2024-10-08 14:46:21 6 [ERROR] InnoDB: In page 23462 of index `wmbs_job_unique` of table `wmagent`.`wmbs_job`, index tree level 1
2024-10-08 14:46:21 6 [ERROR] InnoDB: Node ptrs differ on levels > 0
InnoDB: node ptr PHYSICAL RECORD: n_fields 5; compact format; info bits 0
0: len 30; hex 61626662353565312d626337302d346333352d393564352d303331343462; asc abfb55e1-bc70-4c35-95d5-03144b; (total 39 bytes);
1: SQL NULL;
2: SQL NULL;
3: len 4; hex 8064bab4; asc d ;;
4: len 4; hex 00005ba6; asc [ ;;
InnoDB: first rec PHYSICAL RECORD: n_fields 5; compact format; info bits 0
0: len 30; hex 61633130626163392d326633392d343165322d623935622d623361663861; asc ac10bac9-2f39-41e2-b95b-b3af8a; (total 36 bytes);
1: len 30; hex 2f646174612f7372762f776d6167656e742f63757272656e742f696e7374; asc /data/srv/wmagent/current/inst; (total 266 bytes);
2: len 30; hex 2f646174612f7372762f776d6167656e742f63757272656e742f696e7374; asc /data/srv/wmagent/current/inst; (total 279 bytes);
3: len 4; hex 80252fb3; asc %/ ;;
4: len 4; hex 00065291; asc R ;;
2024-10-08 14:52:07 6 [ERROR] InnoDB: Corruption of an index tree: table `wmagent`.`wmbs_job` index `wmbs_job_unique`, father ptr page no 50211, child page no 50222
PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 30; hex 34386264353766372d663533652d346330312d613863622d653662633134; asc 48bd57f7-f53e-4c01-a8cb-e6bc14; (total 40 bytes);
1: len 30; hex 2f646174612f7372762f776d6167656e742f63757272656e742f696e7374; asc /data/srv/wmagent/current/inst; (total 193 bytes);
2: len 30; hex 2f646174612f7372762f776d6167656e742f63757272656e742f696e7374; asc /data/srv/wmagent/current/inst; (total 206 bytes);
3: len 4; hex 8086efb2; asc ;;
2024-10-08 14:52:07 6 [Note] InnoDB: n_owned: 0; heap_no: 2; next rec: 585
PHYSICAL RECORD: n_fields 5; compact format; info bits 0
0: len 30; hex 34386264353766372d663533652d346330312d613863622d653662633134; asc 48bd57f7-f53e-4c01-a8cb-e6bc14; (total 40 bytes);
1: len 30; hex 2f646174612f7372762f776d6167656e742f63757272656e742f696e7374; asc /data/srv/wmagent/current/inst; (total 193 bytes);
2: len 30; hex 2f646174612f7372762f776d6167656e742f63757272656e742f696e7374; asc /data/srv/wmagent/current/inst; (total 206 bytes);
3: len 4; hex 8086efc1; asc ;;
4: len 4; hex 0000c423; asc #;;
2024-10-08 14:52:07 6 [Note] InnoDB: n_owned: 0; heap_no: 40; next rec: 1181
2024-10-08 14:52:07 6 [ERROR] [FATAL] InnoDB: You should dump + drop + reimport the table to fix the corruption. If the crash happens at database startup. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery. Then dump + drop + reimport.
241008 14:52:07 [ERROR] mysqld got signal 6 ;
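The fatal message prescribes the standard dump + drop + reimport cycle; a sketch of that cycle for wmbs_job, assuming the server stays up long enough to serve the dump:
mariadb-dump --socket=/var/run/mysqld/mariadb.sock wmagent wmbs_job > wmbs_job.sql
mariadb --socket=/var/run/mysqld/mariadb.sock wmagent -e "DROP TABLE wmbs_job;"
mariadb --socket=/var/run/mysqld/mariadb.sock wmagent < wmbs_job.sql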
I collected a dump of the wmagent database, and all the files are available under:
[cmsdataops@cmsgwms-submit13 wmagent]$ ls -lth /data/dockerMount/admin/wmagent/mariadb_table_bkp/
total 19G
-rw-r--r-- 1 xxx xxx 15G Oct 8 19:01 backup-file.sql
-rw-r--r-- 1 xxx xxx 3.1G Oct 8 15:16 wmbs_job.sql
-rw-r--r-- 1 xxx xxx 2.3K Oct 8 15:03 wmbs_workunit.sql
-rw-r--r-- 1 xxx xxx 1.4M Oct 8 15:03 wmbs_workflow_output.sql
-rw-r--r-- 1 xxx xxx 14M Oct 8 15:03 wmbs_workflow.sql
-rw-r--r-- 1 xxx xxx 2.9K Oct 8 15:03 wmbs_users.sql
-rw-r--r-- 1 xxx xxx 3.7M Oct 8 15:03 wmbs_subscription_validation.sql
-rw-r--r-- 1 xxx xxx 1.9M Oct 8 15:02 wmbs_subscription.sql
-rw-r--r-- 1 xxx xxx 2.2K Oct 8 15:02 wmbs_sub_types.sql
-rw-r--r-- 1 xxx xxx 11M Oct 8 15:02 wmbs_sub_files_failed.sql
-rw-r--r-- 1 xxx xxx 470M Oct 8 15:00 wmbs_sub_files_complete.sql
-rw-r--r-- 1 xxx xxx 19M Oct 8 14:50 wmbs_sub_files_available.sql
-rw-r--r-- 1 xxx xxx 4.3M Oct 8 14:10 wmbs_sub_files_acquired.sql
where backup-file.sql contains most of the tables; but given that the full dump was crashing at the table wmbs_sub_files_acquired, I decided to dump table by table for the remainder of the database, roughly as sketched below.
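The per-table dumps listed above were produced along these lines (a sketch; the actual invocations may have differed):
for t in wmbs_job wmbs_workunit wmbs_workflow_output wmbs_workflow wmbs_users \
         wmbs_subscription_validation wmbs_subscription wmbs_sub_types \
         wmbs_sub_files_failed wmbs_sub_files_complete \
         wmbs_sub_files_available wmbs_sub_files_acquired; do
  mariadb-dump --socket=/var/run/mysqld/mariadb.sock wmagent "$t" > "${t}.sql"
done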
Now I am going to try to load these dumps into a fresh new instance (testing in submit14); if everything goes well, I will then perform the same procedure in submit13.
It is important to mention though that the following 2 tables are crashing MariaDB:
mariadb-dump: Error 2013: Lost connection to server during query when dumping table `wmbs_sub_files_acquired` at row: 244986
mariadb-dump: Error 2013: Lost connection to server during query when dumping table `wmbs_sub_files_available` at row: 1095366
so their dump is likely incomplete, but I fear we will have to move forward with it like this.
Other than the 2 tables reported above - listing them here as well: wmbs_sub_files_acquired and wmbs_sub_files_available - whose corrupted data caused their MariaDB dumps to be only partial, I believe the remaining wmagent tables have been fully dumped (backed up).
After that, I had to drop the wmagent database in submit13, which led to failures initializing a new database schema. With that, my next step was a manage mysql-clean and the full removal of /data/dockerMount/srv/mariadb.
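In other words, roughly (a hypothetical reconstruction; the exact manage invocation may differ):
manage mysql-clean                      # from inside the agent container
rm -rf /data/dockerMount/srv/mariadb    # from the host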
Next, I redeployed MariaDB, then restarted the WMAgent container, such that it could create the database schema.
Next, I started loading the dump file located at:
[xxx@submit13 ~]$ ls /data/dockerMount/srv/mariadb/current/logs/backup-file.sql
into MariaDB (which is no longer running in recovery mode).
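The reload itself is a plain client import; a sketch (per-table dump files would be loaded the same way):
mariadb --socket=/var/run/mysqld/mariadb.sock wmagent < /data/dockerMount/srv/mariadb/current/logs/backup-file.sql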
Loading this data took about 2h in total. Before starting things up, I decided to run a database check, and this time everything is reported OK and no crashes have been seen:
(MariaDB-10.6.5) [xxx@submit13:logs]$ mariadb-check --socket=/var/run/mysqld/mariadb.sock --all-databases
mysql.column_stats OK
...
wmagent.wmbs_workflow_output OK
wmagent.wmbs_workunit OK
(MariaDB-10.6.5) [xxx@submit13:logs]$
In general, I was following the MariaDB documentation and explanations from various comments. For the dump/load procedure, this link summarizes it well: https://mariadb.com/kb/en/innodb-recovery-modes/#fixing-things
NOW it is time to try to restart this agent and see how things evolve.
The agent did start up; some components are complaining about the local CouchDB timing out, but things are flowing.
I also somewhat expect other failures, as the database is definitely missing either sub_available or sub_acquired files, so there might be some jobs (and workflows) with an "unexplainable" outcome.
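One rough way to size that loss would be to compare the reloaded row counts against the rows at which the dumps crashed (244986 and 1095366, respectively); a sketch:
mariadb --socket=/var/run/mysqld/mariadb.sock wmagent \
  -e "SELECT COUNT(*) FROM wmbs_sub_files_acquired; SELECT COUNT(*) FROM wmbs_sub_files_available;"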
The agent is almost drained now, but it required some workflow manipulation, with which the P&R team promptly and kindly helped us out. Here is a summary of a few of those:
A few workflows were hitting a unique constraint in WorkQueueManager, e.g.:
Error: (MySQLdb.IntegrityError) (1062, "Duplicate entry 'cmsunified_task_TOP-RunIISummer20UL18wmLHEGEN-00710__v1_T_240...' for key 'uniq_wf_name'")
and the following workflows have been either aborted or force-completed, such that the component could move forward:
cmsunified_task_TOP-RunIISummer20UL18wmLHEGEN-00710__v1_T_240212_095605_5779
haozturk_task_BPH-GenericGSmearS-00004__v1_T_231219_141016_1739
cmsunified_task_TOP-RunIISummer20UL16wmLHEGENAPV-00739__v1_T_240810_102700_6505
ACDCs with no work left in the acdcserver (probably irrelevant, but just logging it here as well):
cmsunified_ACDC2_task_EXO-Run3Summer23wmLHEGS-00828__v1_T_240805_004657_9100
cmsunified_ACDC2_task_GEN-Run3Summer23BPixwmLHEGS-00503__v1_T_240805_002817_3223
Once this agent is fully drained, I think we can close this issue out.
And the final work in this agent finished today and I just shut it off. With that, I think we can close this issue out.