maintenance gets into trouble when (asymmetric) backend gets full
Closed this issue · 4 comments
dejonghb commented
The maintenance process frequently dumps core when the backend fills up. Kernel log:
[Fri Mar 24 15:51:36 2017] alba[26137]: segfault at 0 ip (null) sp 00007ffe9656b558 error 14 in alba[400000+8d8000]
[Fri Mar 24 15:58:16 2017] alba[1394]: segfault at 299 ip 0000000000a77e04 sp 00007fff2a838830 error 4 in alba[400000+8d8000]
[Fri Mar 24 15:59:13 2017] traps: alba[1762] general protection ip:b19fb3 sp:7ffe4fcace70 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:04:12 2017] traps: alba[1837] general protection ip:a96df1 sp:7ffc2cb44640 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:10:05 2017] traps: alba[2426] general protection ip:4180c2 sp:7ffea884ecf0 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:14:33 2017] traps: alba[2757] general protection ip:b19fb3 sp:7ffe77f7cf10 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:17:46 2017] traps: alba[3114] general protection ip:abb3dd sp:7ffe18656050 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:18:25 2017] traps: alba[3537] general protection ip:b19fb3 sp:7fff35d86370 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:18:54 2017] traps: alba[3810] general protection ip:b1e146 sp:7ffd488731b0 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:23:17 2017] traps: alba[3849] general protection ip:ab562c sp:7ffe0c43b440 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:23:57 2017] traps: alba[4084] general protection ip:abaac6 sp:7ffcae1d3070 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:24:13 2017] alba[4129]: segfault at eff8 ip 0000000000ab562c sp 00007ffef8a13e00 error 4 in alba[400000+8d8000]
[Fri Mar 24 16:26:39 2017] alba[4157]: segfault at 0 ip 0000000000a77e04 sp 00007ffcdef245c0 error 4 in alba[400000+8d8000]
[Fri Mar 24 16:30:41 2017] traps: alba[4304] general protection ip:417f3a sp:7ffec01a2260 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:31:20 2017] traps: alba[4528] general protection ip:a7b7bf sp:7ffdd61b03c8 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:32:30 2017] traps: alba[4567] general protection ip:b19fb3 sp:7ffc0946f1f0 error:0 in alba[400000+8d8000]
In this setup one storage server had one extra disk in use, which makes the setup asymmetric. I'll retest after removing that extra disk to see whether it makes a difference.
(using alba 1.3.10)
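Several instruction pointers recur in the kernel log above (b19fb3, a77e04 and ab562c each show up more than once), which hints at a small number of distinct crash sites. A minimal Python sketch for tallying them follows; the function name `crash_ips` and the regular expression are illustrative assumptions, and only the two message shapes seen in this log are handled.

```python
import re
from collections import Counter

# Matches both kernel message shapes seen in the log above:
#   "segfault at 299 ip 0000000000a77e04 sp ..."
#   "general protection ip:b19fb3 sp:..."
# A null instruction pointer is printed as "(null)".
IP_RE = re.compile(r"\bip[ :](\(null\)|[0-9a-f]+)")


def crash_ips(lines):
    """Count crashes per instruction pointer (as int; a null ip counts as 0)."""
    counts = Counter()
    for line in lines:
        m = IP_RE.search(line)
        if not m:
            continue  # not a segfault / general protection line
        raw = m.group(1)
        ip = 0 if raw == "(null)" else int(raw, 16)
        counts[ip] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "[Fri Mar 24 15:51:36 2017] alba[26137]: segfault at 0 ip (null) sp 00007ffe9656b558 error 14 in alba[400000+8d8000]",
    "[Fri Mar 24 15:58:16 2017] alba[1394]: segfault at 299 ip 0000000000a77e04 sp 00007fff2a838830 error 4 in alba[400000+8d8000]",
    "[Fri Mar 24 15:59:13 2017] traps: alba[1762] general protection ip:b19fb3 sp:7ffe4fcace70 error:0 in alba[400000+8d8000]",
    "[Fri Mar 24 16:14:33 2017] traps: alba[2757] general protection ip:b19fb3 sp:7ffe77f7cf10 error:0 in alba[400000+8d8000]",
    "[Fri Mar 24 16:26:39 2017] alba[4157]: segfault at 0 ip 0000000000a77e04 sp 00007ffcdef245c0 error 4 in alba[400000+8d8000]",
]

if __name__ == "__main__":
    for ip, n in crash_ips(sample).most_common():
        print(f"{ip:#x}: {n}")
```

Feeding the full log through `crash_ips` instead of `sample` would show which addresses are worth looking up first in the alba binary.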
dejonghb commented
After removing the extra disk, no core dumps were generated while refilling the backend.
Re-adding the extra disk reintroduces the core dumps (more frequently when new data is also being written).
dejonghb commented
root@ftcmp02:~# journalctl -u alba-maintenance_be1-AJe4J5QgWTxi0NjX.service | grep core-dump
Mar 28 10:10:51 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 10:17:08 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 10:17:28 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:28:53 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:33:25 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:36:39 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:37:09 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:38:01 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:39:29 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:44:10 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:44:43 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:47:18 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:54:14 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
root@ftcmp03:~# journalctl -u alba-maintenance_be1-6M2cKnTmOjO7TxSC.service | grep core-dump
Mar 28 09:23:14 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 09:51:40 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 09:52:05 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 10:17:30 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 10:17:49 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:37:29 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:37:43 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:37:58 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:38:32 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:42:10 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:42:42 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:42:56 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:43:12 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:43:28 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:43:47 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:44:07 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:44:38 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:46:22 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:52:11 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:52:25 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:52:39 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:53:16 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:56:31 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:56:50 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:57:08 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:57:24 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:57:42 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:58:18 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:58:38 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:59:00 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
root@ftcmp04:~# journalctl -u alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service | grep core-dump
Mar 28 11:23:05 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
Mar 28 11:24:05 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
Mar 28 11:25:40 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
Mar 28 11:26:26 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
Total fill rate over time:
- ~85% at 10:58
- ~90% at 11:25
- ~92% at 11:35
- ~93% at 11:40
- ~94% at 11:45 (disks on the first two nodes were ~98% full)
Stopped adding data at ~11:50.
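To set the crash frequency against the fill rate, the journal lines above could be bucketed per host and hour. The Python sketch below is one way to do that; the function name `core_dumps_per_host_hour` and the regular expression are assumptions written against the journalctl output format shown in this comment, not part of alba or systemd.

```python
import re
from collections import Counter

# One journal line as shown above, e.g.
# "Mar 28 10:10:51 ftcmp02 systemd[1]: <unit>: Failed with result 'core-dump'."
LINE_RE = re.compile(
    r"^\w{3} +\d+ (?P<hour>\d{2}):\d{2}:\d{2} (?P<host>\S+) systemd\[\d+\]: "
    r"\S+: Failed with result 'core-dump'\."
)


def core_dumps_per_host_hour(lines):
    """Count core-dump failures keyed by (host, hour-of-day)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # shell prompts and unrelated lines are skipped
            counts[(m.group("host"), m.group("hour"))] += 1
    return counts


# A few lines copied from the output above.
sample = [
    "root@ftcmp02:~# journalctl -u alba-maintenance_be1-AJe4J5QgWTxi0NjX.service | grep core-dump",
    "Mar 28 10:10:51 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.",
    "Mar 28 11:28:53 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.",
    "Mar 28 09:23:14 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.",
    "Mar 28 11:37:29 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.",
    "Mar 28 11:23:05 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.",
]

if __name__ == "__main__":
    for (host, hour), n in sorted(core_dumps_per_host_hour(sample).items()):
        print(f"{host} {hour}:00  {n}")
```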
dejonghb commented
Core dumps are available on 10.100.186.33:
root@ftcmp03:~# coredumpctl list
TIME PID UID GID SIG PRESENT EXE
Tue 2017-03-28 09:56:29 CEST 26020 0 0 11 /usr/bin/alba
Tue 2017-03-28 09:56:47 CEST 6304 0 0 11 /usr/bin/alba
Tue 2017-03-28 10:20:06 CEST 18311 0 0 11 /usr/bin/alba
Tue 2017-03-28 11:40:48 CEST 18464 0 0 11 /usr/bin/alba
Tue 2017-03-28 11:41:20 CEST 22791 0 0 11 /usr/bin/alba
Tue 2017-03-28 11:41:43 CEST 22523 0 0 11 /usr/bin/alba
Tue 2017-03-28 11:42:37 CEST 22628 0 0 11 /usr/bin/alba
Tue 2017-03-28 11:46:03 CEST 23071 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:47:50 CEST 25384 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:48:04 CEST 25536 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:48:53 CEST 25725 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:50:50 CEST 26045 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:55:24 CEST 29796 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:55:55 CEST 29945 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:57:11 CEST 29686 0 0 11 * /usr/bin/alba
Tue 2017-03-28 11:59:34 CEST 30221 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:00:37 CEST 31778 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:00:58 CEST 32235 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:01:01 CEST 31930 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:01:19 CEST 32115 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:01:36 CEST 32742 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:02:08 CEST 609 0 0 6 * /usr/bin/alba
Tue 2017-03-28 12:02:09 CEST 423 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:02:20 CEST 32401 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:06:00 CEST 914 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:06:28 CEST 2413 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:06:48 CEST 3097 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:06:51 CEST 2773 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:08:41 CEST 3272 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:12:33 CEST 3467 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:14:16 CEST 5762 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:16:45 CEST 5930 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:18:13 CEST 7589 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:23:17 CEST 7743 0 0 11 * /usr/bin/alba
Tue 2017-03-28 12:26:01 CEST 10868 0 0 11 * /usr/bin/alba
For PID 609 (the only SIGABRT, signal 6, in the list) the core dump has also been saved to a file named sigabrted, in case it gets rotated away:
coredumpctl dump 609 > sigabrted
It can be examined with:
gdb /usr/bin/alba sigabrted
Update (2017/03/29): most of the dumps listed above have since been rotated away.
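Since dumps get rotated away quickly, it may help to save the ones still marked present (`*`) right after a test run. The Python sketch below parses `coredumpctl list` output in the column layout shown above and builds one `coredumpctl dump` command per present entry. The helper names and the `core.<pid>` file naming are assumptions for illustration, and other systemd versions may print the list in a different layout.

```python
def parse_coredumpctl_list(lines):
    """Parse rows like 'Tue 2017-03-28 11:46:03 CEST 23071 0 0 11 * /usr/bin/alba'."""
    entries = []
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a data row.
        if len(fields) < 9 or not fields[4].isdigit():
            continue
        entries.append({
            "time": " ".join(fields[0:4]),
            "pid": int(fields[4]),
            "sig": int(fields[7]),
            # The PRESENT column is empty once a dump has been rotated away.
            "present": "*" in fields[8:-1],
            "exe": fields[-1],
        })
    return entries


def dump_commands(entries):
    """Build argv lists that would save each still-present dump to core.<pid>."""
    return [
        ["coredumpctl", f"--output=core.{e['pid']}", "dump", str(e["pid"])]
        for e in entries
        if e["present"]
    ]


# A few rows copied from the listing above.
sample = [
    "TIME PID UID GID SIG PRESENT EXE",
    "Tue 2017-03-28 11:40:48 CEST 18464 0 0 11 /usr/bin/alba",
    "Tue 2017-03-28 11:46:03 CEST 23071 0 0 11 * /usr/bin/alba",
    "Tue 2017-03-28 12:02:08 CEST 609 0 0 6 * /usr/bin/alba",
]

if __name__ == "__main__":
    for argv in dump_commands(parse_coredumpctl_list(sample)):
        print(" ".join(argv))
```

The commands are only printed here; passing each argv list to `subprocess.run` on the affected node would actually write the files.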