openvstorage/alba

maintenance gets into trouble when (asymmetric) backend gets full

Closed · 4 comments

Maintenance frequently dumps core when the backend fills up...

[Fri Mar 24 15:51:36 2017] alba[26137]: segfault at 0 ip           (null) sp 00007ffe9656b558 error 14 in alba[400000+8d8000]
[Fri Mar 24 15:58:16 2017] alba[1394]: segfault at 299 ip 0000000000a77e04 sp 00007fff2a838830 error 4 in alba[400000+8d8000]
[Fri Mar 24 15:59:13 2017] traps: alba[1762] general protection ip:b19fb3 sp:7ffe4fcace70 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:04:12 2017] traps: alba[1837] general protection ip:a96df1 sp:7ffc2cb44640 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:10:05 2017] traps: alba[2426] general protection ip:4180c2 sp:7ffea884ecf0 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:14:33 2017] traps: alba[2757] general protection ip:b19fb3 sp:7ffe77f7cf10 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:17:46 2017] traps: alba[3114] general protection ip:abb3dd sp:7ffe18656050 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:18:25 2017] traps: alba[3537] general protection ip:b19fb3 sp:7fff35d86370 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:18:54 2017] traps: alba[3810] general protection ip:b1e146 sp:7ffd488731b0 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:23:17 2017] traps: alba[3849] general protection ip:ab562c sp:7ffe0c43b440 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:23:57 2017] traps: alba[4084] general protection ip:abaac6 sp:7ffcae1d3070 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:24:13 2017] alba[4129]: segfault at eff8 ip 0000000000ab562c sp 00007ffef8a13e00 error 4 in alba[400000+8d8000]
[Fri Mar 24 16:26:39 2017] alba[4157]: segfault at 0 ip 0000000000a77e04 sp 00007ffcdef245c0 error 4 in alba[400000+8d8000]
[Fri Mar 24 16:30:41 2017] traps: alba[4304] general protection ip:417f3a sp:7ffec01a2260 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:31:20 2017] traps: alba[4528] general protection ip:a7b7bf sp:7ffdd61b03c8 error:0 in alba[400000+8d8000]
[Fri Mar 24 16:32:30 2017] traps: alba[4567] general protection ip:b19fb3 sp:7ffc0946f1f0 error:0 in alba[400000+8d8000]
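
Several instruction pointers recur in the kernel lines above (for example `b19fb3`). A minimal Python sketch for tallying them, so the most frequent crash sites can be looked at first. The function name `crash_ips` and the regular expression are illustrative assumptions and not part of alba:

```python
import re
from collections import Counter

# Covers both kernel message shapes seen in the log:
#   "segfault at 299 ip 0000000000a77e04 sp ..."
#   "general protection ip:b19fb3 sp:..."
IP_PATTERN = re.compile(r"\bip[: ]\s*(\(null\)|[0-9a-fA-F]+)")


def crash_ips(lines):
    """Count crashes per instruction pointer; '(null)' is recorded as 0."""
    counts = Counter()
    for line in lines:
        match = IP_PATTERN.search(line)
        if match is None:
            continue  # not a segfault/trap line
        raw = match.group(1)
        counts[0 if raw == "(null)" else int(raw, 16)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Hypothetical usage: dmesg -T | grep 'alba\[' | python3 crash_ips.py
    for ip, n in crash_ips(sys.stdin).most_common():
        print(f"{n}x ip=0x{ip:x}")
```

If the binary is not position-independent (the mapping `alba[400000+8d8000]` suggests a fixed load address, but that is an assumption), an address such as `0xa77e04` could be passed to `addr2line -f -e /usr/bin/alba` to look up the enclosing function, provided symbols are available.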

In this setup one storage server had 1 extra disk in use, making the setup asymmetric. I'll retest after removing that extra disk to see whether it has an impact.

(using alba 1.3.10)

After removing the extra disk, no core dumps were generated while refilling the backend.
Re-adding the extra disk reintroduces the core dumps (more frequently when new data is also being written).

root@ftcmp02:~# journalctl -u alba-maintenance_be1-AJe4J5QgWTxi0NjX.service | grep core-dump
Mar 28 10:10:51 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 10:17:08 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 10:17:28 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:28:53 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:33:25 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:36:39 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:37:09 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:38:01 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:39:29 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:44:10 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:44:43 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:47:18 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
Mar 28 11:54:14 ftcmp02 systemd[1]: alba-maintenance_be1-AJe4J5QgWTxi0NjX.service: Failed with result 'core-dump'.
root@ftcmp03:~# journalctl -u alba-maintenance_be1-6M2cKnTmOjO7TxSC.service | grep core-dump
Mar 28 09:23:14 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 09:51:40 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 09:52:05 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 10:17:30 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 10:17:49 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:37:29 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:37:43 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:37:58 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:38:32 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:42:10 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:42:42 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:42:56 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:43:12 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:43:28 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:43:47 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:44:07 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:44:38 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:46:22 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:52:11 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:52:25 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:52:39 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:53:16 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:56:31 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:56:50 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:57:08 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:57:24 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:57:42 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:58:18 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:58:38 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
Mar 28 11:59:00 ftcmp03 systemd[1]: alba-maintenance_be1-6M2cKnTmOjO7TxSC.service: Failed with result 'core-dump'.
root@ftcmp04:~# journalctl -u alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service | grep core-dump
Mar 28 11:23:05 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
Mar 28 11:24:05 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
Mar 28 11:25:40 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
Mar 28 11:26:26 ftcmp04 systemd[1]: alba-maintenance_be1-LZ2KXPnZU7RfkqsM.service: Failed with result 'core-dump'.
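
To compare the crash frequency per node and per hour against the fill rate, the `journalctl` output above can be tallied. A minimal Python sketch, assuming the line format shown above; `tally_core_dumps` is a hypothetical helper name:

```python
import re
from collections import Counter

# Matches lines such as:
#   Mar 28 10:10:51 ftcmp02 systemd[1]: <unit>.service: Failed with result 'core-dump'.
LINE = re.compile(
    r"^\w{3} +\d+ (\d{2}):\d{2}:\d{2} (\S+) systemd\[\d+\]: "
    r".* Failed with result 'core-dump'"
)


def tally_core_dumps(lines):
    """Return (per_host, per_host_hour) counters of core-dump failures."""
    per_host = Counter()
    per_host_hour = Counter()
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # shell prompts and other journal lines are ignored
        hour, host = match.group(1), match.group(2)
        per_host[host] += 1
        per_host_hour[(host, hour)] += 1
    return per_host, per_host_hour


if __name__ == "__main__":
    import sys

    # Hypothetical usage: journalctl -u <unit> | python3 tally.py
    hosts, hours = tally_core_dumps(sys.stdin)
    for (host, hour), n in sorted(hours.items()):
        print(f"{host} {hour}:00  {n}")
```

Bucketing by hour is only a coarse choice for this sketch; a finer bucket would line up better with the fill-rate timestamps.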

Total fill rate:

  • ~85% 10:58
  • ~90% 11:25
  • ~92% 11:35
  • ~93% 11:40
  • ~94% 11:45 (disks on the first 2 nodes ~98% full)

Stopped adding data at ~11:50.

Core dumps are available on 10.100.186.33.

root@ftcmp03:~# coredumpctl list
TIME                            PID   UID   GID SIG PRESENT EXE
Tue 2017-03-28 09:56:29 CEST  26020     0     0  11   /usr/bin/alba
Tue 2017-03-28 09:56:47 CEST   6304     0     0  11   /usr/bin/alba
Tue 2017-03-28 10:20:06 CEST  18311     0     0  11   /usr/bin/alba
Tue 2017-03-28 11:40:48 CEST  18464     0     0  11   /usr/bin/alba
Tue 2017-03-28 11:41:20 CEST  22791     0     0  11   /usr/bin/alba
Tue 2017-03-28 11:41:43 CEST  22523     0     0  11   /usr/bin/alba
Tue 2017-03-28 11:42:37 CEST  22628     0     0  11   /usr/bin/alba
Tue 2017-03-28 11:46:03 CEST  23071     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:47:50 CEST  25384     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:48:04 CEST  25536     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:48:53 CEST  25725     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:50:50 CEST  26045     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:55:24 CEST  29796     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:55:55 CEST  29945     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:57:11 CEST  29686     0     0  11 * /usr/bin/alba
Tue 2017-03-28 11:59:34 CEST  30221     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:00:37 CEST  31778     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:00:58 CEST  32235     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:01:01 CEST  31930     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:01:19 CEST  32115     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:01:36 CEST  32742     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:02:08 CEST    609     0     0   6 * /usr/bin/alba
Tue 2017-03-28 12:02:09 CEST    423     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:02:20 CEST  32401     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:06:00 CEST    914     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:06:28 CEST   2413     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:06:48 CEST   3097     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:06:51 CEST   2773     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:08:41 CEST   3272     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:12:33 CEST   3467     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:14:16 CEST   5762     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:16:45 CEST   5930     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:18:13 CEST   7589     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:23:17 CEST   7743     0     0  11 * /usr/bin/alba
Tue 2017-03-28 12:26:01 CEST  10868     0     0  11 * /usr/bin/alba
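
The listing above mixes two signals (11 and 6) and only some rows carry the `*` PRESENT marker. A small Python sketch that parses rows in this format, names the signal, and flags which dumps are still on disk. The column layout is assumed from the output shown here, and `parse_coredumpctl` is a hypothetical name:

```python
import signal


def parse_coredumpctl(lines):
    """Parse `coredumpctl list` rows; header and unrelated lines are skipped."""
    rows = []
    for line in lines:
        parts = line.split()
        # Data rows: weekday, date, time, tz, PID, UID, GID, SIG, [PRESENT], EXE
        if len(parts) not in (9, 10) or not parts[4].isdigit():
            continue
        rows.append({
            "pid": int(parts[4]),
            "signal": signal.Signals(int(parts[7])).name,  # Linux signal numbers
            "present": len(parts) == 10 and parts[8] == "*",
            "exe": parts[-1],
        })
    return rows


if __name__ == "__main__":
    import sys

    # Hypothetical usage: coredumpctl list | python3 parse_coredumpctl.py
    for row in parse_coredumpctl(sys.stdin):
        if row["present"]:
            print(row["pid"], row["signal"], row["exe"])
```

On Linux, signal 11 maps to SIGSEGV and signal 6 to SIGABRT, so pid 609 is the single abort among the segfaults.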

For pid 609 (signal 6, SIGABRT) the core dump is also saved to a file named sigabrted, in case the original is rotated away:

coredumpctl dump 609 > sigabrted

It can be examined via:

gdb /usr/bin/alba sigabrted

Update (2017/03/29): most of the dumps mentioned above have been rotated away.

domsj commented

probably fixed by #705