openvstorage/alba

Segfault errors in dmesg for ALBA

Closed this issue · 3 comments

Alba version: 1.3.4

Alba crashed with the following errors in dmesg:

[Sat Feb 11 23:51:00 2017] alba[12711]: segfault at 8 ip 00000000007394e4 sp 00007ffc37e3f680 error 4 in alba[400000+be4000]
[Sat Feb 11 23:53:25 2017] alba[4451]: segfault at 0 ip 000000000088459f sp 00007fffc01e9010 error 4 in alba[400000+be4000]
[Sat Feb 11 23:53:37 2017] alba[5509]: segfault at 0 ip 000000000088459f sp 00007fff00aa6040 error 4 in alba[400000+be4000]
[Sat Feb 11 23:53:48 2017] alba[6066]: segfault at fffffffffffffff8 ip 0000000000dc562c sp 00007ffc1e088690 error 5 in alba[400000+be4000]
[Sat Feb 11 23:54:00 2017] alba[6269]: segfault at 0 ip 000000000088459f sp 00007ffe2b33e1d0 error 4 in alba[400000+be4000]
[Sat Feb 11 23:54:11 2017] alba[6473]: segfault at 68c3 ip 0000000000a70a3e sp 00007ffe9fe6ac60 error 4 in alba[400000+be4000]
[Sat Feb 11 23:54:23 2017] alba[7349]: segfault at fffffffffffffff8 ip 0000000000dc562c sp 00007ffdabdc4890 error 5 in alba[400000+be4000]
[Sat Feb 11 23:54:35 2017] alba[7759]: segfault at 8 ip 00000000007380eb sp 00007ffc95a83920 error 4 in alba[400000+be4000]
[Sat Feb 11 23:54:47 2017] alba[7827]: segfault at 0 ip 000000000088459f sp 00007ffce59e9510 error 4 in alba[400000+be4000]
[Sat Feb 11 23:55:00 2017] alba[8038]: segfault at 0 ip           (null) sp 00007ffd11415278 error 14 in alba[400000+be4000]
[Sat Feb 11 23:55:12 2017] alba[8467]: segfault at 8 ip 00000000009582bd sp 00007fff92d41a40 error 4 in alba[400000+be4000]
[Sat Feb 11 23:55:24 2017] alba[9018]: segfault at 8 ip 00000000007380eb sp 00007ffe6603d110 error 4 in alba[400000+be4000]
[Sat Feb 11 23:55:35 2017] alba[9248]: segfault at 0 ip           (null) sp 00007ffedeb0d988 error 14 in alba[400000+be4000]
[Sat Feb 11 23:55:59 2017] alba[10245]: segfault at 0 ip 000000000088459f sp 00007ffca4f9b080 error 4 in alba[400000+be4000]
[Sat Feb 11 23:56:10 2017] alba[10361]: segfault at 0 ip           (null) sp 00007ffdc29a2838 error 14 in alba[400000+be4000]
[Sat Feb 11 23:56:22 2017] alba[10795]: segfault at 40 ip 00000000007382e7 sp 00007ffea17b6198 error 4 in alba[400000+be4000]
[Sat Feb 11 23:56:33 2017] alba[11615]: segfault at 8 ip 00000000007394e4 sp 00007ffe34291e00 error 4 in alba[400000+be4000]
[Sat Feb 11 23:56:45 2017] alba[11754]: segfault at 7eabe ip 0000000000e2a560 sp 00007ffc4c67eaf0 error 4 in alba[400000+be4000]
...
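
For triage, the lines above can be grouped by faulting instruction pointer to see how many distinct crash sites are involved. Below is a minimal sketch in Python, assuming the log format shown above. The names `parse`, `decode_error` and `tally_by_ip` are made up for this example and are not part of alba. The kernel prints the `error` field in hex, so `error 14` is 0x14 (a user-mode instruction fetch from an unmapped page, which fits the `(null)` ip) and `error 4` is a user-mode read of an unmapped page.

```python
import re
from collections import Counter

# Matches the kernel segfault report as it appears in the dmesg excerpt above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip\s+(?P<ip>[0-9a-f]+|\(null\)) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+)"
)

# Three lines copied verbatim from the excerpt, plus one (null) ip case.
SAMPLE = [
    "[Sat Feb 11 23:51:00 2017] alba[12711]: segfault at 8 ip 00000000007394e4 sp 00007ffc37e3f680 error 4 in alba[400000+be4000]",
    "[Sat Feb 11 23:53:25 2017] alba[4451]: segfault at 0 ip 000000000088459f sp 00007fffc01e9010 error 4 in alba[400000+be4000]",
    "[Sat Feb 11 23:53:37 2017] alba[5509]: segfault at 0 ip 000000000088459f sp 00007fff00aa6040 error 4 in alba[400000+be4000]",
    "[Sat Feb 11 23:55:00 2017] alba[8038]: segfault at 0 ip           (null) sp 00007ffd11415278 error 14 in alba[400000+be4000]",
]


def decode_error(code):
    """Describe an x86 page-fault error code (already converted from hex)."""
    parts = [
        "protection violation" if code & 0x01 else "page not present",
        "write" if code & 0x02 else "read",
        "user mode" if code & 0x04 else "kernel mode",
    ]
    if code & 0x08:
        parts.append("reserved bit set")
    if code & 0x10:
        parts.append("instruction fetch")
    return ", ".join(parts)


def parse(line):
    """Return the fields of one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = m.group("ip")
    return {
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "ip": None if ip == "(null)" else int(ip, 16),
        "error": int(m.group("err"), 16),  # the kernel prints this in hex
    }


def tally_by_ip(lines):
    """Count segfaults per faulting instruction pointer (None for a null ip)."""
    counts = Counter()
    for line in lines:
        rec = parse(line)
        if rec is not None:
            counts[rec["ip"]] += 1
    return counts


if __name__ == "__main__":
    for ip, n in tally_by_ip(SAMPLE).most_common():
        print("null" if ip is None else hex(ip), n)
```

Feeding it the full `dmesg` output instead of `SAMPLE` gives one count per crash site. The ip of the first log line, 0x7394e4, is the same address as frame #0 (`caml_apply2`) in the stor-03 backtrace further down.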

The crash file on stor-03 was written in the same time slot as the dmesg messages above, which are also from stor-03.

root@stor-03:~# ls alba_crash/ -alh
total 204M
drwxr-xr-x 1 root root   58 Feb 13 15:03 .
drwx------ 1 root root 1.3K Feb 13 15:00 ..
-rw-r----- 1 root root 204M Feb 11 23:53 _usr_bin_alba.0.crash_stor-03_11022017

Because of the apparmor bug (existing crash files are not overwritten), the other 3 nodes have no crash file from the above time slot. I've captured their crash files as well, but those date from the morning of 08/02/2017.

gig@be-g8-4-ctrl01:~/alba_crash$ ls -alh
total 911M
drwxrwxr-x 1 gig gig  304 Feb 13 15:13 .
drwxr-xr-x 1 gig gig 4.4K Feb 13 15:10 ..
-rw-r----- 1 gig gig 290M Feb 13 15:11 _usr_bin_alba.0.crash_stor-01_08022017
-rw-r----- 1 gig gig 217M Feb 13 15:12 _usr_bin_alba.0.crash_stor-02_08022017
-rw-r----- 1 gig gig 204M Feb 13 15:12 _usr_bin_alba.0.crash_stor-03_11022017
-rw-r----- 1 gig gig 202M Feb 13 15:12 _usr_bin_alba.0.crash_stor-04_08022017

You can find the crash files on the fileserver.

On stor-03 (core is 3.1GB)

Core was generated by `/usr/bin/alba maintenance --config arakoon://config/ovs/alba/backends/c70f8dc2-'.
...
(gdb) bt
#0  0x00000000007394e4 in caml_apply2 ()
#1  0x00000000008c643b in camlRange_query_args__to_buffer_3523 () at src/range_query_args.ml:30
#2  0x0000000000897bd5 in camlAlbamgr_client__do_request_6293 () at src/albamgr_client.ml:636
#3  0x000000000084f69f in camlLwt_pool2__fun_1488 () at src/tools/lwt_pool2.ml:73
#4  0x0000000000d8bb2c in camlLwt__catch_8694 () at src/core/lwt.ml:686
#5  0x0000000000d8c0a4 in camlLwt__try_bind_13734 () at src/core/lwt.ml:769
#6  0x0000000000899941 in camlAlbamgr_client__fun_10581 () at src/albamgr_client.ml:292
#7  0x0000000000812ee4 in camlOsd_access__refresh_osd_info_7620 () at src/osd_access.ml:321
#8  0x0000000000810e96 in camlOsd_access__fun_16445 () at src/osd_access.ml:383
#9  0x0000000000d889b6 in camlLwt__fun_32579 () at src/core/lwt.ml:698
#10 0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#11 0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#12 0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#13 0x0000000000d8a57c in camlLwt__safe_run_waiters_1432 () at src/core/lwt.ml:299
#14 0x0000000000d6c7c0 in camlLwt_unix__fun_7954 () at src/unix/lwt_unix.ml:214
#15 0x0000000000d887c6 in camlLwt__fun_32562 () at src/core/lwt.ml:653
#16 0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#17 0x0000000000d8a57c in camlLwt__safe_run_waiters_1432 () at src/core/lwt.ml:299
#18 0x0000000000d8795d in camlLwt_sequence__loop_1257 () at src/core/lwt_sequence.ml:149
#19 0x0000000000d6b7f4 in camlLwt_main__run_1324 () at src/unix/lwt_main.ml:43
#20 0x0000000000a0260c in camlCmdliner__eval_term_33714 () at src/cmdliner.ml:1350
#21 0x0000000000a02b57 in camlCmdliner__eval_choice_34756 () at src/cmdliner.ml:1390
#22 0x000000000073fe0d in camlAlba__entry () at src/alba.ml:913
#23 0x000000000072e099 in caml_program ()
#24 0x0000000000e4189e in caml_start_program ()
#25 0x0000000000e2939d in caml_main ()
#26 0x000000000072b13c in main ()

On stor-04 the backtrace looks completely different:

(gdb) bt
#0  0x000000000087c8ae in camlBytes_descr__create_3572 () at src/tools/bytes_descr.ml:40
#1  0x00000000008573ea in camlCompressors__fun_4516 () at src/tools/compressors.ml:256
#2  0x0000000000843b2f in camlRecovery_info__t$27_to_t_6297 () at src/recovery_info.ml:117
#3  0x0000000000843ca3 in camlRecovery_info__make_6613 () at src/recovery_info.ml:158
#4  0x00000000008398ff in camlMaintenance_helper__fun_8948 () at src/maintenance_helper.ml:137
#5  0x0000000000d887c6 in camlLwt__fun_32562 () at src/core/lwt.ml:653
#6  0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#7  0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#8  0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#9  0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#10 0x0000000000d8a297 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#11 0x0000000000d8a57c in camlLwt__safe_run_waiters_1432 () at src/core/lwt.ml:299
#12 0x0000000000da9bac in camlArray__iter_1247 () at array.ml:80
#13 0x0000000000e4189e in caml_start_program ()
#14 0x0000000000e3d519 in caml_callback ()
#15 0x00007f02de019d73 in ev_invoke_pending () from /usr/lib/x86_64-linux-gnu/libev.so.4
#16 0x0000000000e1b304 in lwt_libev_loop (val_loop=<optimized out>, val_block=<optimized out>) at src/unix/lwt_libev_stubs.c:98
#17 0x0000000000d69429 in camlLwt_engine__fun_2546 () at src/unix/lwt_engine.ml:151
#18 0x0000000000d6b7e8 in camlLwt_main__run_1324 () at src/unix/lwt_main.ml:41
#19 0x0000000000a0260c in camlCmdliner__eval_term_33714 () at src/cmdliner.ml:1350
#20 0x0000000000a02b57 in camlCmdliner__eval_choice_34756 () at src/cmdliner.ml:1390
#21 0x000000000073fe0d in camlAlba__entry () at src/alba.ml:913
#22 0x000000000072e099 in caml_program ()
#23 0x0000000000e4189e in caml_start_program ()
#24 0x0000000000e2939d in caml_main ()
#25 0x000000000072b13c in main ()

In all cases, it's the maintenance process.
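
To compare the two backtraces at the module level, the mangled frame names can be mapped back to `Module.function`. This is a rough sketch that assumes the native-code symbol scheme visible in the traces above (`caml` + module + `__` + name + `_` + numeric stamp, with `$xx` as a hex escape for characters such as `'`). The `demangle` helper is hypothetical, and it will split at the wrong place if a module name itself contains `__`.

```python
import re

# caml<Module>__<name>_<stamp>, as seen in the gdb frames above.
_SYMBOL_RE = re.compile(
    r"^caml(?P<module>[A-Z][A-Za-z0-9_$]*?)__(?P<name>.+?)_(?P<stamp>\d+)$"
)


def _unescape(text):
    """Turn $xx hex escapes back into the original character ($27 -> ')."""
    return re.sub(r"\$([0-9a-f]{2})", lambda m: chr(int(m.group(1), 16)), text)


def demangle(symbol):
    """Return 'Module.name' for a mangled OCaml symbol, else the input unchanged."""
    m = _SYMBOL_RE.match(symbol)
    if m is None:
        return symbol  # runtime symbols such as caml_apply2 are left as-is
    return _unescape(m.group("module")) + "." + _unescape(m.group("name"))


if __name__ == "__main__":
    for sym in (
        "camlRange_query_args__to_buffer_3523",
        "camlRecovery_info__t$27_to_t_6297",
        "caml_apply2",
    ):
        print(sym, "->", demangle(sym))
```

Anonymous functions come out as `Module.fun` (for example `camlLwt__fun_32579` becomes `Lwt.fun`), so for those frames the file and line number in the trace remain the more useful pointer.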

domsj commented

Probably fixed by #705.