Inker Manifest - lacks protection from corruption
martinsumner opened this issue · 5 comments
The leveled_imanifest does not read the file back after writing it, before switching it to be the active file. This means that a poorly timed crash can leave an empty active manifest file:
leveled/src/leveled_imanifest.erl
Lines 166 to 177 in f907fb5
The rename accepts the ok
as proof that the file is written. It should check by reading the file.
This can then lead to this on restart:
<0.909.0>@riak_kv_vnode:init:806 Failed to start riak_kv_leveled_backend backend for index 0 crash: {{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<>>],[]},{leveled_imanifest,reader,2,[{file,"src/leveled_imanifest.erl"},{line,160}]},{leveled_inker,build_manifest,3,[{file,"src/leveled_inker.erl"},{line,1009}]},{leveled_inker,start_from_file,1,[{file,"src/leveled_inker.erl"},{line,834}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}},[{leveled_bookie,startup,3,[{file,"src/leveled_bookie.erl"},{line,1641}]},{leveled_bookie,init,1,[{file,"src/leveled_bookie.erl"},{line,1221}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}},[{riak_kv_leveled_backend,start,2,[{file,"src/riak_kv_leveled_backend.erl"},{line,148}]},{riak_kv_vnode,init,1,[{file,"src/riak_kv_vnode.erl"},{line,763}]},{riak_core_vnode,do_init,1,[{file,"src/riak_core_vnode.erl"},{line,224}]},{riak_core_vnode,started,2,[{file,"src/riak_core_vnode.erl"},{line,207}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
This can be resolved by removing the empty file.
However, should we have greater protection against this (e.g. read before rename)? Should the handling of corruption be better automated?
Also, should the file contents be CRC checked? Should we be able to rebuild a manifest from the state of the Journal files?
This is the standard write/rename dance used by Riak:
Just adding the sync
flag to file:write_file/3
would help - but that isn't available in R16.