martinsumner/leveled

Inker Manifest - lacks protection from corruption

martinsumner opened this issue · 5 comments

The leveled_imanifest does not read the file back after writing it, before switching it to be the active file. This means that a poorly timed crash can leave an empty active manifest file:

writer(Manifest, ManSQN, RootPath) ->
ManPath = leveled_inker:filepath(RootPath, manifest_dir),
ok = filelib:ensure_dir(ManPath),
% When writing during backups, may not have been generated
NewFN = filename:join(ManPath,
integer_to_list(ManSQN) ++ "." ++ ?MANIFEST_FILEX),
TmpFN = filename:join(ManPath,
integer_to_list(ManSQN) ++ "." ++ ?PENDING_FILEX),
MBin = term_to_binary(to_list(Manifest), [compressed]),
leveled_log:log("I0016", [ManSQN]),
ok = file:write_file(TmpFN, MBin),
ok = file:rename(TmpFN, NewFN),

The rename accepts the ok as proof that the file is written. It should check by reading the file.

This can then lead to this on restart:

<0.909.0>@riak_kv_vnode:init:806 Failed to start riak_kv_leveled_backend backend for index 0 crash: {{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<>>],[]},{leveled_imanifest,reader,2,[{file,"src/leveled_imanifest.erl"},{line,160}]},{leveled_inker,build_manifest,3,[{file,"src/leveled_inker.erl"},{line,1009}]},{leveled_inker,start_from_file,1,[{file,"src/leveled_inker.erl"},{line,834}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}},[{leveled_bookie,startup,3,[{file,"src/leveled_bookie.erl"},{line,1641}]},{leveled_bookie,init,1,[{file,"src/leveled_bookie.erl"},{line,1221}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}},[{riak_kv_leveled_backend,start,2,[{file,"src/riak_kv_leveled_backend.erl"},{line,148}]},{riak_kv_vnode,init,1,[{file,"src/riak_kv_vnode.erl"},{line,763}]},{riak_core_vnode,do_init,1,[{file,"src/riak_core_vnode.erl"},{line,224}]},{riak_core_vnode,started,2,[{file,"src/riak_core_vnode.erl"},{line,207}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

This can be resolved by removing the empty file.

However, should we have greater protection against this (e.g. read before rename)? Should the handling of corruption be better automated?

Also, should the file contents be CRC checked? Should we be able to rebuild a manifest from the state of the Journal files?

Just adding the sync flag to file:write_file/3 would help - but that isn't available in R16.