basho/riak

Crash during ring-resize

systream opened this issue · 2 comments

The following error occurred during ring-resize:

2022-04-03 14:54:13.020 [error] <0.22025.5>@riak_kv_put_core:count_physically_diverse:244 CRASH REPORT Process <0.22025.5> with 0 neighbours exited with reason: no match of right hand value false in riak_kv_put_core:count_physically_diverse/3 line 244 in gen_fsm:terminate/8 line 623

Version 3.0.9.

Approximately half of the PUT requests timed out.

Sorry, I don't have a helpful answer.

The ring-resizing feature was deprecated a while back. I can't find a reference to the notice, but it was never possible to do it reliably in a production environment, so it always carried a warning. I think it might now have been removed from the documentation altogether.

This looks like a logical issue with ring resizing when it is combined with node_confirms. The node_confirms feature was added after ring-resizing was deprecated, so we wouldn't have been running any ring resizing tests at that stage.

The long-term intention has been that nextgenrepl will be the answer, in that it allows replication (and reconciliation) between clusters with different ring sizes. However, nextgenrepl has had its own issues. There are a bunch of fixes coming in 3.0.10 which have made it much more stable.

I don't know what to suggest in the immediate term for you. I don't think there's an easy way of stopping PUTs going through this function without changing the code. I have no idea whether it is possible or safe to back out of a ring-resizing operation.

Okay, sorry, I did not notice that ring-resize had been deprecated.
It would be nice to have a warning message :).
The requests were eventually stored, and it was not a production cluster.
I just thought I would report it.