openvstorage/alba

Investigate behaviour of maintenance process in case there are unavailable/broken objects

Closed this issue · 6 comments

domsj commented

We want to make sure that maintenance does not start spinning on such objects. Retrying once every minute or so is fine, though.
If the namespace is used as a cache, maybe delete the object?
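
To make the "retry once a minute instead of spinning" idea concrete, here is a minimal sketch in Python. It is not how alba's maintenance process is implemented. `RetryGate`, `maintenance_pass`, and the `repair` callback are hypothetical names invented for this illustration, and the 60-second interval is just the "every minute or so" from above.

```python
import time
from typing import Callable, Dict, Hashable, Iterable, List


class RetryGate:
    """Hypothetical helper: allow at most one retry per object per interval."""

    def __init__(self, min_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._next_attempt: Dict[Hashable, float] = {}

    def due(self, object_id: Hashable) -> bool:
        # Objects that never failed have no entry and are always due.
        return self.clock() >= self._next_attempt.get(object_id, float("-inf"))

    def record_failure(self, object_id: Hashable) -> None:
        # Push the next attempt out by the minimum interval.
        self._next_attempt[object_id] = self.clock() + self.min_interval

    def record_success(self, object_id: Hashable) -> None:
        self._next_attempt.pop(object_id, None)


def maintenance_pass(object_ids: Iterable[Hashable],
                     repair: Callable[[Hashable], None],
                     gate: RetryGate) -> List[Hashable]:
    """Attempt repair on every object that is due; return the ids attempted."""
    attempted = []
    for object_id in object_ids:
        if not gate.due(object_id):
            continue  # failed recently: skip instead of hammering the backends
        attempted.append(object_id)
        try:
            repair(object_id)
        except Exception:
            # Any repair failure (e.g. too few fragments left) starts the cooldown.
            gate.record_failure(object_id)
        else:
            gate.record_success(object_id)
    return attempted
```

With a gate like this, an unrepairable object costs one repair attempt per interval, no matter how often the maintenance loop itself runs.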

Do you have a specific test case in mind? Is this something QA needs to pick up, or just a stub for yourself?

domsj commented

shut down a few too many ASDs, see what maintenance does
purge a few OSDs (so many that the GUI asks you to confirm you're willing to accept data loss), see what maintenance does

(need to wait at least 1 minute for the reaction from maintenance)

we can do this ourselves or QA could do it
(I was thinking of doing it myself one of these days)

not sure yet if I can/want to make this an automatic test

please investigate

Accidentally investigated this on my env.
Had data loss due to removal of too many disks, and saw 3500+ gets/s on the ASDs. Maintenance was consuming 100%+ CPU.
After removing the namespaces that had data loss (and restarting maintenance), the activity stopped.

domsj commented

Additionally, we noticed that the OSDs being purged were not yet fully removed
(nsm_client # update_manifest failed with Insufficient_fragments ... because the nsm won't tolerate data loss)

domsj commented

fixed by #532