Investigate behaviour of maintenance process in case there are unavailable/broken objects

Question

Closed this issue 8 years ago · 6 comments

We want to make sure that maintenance does not start spinning when objects are unavailable or broken. Retrying once every minute or so is fine, though.
If the namespace is used as a cache, maybe delete the object?
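
To make the desired behaviour concrete, here is a minimal sketch in Python of a per-object retry throttle. It is not the actual maintenance code: `RetryThrottle`, `maintenance_pass`, `ObjectUnavailable`, and the `repair`/`delete` callbacks are hypothetical names, and the 60-second interval is only the "once every minute or so" from above.

```python
import time


class ObjectUnavailable(Exception):
    """Raised by the (hypothetical) repair callback when an object cannot be repaired."""


class RetryThrottle:
    """Remembers, per object, the earliest time a failed repair may be retried."""

    def __init__(self, retry_interval=60.0, clock=time.monotonic):
        self.retry_interval = retry_interval  # assumption: roughly one minute
        self.clock = clock                    # injectable so the policy can be tested
        self._next_attempt = {}

    def due(self, object_id):
        # Objects that never failed are always due.
        next_time = self._next_attempt.get(object_id)
        return next_time is None or self.clock() >= next_time

    def failed(self, object_id):
        self._next_attempt[object_id] = self.clock() + self.retry_interval

    def succeeded(self, object_id):
        self._next_attempt.pop(object_id, None)


def maintenance_pass(object_ids, repair, throttle, is_cache=False, delete=None):
    """Run one pass and return the ids that were actually attempted."""
    attempted = []
    for object_id in object_ids:
        if not throttle.due(object_id):
            continue  # failed recently: skip instead of spinning on it
        attempted.append(object_id)
        try:
            repair(object_id)
            throttle.succeeded(object_id)
        except ObjectUnavailable:
            if is_cache and delete is not None:
                # Cache namespace: losing the object is acceptable, so drop it.
                delete(object_id)
                throttle.succeeded(object_id)
            else:
                throttle.failed(object_id)
    return attempted
```

With this policy a broken object costs at most one repair attempt per interval, no matter how often the pass itself runs. Whether deleting is acceptable for cache namespaces is still the open question above.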

Answer 1 · 2016-11-07T13:37:28.000Z

Do you have a specific test case in mind? Is this something QA needs to pick up, or just a stub for yourself?

Answer 2 · 2016-11-07T14:28:30.000Z

Shut down a few too many ASDs and see what maintenance does.
Purge a few OSDs (so many that the GUI asks you to confirm you are willing to accept data loss) and see what maintenance does.

(You need to wait at least 1 minute for maintenance to react.)

We can do this ourselves or QA could do it.
(I was thinking of doing it myself one of these days.)

Not sure yet whether I can, or want to, make this an automated test.

Answer 3 · 2016-11-24T12:46:08.000Z

Please investigate.

Answer 4 · 2016-12-05T14:31:16.000Z

Accidentally investigated this on my environment.
Had data loss due to removal of too many disks, and saw 3500+ gets/s on the ASDs. Maintenance was consuming 100%+ CPU.
After removing the namespaces that had data loss (and restarting maintenance), the activity stopped.

Answer 5 · 2016-12-06T09:33:21.000Z

Additionally, we noticed that the OSDs being purged were not yet fully removed
(nsm_client # update_manifest failed with Insufficient_fragments ... because the nsm won't tolerate data loss).

Answer 6 · 2016-12-26T13:38:31.000Z

Fixed by #532.