openvstorage/alba

Decommissioning a broken backend takes too long

Opened this issue · 7 comments

We have two clusters, A and B. Cluster A is connected to cluster B through a global backend and an external local backend (with a 2,1,2,1 preset).
We saw that cluster B was broken, so we unlinked the external local backend from the global backend. But when we listed the OSDs on the global backend an hour later, we still saw the backend in decommissioned mode on the global proxies.

I investigated together with @domsj. The maintenance agent is not doing anything important and is not consuming many resources. What we do see is a lot of connection attempts to the old backend (connections refused, because cluster B is totally dead).

The alba version on the functional cluster A is 1.3.0. @domsj asked me to upgrade it to 1.3.2 because of improvements in how alba handles disk/data loss (https://github.com/openvstorage/alba/releases/tag/1.3.2).

After updating from alba 1.3.0 to 1.3.1, the decommissioned alba backend is gone and the proxy no longer tries to connect to the old backend.

To try to reproduce this issue with alba 1.3.1, I will recreate the situation on the current OVH setup: shut down one backend and remove it.

Please reproduce with the latest alba.

I've tried to reproduce the issue, and today we observed the following:

Steps to reproduce

  • Create a global backend
  • Add 2 local backends & 1 external local backend with policy (1, 2, 1, 3)
  • Create some vdisks and write some data to them (in my case approx. 10 GB)
  • I broke 1 external local backend (lazy umount of the ASD mountpoints)
  • I deleted the external local backend (success)
  • I checked the proxy's OSD list to see whether the OSD was gone, but after 15 min. it was still present (in decommissioned state)
  • After discussion with @domsj, we saw that the old bucket was still present in some namespaces
  • After the old bucket was gone (after 30 min.), the OSD was gone from the proxy
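
To make the 15 min. and 30 min. observations repeatable, the check in the last steps can be timed with a small polling helper. This is only a sketch: `wait_until_osd_gone` and the `list_osd_ids` callable are hypothetical names, and `list_osd_ids` is assumed to wrap whatever command or API you already use to list the OSDs known to the proxy. No alba command or flag is assumed here.

```python
import time
from typing import Callable, Iterable, Optional


def wait_until_osd_gone(
    list_osd_ids: Callable[[], Iterable[str]],  # hypothetical: returns the OSD ids the proxy lists
    osd_id: str,
    timeout_s: float = 3600.0,
    interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until osd_id is no longer listed.

    Returns the elapsed seconds at the first poll where the OSD is absent,
    or None if it is still listed once timeout_s has passed.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if osd_id not in set(list_osd_ids()):
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```

The clock and sleep functions are injectable so the helper can be exercised without real waiting. The returned duration is only accurate to within one polling interval.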

Conclusion

The maintenance agent should notify the namespace more quickly that the old bucket is gone for good.

domsj commented

Discussed this with @toolslive; we can (and will) make an improvement here in the near future.

Is that near future already over? Near future sounds like days or weeks, not 3-4 months :)

domsj commented

Sorry, I can't recall what improvements we had in mind. @toolslive, perhaps you can remember?
Looking at the release notes, I don't see it either.