openvstorage/alba

Local backend is disqualified because one namespace has broken fragments

Closed this issue · 8 comments

We saw the following error in a global proxy on the OVH setup:

Jan 05 14:30:28 perf-roub-01 alba[61295]: 2017-01-05 14:30:28 923020 +0100 - perf-roub-01 - 61295/0 - alba/maintenance - 20161060 - info - Disqualifying osd 1: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "map.ml", line 122, characters 16-25; Called from file "src/conv.ml", line 190, characters 10-37

OSD 1 is in this case a local backend, and it is disqualified because one namespace could not find enough fragments. That seems too harsh on all the other namespaces, as the following example shows:

In this example we have 3 local backends with a policy of 2,1,2,1.
Namespace A is stored on all three local backends.
Namespace B is stored only on the first two backends, because backend 3 timed out and the policy requires only 2 backends to complete.
After some time, a disk breaks on a local backend and namespace A has Not Enough Fragments on backend 2, so that backend is disqualified.

--> At this point namespace B will also come to a halt, because it needs at least 2 backends to read.
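The scenario can be sketched as a toy model. This is not Alba code: the names `placement`, `usable_backends`, `is_available` and `MIN_BACKENDS` are hypothetical, and the sketch assumes a disqualified backend is skipped entirely for every namespace, which is the concern being raised here rather than confirmed behaviour.

```python
# Toy model of the example; all names are hypothetical, not part of Alba.

# Per the example: the policy needs 2 backends to complete.
MIN_BACKENDS = 2

# Which local backends hold each namespace.
placement = {
    "A": {1, 2, 3},
    "B": {1, 2},  # backend 3 timed out during the upload
}


def usable_backends(namespace, disqualified):
    """Backends that hold the namespace and are not disqualified."""
    return placement[namespace] - disqualified


def is_available(namespace, disqualified):
    """Assumption: a namespace needs MIN_BACKENDS usable backends."""
    return len(usable_backends(namespace, disqualified)) >= MIN_BACKENDS


# Backend 2 is disqualified because of a problem in namespace A only.
disqualified = {2}
for name in sorted(placement):
    print(name, sorted(usable_backends(name, disqualified)),
          is_available(name, disqualified))
```

Under that assumption, namespace A still has backends 1 and 3, while namespace B is left with backend 1 only and falls below the required count.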

@wimpers this is not of type enhancement; it is of type bug, because it can cripple your setup in an instant

domsj commented

Some remarks:

  • a disk breaking on a local backend shouldn't immediately result in NotEnoughFragments ... usually there's some redundancy on the local backends too.
  • disqualifying an osd in the proxy only results in us not using it for new uploads; downloads will still happily try to use it
  • you added the type_enhancement label yourself ;-)

So it's not that bad, but nonetheless I'll have a look at it in the near future

@domsj what do you have in mind to fix/improve this?

Based upon a read failure, you disqualify for writes but still happily try reads. Isn't this a bit strange?

domsj commented

@wimpers that is indeed a bit strange. We could change the behaviour so that only errors on write result in disqualifying the osd for new writes
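A minimal sketch of that proposed behaviour, assuming a hypothetical `OsdTracker` class that is not part of Alba: only write errors disqualify an OSD for new writes, read errors are ignored for disqualification, and reads keep trying every OSD.

```python
class OsdTracker:
    """Hypothetical tracker: only write errors disqualify an OSD for writes."""

    def __init__(self):
        self.disqualified_for_writes = set()

    def record_error(self, osd_id, operation):
        # Assumption: operation is either "read" or "write".
        if operation == "write":
            self.disqualified_for_writes.add(osd_id)
        # Read errors are deliberately not used to disqualify the OSD.

    def write_candidates(self, osd_ids):
        """OSDs that may receive new uploads."""
        return [o for o in osd_ids if o not in self.disqualified_for_writes]

    def read_candidates(self, osd_ids):
        """Downloads keep trying every OSD, disqualified or not."""
        return list(osd_ids)


tracker = OsdTracker()
tracker.record_error(1, "read")   # e.g. not enough fragments on a read
tracker.record_error(2, "write")  # a failed upload
print(tracker.write_candidates([1, 2, 3]))
print(tracker.read_candidates([1, 2, 3]))
```

In this sketch the read failure on OSD 1 leaves it eligible for new writes, while the write failure on OSD 2 removes it from the write candidates only.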

Did the fix for #737 also fix this one? Prob not?

domsj commented

No, it did not fix this one.

@domsj @toolslive is this still an issue?

Fixed in the EE version but not in the OSE version.