Local backend is disqualified because one namespace has broken fragments
Closed this issue · 8 comments
We saw the following error in a global proxy on the OVH setup:
Jan 05 14:30:28 perf-roub-01 alba[61295]: 2017-01-05 14:30:28 923020 +0100 - perf-roub-01 - 61295/0 - alba/maintenance - 20161060 - info - Disqualifying osd 1: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "map.ml", line 122, characters 16-25; Called from file "src/conv.ml", line 190, characters 10-37
OSD 1 is in this case a local backend, and it is disqualified because one namespace did not find enough fragments. This may be too painful for all the other namespaces, as the following example shows:
In this example we have 3 local backends with a policy of 2,1,2,1.
We have namespace A, which is stored on all local backends.
We have namespace B, which is stored only on the first two backends, because backend 3 timed out and the policy requires only 2 backends to complete.
After some time, a disk breaks on a local backend and namespace A has NotEnoughFragments on backend 2, so that backend is disqualified.
--> At this point namespace B will also come to a halt, because it needs at least 2 backends to read.
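The scenario above can be sketched in a few lines of Python. This is only an illustration of the reporter's reasoning, not Alba code: the names `placement`, `MIN_BACKENDS`, `usable_backends` and `is_available` are hypothetical, and the sketch assumes that a disqualified backend is skipped entirely (a later comment in this thread says disqualification only affects new uploads).

```python
# Hypothetical model of the example above; none of these names come from Alba.

# Assumption: the reporter's reading of the policy, i.e. a namespace needs
# at least 2 usable backends.
MIN_BACKENDS = 2

placement = {
    "A": {1, 2, 3},  # namespace A is stored on all three local backends
    "B": {1, 2},     # namespace B skipped backend 3 because of a timeout
}


def usable_backends(namespace, disqualified):
    """Backends that hold the namespace and are not disqualified."""
    return placement[namespace] - disqualified


def is_available(namespace, disqualified):
    """True if enough usable backends remain for the namespace."""
    return len(usable_backends(namespace, disqualified)) >= MIN_BACKENDS


# Backend 2 is disqualified because of namespace A's missing fragments.
disqualified = {2}
for ns in sorted(placement):
    print(ns, sorted(usable_backends(ns, disqualified)), is_available(ns, disqualified))
```

Under these assumptions namespace A keeps two usable backends, while namespace B is left with one, although the broken fragments belonged to A.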
@wimpers this is not of type enhancement, this is of type bug because this can cripple your setup in an instant
Some remarks:
- a disk breaking on a local backend shouldn't immediately result in NotEnoughFragments ... usually there's some redundancy on the local backends too.
- disqualifying of an osd in the proxy only results in us not using it for new uploads, downloads will still happily try to use it
- you added the type_enhancement label yourself ;-)
So it's not that bad, but nonetheless I'll have a look at it in the near future
@domsj what do you have in mind to fix/improve this?
Based upon a read failure, you disqualify for writes but still happily try reads. Isn't this a bit strange?
@wimpers that is indeed a bit strange. We could change the behaviour so that only errors on write result in disqualifying the osd for new writes
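To make the difference concrete, here is a minimal sketch of the behaviour discussed in this thread versus the proposed change. It is not taken from the Alba proxy: the class `OsdTracker`, its methods and the `disqualify_on_read_error` flag are all hypothetical names used only for illustration.

```python
class OsdTracker:
    """Hypothetical bookkeeping of disqualified OSDs; not Alba's proxy code."""

    def __init__(self, disqualify_on_read_error):
        # True  models the behaviour described in this thread
        #       (a read failure also disqualifies the OSD for new writes).
        # False models the proposed change
        #       (only write errors disqualify the OSD for new writes).
        self.disqualify_on_read_error = disqualify_on_read_error
        self.disqualified = set()

    def report_error(self, osd_id, operation):
        """Record an error; operation is "read" or "write"."""
        if operation == "write" or self.disqualify_on_read_error:
            self.disqualified.add(osd_id)

    def usable_for_write(self, osd_id):
        return osd_id not in self.disqualified

    def usable_for_read(self, osd_id):
        # Per the remarks above, downloads still try a disqualified OSD.
        return True


current = OsdTracker(disqualify_on_read_error=True)
current.report_error(1, "read")
print(current.usable_for_write(1), current.usable_for_read(1))

proposed = OsdTracker(disqualify_on_read_error=False)
proposed.report_error(1, "read")
print(proposed.usable_for_write(1), proposed.usable_for_read(1))
```

In this sketch a read failure on OSD 1 blocks new writes to it only when `disqualify_on_read_error` is set; with the proposed setting the OSD stays usable for writes until a write itself fails.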
no it did not also fix this one
@domsj @toolslive is this still an issue?
Fixed in EE version but not in OSE version.