openvstorage/alba

Repair namespace

Opened this issue · 10 comments

Is it possible to implement an option to repair a single namespace? Or an option where you can ask the maintenance agent to start repairing all namespaces below disk safety <n>?

Via the healthcheck we get an error message when a namespace is at disk safety 0, and a warning message when the disk safety is greater than 0 but smaller than the maximum disk safety.
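For reference, those thresholds amount to something like the following (a minimal sketch with hypothetical names, not the actual healthcheck code):

```ocaml
(* Hypothetical sketch of the healthcheck classification described
   above; names and types are assumptions, not the real code. *)
type status = Ok | Warning | Error

let disk_safety_status ~max_safety safety =
  if safety <= 0 then Error                (* disk safety 0: error *)
  else if safety < max_safety then Warning (* 0 < safety < max: warning *)
  else Ok
```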

@domsj doesn't the maintenance agent repair the objects with the lowest disk safety first automatically?

domsj commented

@wimpers yes it does so, within the context of a namespace, for all namespaces in parallel.
It doesn't prioritize repair of the objects with the lowest disk safety globally (across all namespaces).
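For illustration, that per-namespace behaviour boils down to something like this (a minimal sketch with assumed names and types, not ALBA's actual code):

```ocaml
(* Sketch only, all names are assumptions: within one namespace,
   repair the objects with the lowest disk safety first. [objects]
   is a list of (object_id, disk_safety) pairs; [target] is the
   safety the preset demands; [repair] does the actual work. *)
let repair_weakest_first ~target ~repair objects =
  objects
  |> List.filter (fun (_, safety) -> safety < target)
  |> List.sort (fun (_, s1) (_, s2) -> compare s1 s2)
  |> List.iter (fun (object_id, _) -> repair object_id)
```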

Please implement a CLI command to repair a certain namespace. I don't think this should be part of the maintenance process.

domsj commented

Just to be sure: you think data safety is not a main concern for alba, and that this responsibility should be shifted towards the user (or the framework)?

Let me clarify,

The maintenance agent should be adjusted so it repairs the objects with the lowest safety globally first.
The maintenance agent should not provide functionality to fix a certain namespace.
A CLI command needs to be provided so admins can fix a certain namespace.

@domsj does that clear things up ...

Some things are so important you don't want to leave them to admins.
If a namespace is more important than another namespace, then its preset should reflect that.
If there are 2 objects in subawesome shape, it's their safety that tells you which one should get priority, and not their namespace_id.
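In other words, a repair queue should order purely on safety; a sketch with hypothetical types:

```ocaml
(* Hypothetical types, not ALBA's: two objects that need repair
   compare by disk safety only; namespace_id plays no role. *)
type to_repair = { namespace_id : int; object_id : string; safety : int }

let repair_priority a b = compare a.safety b.safety
```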

domsj commented

@wimpers ok, so you want 2 changes. so split up the ticket?

regarding the first change (repair globally weakest object first):
Can you settle for "a maintenance process repairs the weakest object that it is responsible for"?
Currently, maintenance work is sharded over the available maintenance processes based on the namespace_id modulo the number of processes.
(I don't see yet how to efficiently implement repairing the globally weakest object in combination with sharding the maintenance work, hence the proposed alternative.)
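The sharding rule, in sketch form (assumed shape only; the real logic lives in the maintenance process):

```ocaml
(* Assumed shape of the sharding described above: with [n_workers]
   maintenance processes, the process with index [worker_id] owns a
   namespace iff namespace_id mod n_workers = worker_id. "Repairs the
   weakest object it is responsible for" then means: the weakest
   object across the namespaces this worker owns. *)
let is_responsible ~worker_id ~n_workers namespace_id =
  namespace_id mod n_workers = worker_id

let owned_namespaces ~worker_id ~n_workers namespace_ids =
  List.filter (is_responsible ~worker_id ~n_workers) namespace_ids
```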

regarding the second change (repair a certain namespace using the cli):
what is the point of having this? especially if we were to implement the first change?
(I think this one isn't a lot of work though)

Can you settle for "a maintenance process repairs the weakest object that it is responsible for"?

Yes, let's do that in this ticket.

regarding the second change (repair a certain namespace using the cli):
what is the point of having this? especially if we were to implement the first change?

I can imagine a case where OPS wants to fix a certain namespace because it has higher importance (remember we don't allow moving vDisks between vPools, and the importance may change). @jeroenmaelbrancke @matthiasdeblock @jtorreke please advise.

Possible case

Last week, after adding a huge number of ASDs to a certain backend, we had to stop the maintenance agent for that backend because of the potentially high load of the rebalance.
Because an ASD VM went down (broken SD card), a namespace lost a fragment and ended up in an 'only 15 of the 16 fragments available' state.
If we get into a state where we have to stop the maintenance agent for a certain backend, it could come in handy to repair a single namespace by hand, deploying a single maintenance process just to repair that namespace.
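Something like this one-off loop, deployed as its own process, would cover that case (a hypothetical sketch, not an existing ALBA command):

```ocaml
(* Hypothetical single-namespace repair loop; [fetch_weak] lists the
   objects in the namespace that are below their target safety and
   [repair] fixes one object. The loop exits once the namespace is
   healthy again. None of these names exist in ALBA. *)
let rec repair_single_namespace ~fetch_weak ~repair ~namespace_id () =
  match fetch_weak namespace_id with
  | [] -> ()  (* nothing below target safety: namespace is healthy *)
  | objects ->
     List.iter repair objects;
     repair_single_namespace ~fetch_weak ~repair ~namespace_id ()
```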