openvstorage/alba

Alba get-disk-safety throws a "namespace does not exist" exception on an accel backend

Closed this issue · 2 comments

The healthcheck calls the alba get-disk-safety --config ... command (without the extra namespace option) to verify the disk safety of all namespaces.

command:

namespaces = AlbaCLI.run(command='get-disk-safety', config=config)

version:

1.3.4
git_revision: "tags/1.3.4-0-gdabf9dd"
git_repo: "https://github.com/openvstorage/alba.git"
compile_time: "19/01/2017 22:18:42 UTC"
machine: "a8c9829c5800 4.4.0-36-generic x86_64 x86_64 x86_64 GNU/Linux"
model_name: "Quad-Core AMD Opteron(tm) Processor 2350"
compiler_version: "4.03.0"

The ssdBackend is an accel AlbaBackend.
Usage: 89.76% used (8.62 TB of 9.6 TB)
Could a namespace have been deleted while the command was running, i.e. after it was listed but before its disk safety was returned?
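The suspected race can be illustrated with a small in-memory model. This is a hypothetical Python sketch, not ALBA code: NamespaceDoesNotExist, FakeAlbaManager, get_disk_safety and the on_listed hook are made-up names, and the safety values are placeholders. What it shows is that listing the namespaces and then querying each one are separate steps, so a namespace deleted in between makes the second step fail.

```python
from typing import Callable, Dict, List, Optional


class NamespaceDoesNotExist(Exception):
    """Hypothetical stand-in for Albamgr_protocol.Protocol.Error.Namespace_does_not_exist."""


class FakeAlbaManager:
    """Hypothetical in-memory namespace registry; not the real Albamgr API."""

    def __init__(self, names: List[str]) -> None:
        self._safety = {name: 2 for name in names}  # placeholder safety values

    def list_namespaces(self) -> List[str]:
        return list(self._safety)

    def delete_namespace(self, name: str) -> None:
        del self._safety[name]

    def namespace_safety(self, name: str) -> int:
        if name not in self._safety:
            raise NamespaceDoesNotExist(name)
        return self._safety[name]


def get_disk_safety(mgr: FakeAlbaManager,
                    on_listed: Optional[Callable[[], None]] = None) -> Dict[str, int]:
    # Step 1: snapshot the namespace list (no namespaces were given on the CLI).
    names = mgr.list_namespaces()
    if on_listed is not None:
        on_listed()  # simulates a concurrent delete landing between the two steps
    # Step 2: query each listed namespace; a name deleted after step 1 raises.
    return {name: mgr.namespace_safety(name) for name in names}


if __name__ == "__main__":
    mgr = FakeAlbaManager(["ns-a", "ns-b"])
    try:
        get_disk_safety(mgr, on_listed=lambda: mgr.delete_namespace("ns-b"))
    except NamespaceDoesNotExist as exc:
        # prints: get-disk-safety failed for namespace ns-b
        print("get-disk-safety failed for namespace", exc)
```

Each individual call is consistent on its own; the failure comes only from the gap between the list and the per-namespace queries.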

Error:

Could not fetch alba information for backend ssdBackend Message: Command 'get-disk-safety' failed with 'Albamgr exception(Albamgr_protocol.Protocol.Error.Namespace_does_not_exist,Albamgr_protocol.Protocol.Error.Namespace_does_not_exist)'.
domsj commented

I've looked at the code and couldn't immediately find a race condition (caused by a concurrent namespace delete) that would explain this.
There is another error that such a race could trigger, though, so there's at least one improvement we can make to the code. I need to investigate the code further to find out what actually happened.

domsj commented

Looking into it some more, I do see the race now.
I think the error comes from disk_safety.ml:L43 -> nsm_host_access.ml:L236.
We probably want to catch it in disk_safety.ml:L104.

The race I mentioned in my previous comment is in alba.ml:L584; that code shouldn't throw when no namespaces are passed in on the CLI.
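The two suggestions in this comment can be sketched together. This is a minimal Python model under stated assumptions, not the actual OCaml in disk_safety.ml or alba.ml: NamespaceDoesNotExist, disk_safety_report, list_namespaces and fetch_safety are hypothetical names. The assumed behaviour is that a namespace deleted between listing and querying is skipped when no namespaces were given on the CLI, while an explicitly requested namespace that does not exist is still reported as an error.

```python
from typing import Callable, Dict, List, Optional


class NamespaceDoesNotExist(Exception):
    """Hypothetical stand-in for Albamgr_protocol.Protocol.Error.Namespace_does_not_exist."""


def disk_safety_report(
    requested: Optional[List[str]],
    list_namespaces: Callable[[], List[str]],
    fetch_safety: Callable[[str], int],
) -> Dict[str, int]:
    # Namespaces passed explicitly on the CLI must exist; otherwise report on
    # whatever the manager lists at this moment.
    explicit = bool(requested)
    names = list(requested) if explicit else list_namespaces()

    report: Dict[str, int] = {}
    for name in names:
        try:
            report[name] = fetch_safety(name)
        except NamespaceDoesNotExist:
            if explicit:
                raise  # the caller asked for this namespace by name
            # Implicit "all namespaces" run: the namespace was deleted after
            # it was listed, so skip it instead of failing the whole command.
    return report


if __name__ == "__main__":
    live = {"ns-a": 2}
    listed = ["ns-a", "ns-b"]  # "ns-b" was deleted after it was listed

    def fetch(name: str) -> int:
        if name not in live:
            raise NamespaceDoesNotExist(name)
        return live[name]

    print(disk_safety_report(None, lambda: listed, fetch))  # {'ns-a': 2}
```

Keeping the strict behaviour for explicitly named namespaces means a typo on the CLI still surfaces as an error, while the healthcheck's "all namespaces" call no longer fails because of an unrelated delete.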