deleted namespaces blow up maintenance processes.
On an environment with more than 1 million deleted namespaces (don't ask),
maintenance suffers all kinds of problems.
The most pressing issue is that maintenance blows up on PropagatePreset, pulling abms and nsms under with it.
Unfortunately, it's a combination of things:
PropagatePreset makes the maintenance agent come here:
https://github.com/openvstorage/alba/blob/1.3.19/ocaml/src/alba_client_preset_cache.ml#L25
    | Some (version', namespace_ids) ->
       Lwt_list.map_p
         (fun namespace_id ->
           Lwt.catch
             (fun () ->
               nsm_host_access # get_namespace_info ~namespace_id >>= fun (ns_info, _, _) -> ...
On this environment, namespace_ids contains 130k items. You can check it via:
nuvolat@NY2SRV0001:~/romain$ ./alba.native get-presets-propagation-state --preset \
global_no_encrypt \
--config arakoon://config/ovs/arakoon/ny2-hddbackend02-abm/config?ini=%2Fmnt%2Fssd1%2Farakoon%2Fexternal_arakoon_cacc.ini
2017-07-20 07:03:27 337511 -0400 - NY2SRV0001 - 29962/0000 - alba/cli - 0 - info - Albamgr_client.make_client :ny2-hddbackend02-abm
2017-07-20 07:03:27 338466 -0400 - NY2SRV0001 - 29962/0000 - alba/cli - 1 - info - connect_with : 172.17.23.7 26408 None Net_fd.TCP (fd:3)
2017-07-20 07:03:27 531782 -0400 - NY2SRV0001 - 29962/0000 - alba/cli - 2 - info - prop_states = [(Some (3L,;
[1006819L; 1006818L; 1006817L; 1006816L; 1006815L; 1006814L;;
1006813L; 1006812L; 1006811L; 1006810L; 1006809L; 1006808L;;
...
Most of the items point to namespaces that no longer exist. Since the namespaces don't exist, they are not in the cache, so the arakoon is contacted for every one of them. If it doesn't succumb under the load, it will confirm that these don't exist. If you turn on debug, you see an excess of mgr # list_namespaces_by_id to fill up the namespace info cache, and later (because this doesn't help as the namespaces don't exist) mgr # get_namespace_by_id ~namespace_id.
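To make the load pattern concrete, here is a minimal sketch (not the alba code) of how many lookups are in flight at once with Lwt_list.map_p versus Lwt_list.map_s. It assumes the lwt and lwt.unix libraries are available; fake_lookup, in_flight, peak and peak_of are hypothetical names that stand in for the real remote call and only count concurrency.

```ocaml
(* Requires the lwt and lwt.unix libraries. *)
open Lwt.Infix

(* Hypothetical stand-in for a remote lookup such as
   nsm_host_access # get_namespace_info. It does no I/O and only
   tracks how many calls are in flight at the same time. *)
let in_flight = ref 0
let peak = ref 0

let fake_lookup (namespace_id : int64) : int64 Lwt.t =
  incr in_flight;
  if !in_flight > !peak then peak := !in_flight;
  (* Yield to the scheduler, as a real network call would. *)
  Lwt.pause () >>= fun () ->
  decr in_flight;
  Lwt.return namespace_id

(* Run [map] over [ids] and return the peak number of concurrent lookups. *)
let peak_of map ids =
  in_flight := 0;
  peak := 0;
  let (_ : int64 list) = Lwt_main.run (map fake_lookup ids) in
  !peak

(* 1000 ids keep the demo quick; the environment above had ~130k. *)
let ids = List.init 1000 Int64.of_int

let peak_p = peak_of Lwt_list.map_p ids  (* every lookup is started up front *)
let peak_s = peak_of Lwt_list.map_s ids  (* one lookup at a time *)

let () = Printf.printf "map_p peak: %d, map_s peak: %d\n" peak_p peak_s
```

With map_p the peak equals the length of the list, so 130k ids would mean roughly 130k requests outstanding against the arakoon at once. With map_s the peak is 1, at the cost of a slower run.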
I patched an alba maintenance agent with
- a primitive client side dedupe of tasks
- replacing the Lwt_list.map_p and Lwt_list.iter_p with _s in the preset_cache
- a hackish dedupe of method refresh ~preset_name to make sure there's only 1 PropagatePreset going on at any point,
but it's ugly, and done under pressure (branch on EE repo), while trying to reduce the number of maintenance tasks on that environment (which was > 7e6 at some point in time). It needs to be revisited.
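For reference, a primitive client side dedupe of tasks could look roughly like the sketch below. This is not the patch from the EE branch: the task type is a hypothetical simplification (only PropagatePreset is taken from this ticket, CleanupNamespace is invented for illustration), and it assumes two tasks are duplicates when they are structurally equal.

```ocaml
(* Hypothetical task type; the real maintenance work items differ. *)
type task =
  | PropagatePreset of string   (* preset name *)
  | CleanupNamespace of int64   (* namespace id, invented for this sketch *)

(* Keep the first occurrence of every task and preserve the order. *)
let dedupe (tasks : task list) : task list =
  let seen = Hashtbl.create 1024 in
  let kept =
    List.fold_left
      (fun acc t ->
        if Hashtbl.mem seen t then acc
        else begin
          Hashtbl.add seen t ();
          t :: acc
        end)
      [] tasks
  in
  List.rev kept

let queue =
  [ PropagatePreset "global_no_encrypt";
    CleanupNamespace 1006819L;
    PropagatePreset "global_no_encrypt";
    CleanupNamespace 1006819L;
    CleanupNamespace 1006818L ]

let unique = dedupe queue

let () =
  Printf.printf "%d tasks -> %d unique\n"
    (List.length queue) (List.length unique)
```

The same idea applies to refresh ~preset_name: remember which preset names already have a PropagatePreset pending and drop the repeats before they reach the arakoon.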
If we're merely fighting symptoms, we can also contemplate a bloom filter for the deleted namespaces ...
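A rough sketch of what such a bloom filter could look like, using only the OCaml standard library. Nothing like this exists in alba as far as this ticket shows; the type bloom, the functions create, add and mem, and the sizes m and k are all assumptions picked for illustration.

```ocaml
(* Minimal Bloom filter over int64 namespace ids (stdlib only). *)
type bloom = { bits : Bytes.t; m : int; k : int }

(* [m] is the number of bits, [k] the number of hash functions. *)
let create ~m ~k = { bits = Bytes.make ((m + 7) / 8) '\000'; m; k }

(* Bit position of [id] under the i-th hash, derived from a seeded hash. *)
let index b i (id : int64) = Hashtbl.seeded_hash i id mod b.m

let set_bit b pos =
  let byte = Char.code (Bytes.get b.bits (pos / 8)) in
  Bytes.set b.bits (pos / 8) (Char.chr (byte lor (1 lsl (pos mod 8))))

let get_bit b pos =
  Char.code (Bytes.get b.bits (pos / 8)) land (1 lsl (pos mod 8)) <> 0

(* Record [id] as deleted. *)
let add b id =
  for i = 0 to b.k - 1 do
    set_bit b (index b i id)
  done

(* [false] means the id was never added; [true] means "probably added". *)
let mem b id =
  let rec go i = i >= b.k || (get_bit b (index b i id) && go (i + 1)) in
  go 0

(* Usage: remember a few deleted namespace ids from the log above. *)
let deleted = create ~m:(1 lsl 20) ~k:4
let () = List.iter (add deleted) [ 1006819L; 1006818L; 1006817L ]

let () =
  Printf.printf "1006819 probably deleted: %b\n" (mem deleted 1006819L)
```

A bloom filter has no false negatives but can return false positives, so a hit only means "probably deleted". Whether a hit may skip the arakoon lookup outright, or should only lower its priority, is a design decision this sketch does not settle.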
I made an improvement related to this ticket, see #803.
@toolslive do we consider that enough to close this ticket, or do we want to do more?
this should be enough.