openvstorage/alba

deleted namespaces blow up maintenance processes.

Closed this issue · 3 comments

On an environment with more than 1 million deleted namespaces (don't ask),
maintenance suffers all kinds of problems.
The most pressing one is that maintenance blows up on PropagatePreset, pulling the abms and nsms under with it.

Unfortunately, it's a combination of things:
PropagatePreset makes the maintenance agent come here:
https://github.com/openvstorage/alba/blob/1.3.19/ocaml/src/alba_client_preset_cache.ml#L25

       | Some (version', namespace_ids) ->
          Lwt_list.map_p
            (fun namespace_id ->
              Lwt.catch
                (fun () ->
                  nsm_host_access # get_namespace_info ~namespace_id >>= fun (ns_info, _, _) -> ...

On this environment, namespace_ids contains 130k items. You can check this with:

nuvolat@NY2SRV0001:~/romain$ ./alba.native get-presets-propagation-state --preset \
  global_no_encrypt  \
  --config arakoon://config/ovs/arakoon/ny2-hddbackend02-abm/config?ini=%2Fmnt%2Fssd1%2Farakoon%2Fexternal_arakoon_cacc.ini
2017-07-20 07:03:27 337511 -0400 - NY2SRV0001 - 29962/0000 - alba/cli - 0 - info - Albamgr_client.make_client :ny2-hddbackend02-abm
2017-07-20 07:03:27 338466 -0400 - NY2SRV0001 - 29962/0000 - alba/cli - 1 - info - connect_with : 172.17.23.7 26408 None Net_fd.TCP (fd:3)
2017-07-20 07:03:27 531782 -0400 - NY2SRV0001 - 29962/0000 - alba/cli - 2 - info - prop_states = [(Some (3L,
   [1006819L; 1006818L; 1006817L; 1006816L; 1006815L; 1006814L;
    1006813L; 1006812L; 1006811L; 1006810L; 1006809L; 1006808L;
...

Most of the items point to namespaces that no longer exist. Since these namespaces don't exist,
they are not in the cache, so arakoon is contacted for every one of them. If it doesn't succumb under the load, it will confirm that they don't exist. With debug logging turned on, you see an excess of mgr # list_namespaces_by_id calls that try to fill up the namespace info cache, followed (because that doesn't help, as the namespaces don't exist) by mgr # get_namespace_by_id ~namespace_id calls.
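To make the cache behaviour concrete, here is a toy model in plain OCaml. It is not alba code: `backend_lookup`, `cache` and the id ranges are made up for illustration, and the real cache is filled via the manager calls mentioned above. It only shows that a cache which stores existing namespaces never absorbs lookups for deleted ones.

```ocaml
(* Toy model, not alba code: a cache that only stores namespaces that exist. *)
let backend_calls = ref 0

(* Hypothetical backend: only even ids below 10 still exist. *)
let backend_lookup (id : int64) : string option =
  incr backend_calls;
  if Int64.compare id 10L < 0 && Int64.rem id 2L = 0L
  then Some (Printf.sprintf "ns-%Ld" id)
  else None

let cache : (int64, string) Hashtbl.t = Hashtbl.create 16

let get_namespace_info (id : int64) : string option =
  match Hashtbl.find_opt cache id with
  | Some info -> Some info
  | None ->
    (match backend_lookup id with
     | Some info -> Hashtbl.replace cache id info; Some info
     | None -> None (* nothing is remembered for a missing namespace *))

let () =
  (* Two passes over the same ids, as two PropagatePreset runs would do. *)
  let ids = [2L; 4L; 1006819L; 1006818L; 1006817L] in
  List.iter (fun id -> ignore (get_namespace_info id)) ids;
  List.iter (fun id -> ignore (get_namespace_info id)) ids;
  (* prints "backend calls: 8": 5 in the first pass, 3 in the second *)
  Printf.printf "backend calls: %d\n" !backend_calls
```

The two existing namespaces cost one backend call each, while every deleted id costs a call on every pass.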

I patched an alba maintenance agent with

  • a primitive client side dedupe of tasks
  • replacing Lwt_list.map_p and Lwt_list.iter_p with their sequential _s variants in the preset_cache
  • a hackish dedupe of method refresh ~preset_name
    to make sure there's only 1 PropagatePreset going on at any point,

but it's ugly, and done under pressure (branch on EE repo), while trying to reduce the number of maintenance tasks on that environment (which was > 7e6 at some point in time). It needs to be revisited.
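For illustration only, a client-side dedupe of tasks could look roughly like the sketch below. This is not the patch from the EE branch: the `task` type and its constructors are hypothetical stand-ins for the real maintenance work items, and the sketch uses only the OCaml standard library.

```ocaml
(* Hypothetical task type; the real maintenance work items differ. *)
type task =
  | PropagatePreset of string
  | CleanupNamespace of int64

(* Keep the first occurrence of each task and preserve the original order. *)
let dedupe (tasks : task list) : task list =
  let seen = Hashtbl.create 64 in
  List.filter
    (fun t ->
      if Hashtbl.mem seen t then false
      else (Hashtbl.add seen t (); true))
    tasks

let () =
  let tasks =
    [ PropagatePreset "global_no_encrypt";
      CleanupNamespace 1006819L;
      PropagatePreset "global_no_encrypt";
      CleanupNamespace 1006819L;
      CleanupNamespace 1006818L ] in
  (* prints "5 -> 3 tasks" *)
  Printf.printf "%d -> %d tasks\n"
    (List.length tasks) (List.length (dedupe tasks))
```

Deduping on the client only reduces the work per fetch; it does not shrink the task queue itself.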

If we're merely fighting symptoms, we can also contemplate a bloom filter for the deleted namespaces ...
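A minimal sketch of what such a bloom filter could look like, using only the OCaml standard library. The sizes `m` and `k`, the hashing scheme and all names are assumptions for illustration; nothing like this exists in alba as far as this ticket shows.

```ocaml
(* Sketch of a bloom filter over namespace ids; all names are hypothetical. *)
type bloom = { bits : Bytes.t; m : int; k : int }

(* m = number of bits, k = number of probes per id. *)
let create ~m ~k = { bits = Bytes.make ((m + 7) / 8) '\000'; m; k }

(* i-th probe position, derived from two hashes of the id (double hashing). *)
let position b (id : int64) i =
  let h1 = Hashtbl.hash id in
  let h2 = Hashtbl.seeded_hash 42 id in
  (h1 + i * (h2 lor 1)) mod b.m

let set_bit b p =
  let byte = Char.code (Bytes.get b.bits (p / 8)) in
  Bytes.set b.bits (p / 8) (Char.chr (byte lor (1 lsl (p mod 8))))

let get_bit b p =
  (Char.code (Bytes.get b.bits (p / 8)) land (1 lsl (p mod 8))) <> 0

(* Record a deleted namespace id. *)
let add b id =
  for i = 0 to b.k - 1 do set_bit b (position b id i) done

(* true: the id may have been deleted; false: it was never added. *)
let mem b id =
  let rec go i = i >= b.k || (get_bit b (position b id i) && go (i + 1)) in
  go 0

let () =
  let deleted = create ~m:1024 ~k:3 in
  List.iter (add deleted) [1006819L; 1006818L; 1006817L];
  Printf.printf "1006819 possibly deleted: %b\n" (mem deleted 1006819L)
```

A bloom filter can return false positives, so a hit would still need to be confirmed before a namespace is treated as deleted; a miss means the id was never added to the filter.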

domsj commented

I made an improvement related to this ticket, see #803.
@toolslive do we consider that enough to close this ticket, or do we want to do more?

This should be enough.