openvstorage/alba

Maintenance tasks keep increasing when an ASD is down (CleanupOsdNamespace)

Opened this issue · 3 comments

I know it is normal for the number of maintenance tasks to increase, but after 15 minutes the chunks on this ASD will be recreated on other ASDs.
So if an ASD is down for x hours, the maintenance agent should ignore these tasks.

In my example, OSDs 14 and 15 have been down for 3 days and the amount of work on the maintenance agent still increases, even though the auto repair timeout is 900 seconds.

Maintenance config = {
  "enable_auto_repair": true,
  "auto_repair_timeout_seconds": 900.0,
  "auto_repair_disabled_nodes": [],
  "enable_rebalance": true,
  "cache_eviction_prefix_preset_pairs": {},
  "redis_lru_cache_eviction": {
    "host": "172.17.16.22",
    "port": 6379,
    "key": "alba_lru_56f58646-419d-4236-a868-e3b79ac8784d"
  }
}
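
To make the numbers concrete: with the config above, an OSD that has been down for 3 days is far past the auto repair timeout. The following is only a minimal sketch of the kind of check being asked for. The helper `exceeds_auto_repair_timeout` is a hypothetical name, not an alba function, and it only uses the two config keys shown above.

```python
import json

# Subset of the maintenance config quoted above.
config = json.loads("""
{
  "enable_auto_repair": true,
  "auto_repair_timeout_seconds": 900.0
}
""")


def exceeds_auto_repair_timeout(down_seconds, config):
    """Hypothetical helper (not part of alba): True when auto repair is
    enabled and the OSD has been down longer than the configured timeout."""
    return (config["enable_auto_repair"]
            and down_seconds > config["auto_repair_timeout_seconds"])


three_days = 3 * 24 * 60 * 60  # downtime of OSDs 14 and 15, in seconds
print(exceeds_auto_repair_timeout(three_days, config))
```

Under this reading, the data on OSDs 14 and 15 should long since have been repaired elsewhere, which is why the still-growing work queue looks unexpected.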

work items:

54158935 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004123L))"
54159096 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004127L))"
54159149 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004126L))"
54159150 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (14L, 1004126L))"
54159271 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004134L))"
54159324 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004132L))"
54159377 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004129L))"
54159430 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004135L))"
54159483 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004128L))"
54159536 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004131L))"
54159537 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (14L, 1004131L))"
54159589 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004138L))"
54159590 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (14L, 1004138L))"
54159642 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004136L))"
54159704 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004143L))"
54159757 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004145L))"
54159863 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004137L))"
54159916 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004142L))"
54159917 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (14L, 1004142L))"
54159969 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004144L))"
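
A quick way to see which OSDs the queued items target is to parse the listing and count per OSD id. The sketch below assumes the items are available as text lines in exactly the printed format shown above; the helper names are made up for illustration and only three of the listed lines are included as sample input.

```python
import re
from collections import Counter

# Matches: <work id> | "(...CleanupOsdNamespace (<osd id>L, <namespace id>L))"
ITEM_RE = re.compile(
    r'^(\d+) \| "\(Albamgr_protocol\.Protocol\.Work\.CleanupOsdNamespace '
    r'\((\d+)L, (\d+)L\)\)"$'
)


def parse_work_item(line):
    """Return (work_id, osd_id, namespace_id), or None if the line does not match."""
    m = ITEM_RE.match(line.strip())
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def count_per_osd(lines):
    """Count CleanupOsdNamespace items per OSD id, skipping unparsable lines."""
    items = (parse_work_item(line) for line in lines)
    return Counter(osd for (_work_id, osd, _ns) in filter(None, items))


sample = [
    '54159149 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004126L))"',
    '54159150 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (14L, 1004126L))"',
    '54159271 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004134L))"',
]
print(count_per_osd(sample))
```

Of the 20 items listed above, 16 target OSD 15 and 4 target OSD 14, so every queued item refers to one of the two OSDs that are down.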

Number of work items: [image]

54159757 | "(Albamgr_protocol.Protocol.Work.CleanupOsdNamespace (15L, 1004145L))"

This item means: delete all that's left on osd 15L from namespace 1004145L.
The namespace was deleted, but the osd was down, so the work item is kept in the work queue in the abm (and is retried and tracked, without success, by the maintenance processes).

@toolslive

  • Would the items be removed from the queue if the OSD is removed through the model?
  • Would it make sense not to keep the work items in the queue when a namespace delete did not succeed on an OSD, and instead to retry periodically based on the flag that the namespace is deleted?

If the OSD was purged, the CleanupOsdNamespace items will complete without a problem.
The maintenance agent that handles them will log

   "UnknownOsd(%Li) => no cleanup to be done anymore"

at info level.
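
The behaviour described in this thread can be summarised with a toy model. This is not alba's implementation, only a sketch under the assumptions stated in the comments above: an item for a down OSD stays queued and is retried, while an item for a purged (unknown) OSD completes with the info message. All names below (`OsdState`, `handle_cleanup`) are hypothetical, and the log text imitates the message quoted above.

```python
from enum import Enum


class OsdState(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # purged: no longer known at all


def handle_cleanup(osd_id, namespace_id, osd_states, log):
    """Toy handler: return True when the work item can leave the queue."""
    state = osd_states.get(osd_id, OsdState.UNKNOWN)
    if state is OsdState.UNKNOWN:
        # Purged OSD: nothing left to clean, the item completes.
        log.append("UnknownOsd(%d) => no cleanup to be done anymore" % osd_id)
        return True
    if state is OsdState.DOWN:
        # OSD unreachable: the item stays queued and is retried later.
        return False
    # OSD reachable: deleting the leftovers of namespace_id is not modelled here.
    return True


queue = [(15, 1004145), (14, 1004142)]
osd_states = {15: OsdState.DOWN}  # assumption for the example: OSD 14 was purged
log = []
queue = [item for item in queue if not handle_cleanup(*item, osd_states, log)]
print(queue, log)
```

In this model the item for OSD 15 survives every pass for as long as the OSD is down, which matches the growing queue in the report, while the item for the purged OSD 14 drops out on the first pass.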