
KafkaSinkCluster - evict items from stored metadata when removed from cluster


We need a way to remove dead items from the cluster state.

This includes KafkaSinkCluster::topic_by_name, KafkaSinkCluster::topic_by_id and KafkaSinkCluster::group_to_coordinator_broker.

Some possible ways to achieve this:

  1. Store a last-updated timestamp for each item and evict anything not updated for more than X hours.
  2. Store a last-used timestamp for each item and evict anything not used for more than X hours.
  3. Run a background task every X hours that:
    • fetches a clean set of metadata
    • iterates over the existing metadata, deleting anything that doesn't exist in the fetched metadata.
  4. Run a background task every X hours that just completely clears the metadata.
  5. Clear the entire collection if it exceeds N items, BEFORE adding a new item.
  6. Remove a random item from the collection if it exceeds N items, BEFORE adding a new item (see the sketch after this list).
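
To make option 6 concrete, here is a minimal sketch of the evict-before-insert idea, assuming a HashMap-backed collection and the rand crate; the function name and the max_items parameter are illustrative, not existing shotover code:

```rust
use std::collections::HashMap;

use rand::Rng;

/// Sketch of option 6: before inserting a new item, if the collection is at
/// its cap, evict one randomly chosen existing entry.
/// Assumes max_items >= 1.
fn insert_with_random_eviction<V>(
    map: &mut HashMap<String, V>,
    max_items: usize,
    key: String,
    value: V,
) {
    if !map.contains_key(&key) && map.len() >= max_items {
        // Pick a random existing key and remove it. `nth` is O(n), but this
        // path only runs once the cap is reached.
        let index = rand::thread_rng().gen_range(0..map.len());
        if let Some(victim) = map.keys().nth(index).cloned() {
            map.remove(&victim);
        }
    }
    map.insert(key, value);
}
```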

I think 6 is the best option:

  • Using the collection length instead of elapsed time means we map more closely to actual memory usage (unlike 1-4).
  • The implementation is simple (unlike 3).
  • Avoids a last-used timestamp, which is expensive to maintain (unlike 2).
  • Avoids a last-updated timestamp, which does not accurately describe usage (unlike 1).
  • Avoids having to recollect all metadata from scratch (unlike 3 and 4).

The downsides are:

  • We might need the item that is randomly deleted.
    • In fact, I think this will cause a race condition between metadata fetching and routing. This isn't a huge problem though: we fall back to random node selection and return a routing error to the client, who will just try again.
    • However, a neat effect here is:
      • If we have lots of unused entries, it's likely we'll remove an unused entry.
      • If we have few unused entries, we'll likely remove an entry we are using, but in that case we are also hitting the limits of the system anyway, so degraded performance is OK.

Having looked at the way topics/partitions end up being used in production, this approach is a bit naive.
The limit will need to be enforced on the number of partitions rather than the number of topics, since the number of topics tends to actually be quite low.
But then it's not clear which topic to evict.
Should we prefer evicting topics with lots of partitions?
Should we be evicting at the partition level instead of the topic level?
No, because partitions are already evicted automatically when new metadata is fetched for their topic.
So we just need to delete old topics; a sketch of partition-capped, topic-level eviction follows below.
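
Assuming topics live in a HashMap and each topic knows its partition count (all names here are hypothetical, not shotover's actual types), enforcing the cap on total partitions while evicting whole topics could look like this; the pick_victim callback is exactly the open question above:

```rust
use std::collections::HashMap;

struct Topic {
    partition_count: usize,
    // ... partition metadata ...
}

/// Sketch: enforce a cap on the total number of partitions, but evict at
/// topic granularity.
fn evict_until_under_cap(
    topics: &mut HashMap<String, Topic>,
    max_partitions: usize,
    mut pick_victim: impl FnMut(&HashMap<String, Topic>) -> Option<String>,
) {
    let mut total: usize = topics.values().map(|t| t.partition_count).sum();
    while total > max_partitions {
        // Which topic to pick is the open question: random? most partitions?
        let Some(name) = pick_victim(topics) else { break };
        if let Some(topic) = topics.remove(&name) {
            total -= topic.partition_count;
        }
    }
}
```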

Maybe option 2 would be better after all.
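
For comparison, a minimal sketch of option 2, assuming a HashMap keyed by topic name; the wrapper type and its methods are hypothetical:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Sketch of option 2: each entry records when it was last used, and a
/// background task periodically drops entries idle for longer than max_idle.
struct LastUsedCache<V> {
    entries: HashMap<String, (V, Instant)>,
}

impl<V> LastUsedCache<V> {
    fn get(&mut self, key: &str) -> Option<&V> {
        self.entries.get_mut(key).map(|(value, last_used)| {
            // This per-access write is the maintenance cost that made
            // option 2 look expensive compared to option 6.
            *last_used = Instant::now();
            &*value
        })
    }

    /// Run from a background task every X hours.
    fn evict_idle(&mut self, max_idle: Duration) {
        let now = Instant::now();
        self.entries
            .retain(|_, (_, last_used)| now.duration_since(*last_used) < max_idle);
    }
}
```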

Reducing memory usage

It's also worth thinking about ways to reduce memory usage in the first place:

  • We could combine the shotover_rack_replica_nodes + external_rack_replica_nodes fields into one Vec by adding a first_external_node_index: u64 field.
  • We could somehow avoid duplicating topic_by_name and topic_by_id.
    • Making each partition into an Arc<Partition> should remove the vast majority of duplicated memory while keeping topics mutable (see the sketch after this list).
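
A sketch of what those two ideas could look like together; these struct shapes, field names, and key types are illustrative, not shotover's actual definitions:

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct Partition {
    // One Vec instead of shotover_rack_replica_nodes + external_rack_replica_nodes:
    // entries before first_external_node_index are shotover-rack replicas, the
    // rest are external-rack replicas.
    replica_nodes: Vec<i32>,
    first_external_node_index: u64,
}

struct Topic {
    // With each partition behind an Arc, storing the same topic in both maps
    // below duplicates only the Vec of pointers, not the partition data. To
    // update a partition, replace its Arc; the Topic itself stays mutable.
    partitions: Vec<Arc<Partition>>,
}

struct Metadata {
    topic_by_name: HashMap<String, Topic>,
    topic_by_id: HashMap<[u8; 16], Topic>,
}
```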