`Join` with context cancelation
dimitarvdimitrov opened this issue · 0 comments
Description
The existing `(*Memberlist).Join` method can take a long time to complete for large clusters. The problem is exacerbated when some of the addresses to join are non-existent IPs and we end up waiting the full `TCPTimeout` duration on each of them.
For example, we've observed in grafana/mimir that a full join initiated while most of the cluster members are restarting and changing IPs may take as long as 25 minutes. Nodes which are in the middle of a `(*Memberlist).Join` cannot be gracefully shut down until `Join` returns.
Proposal
Add a `context.Context` argument to `(*Memberlist).Join` and check it between `pushPull`ing with each node.
Alternatively, if you don't want to break existing clients, we can create a new method `JoinContext` which does the above.
I'm creating this issue to get feedback on the idea. After discussion I am happy to open a PR.