hashicorp/memberlist

`Join` with context cancelation

dimitarvdimitrov opened this issue · 0 comments

Description

The existing (*Memberlist).Join method can take a long time to complete for large clusters. The problem is exacerbated when some of the addresses to join are non-existent IPs, because Join then waits the full TCPTimeout on each of them.

For example, we've observed in grafana/mimir that a full join initiated while most of the cluster members are restarting and changing IPs may take as long as 25 minutes. A node that is in the middle of a (*Memberlist).Join cannot be gracefully shut down until Join returns.

Proposal

Add a context.Context argument to (*Memberlist).Join and check it between the push/pull exchanges with each node.

Alternatively, if you don't want to break existing clients, we can add a new method, JoinContext, which does the above.

I'm creating this issue to get feedback on the idea. After discussion I am happy to open a PR.