Implement a sidecar to manage connectivity and configuration changes in the cluster
This issue is meant to describe and scope the discussion around adding a sidecar service to handle some tasks related to connectivity between peers in the cluster (or implementing these tasks in the reconciliation loop).
Problems
- When a new ipfs peer is added, it should open connections to the relays and to the rest of the peers in the cluster
- When a new relay is added, it should open connections to all the ipfs peers in the cluster (or better, the peers should open connections to it).
- When a new ipfs peer is added, the relay should "allow" it
- When a peer is removed, the relay should "disallow" it
- Connectivity should survive, or be re-established automatically, if the connections between peers go down
It is relatively important that peers inside the cluster are well connected through their internal IPs, so they can effortlessly locate and copy content from other peers. It is also important that the relays are locked down and give service only to peers inside the cluster, not to anyone on the IPFS network. In general, we will not see connectivity problems in small demo clusters, but in larger clusters, if they happen, they will result in unreachable peers or in content-fetching problems when replicating content from some peers to others.
We could resolve these problems with the right ipfs and relay daemon configuration. However, growing the cluster and adding more peers would require configuration updates and a restart of all peers (not acceptable).
Proposal
We need a helper program (perhaps in the form of a sidecar) that regularly talks to the deployed ipfs peers and uses the POST /api/v0/swarm/connect endpoint to establish connections between the peers in the cluster. Connections added via this endpoint have priority and should be relatively stable, but may still be dropped by the Connection Manager if really needed. The connection and its priority would also not survive a restart, which is more worrying.
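For illustration, a minimal sketch of what the helper would do for a single connection, assuming the Kubo RPC API is reachable on port 5001 (the addresses and peer ID below are placeholders):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// connectPeer asks the Kubo node whose RPC API is at apiAddr to open (and
// prioritize) a connection to the given multiaddress. The RPC API only
// accepts POST requests.
func connectPeer(apiAddr, peerMultiaddr string) error {
	u := fmt.Sprintf("%s/api/v0/swarm/connect?arg=%s", apiAddr, url.QueryEscape(peerMultiaddr))
	resp, err := http.Post(u, "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("swarm/connect returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Placeholder in-cluster addresses and peer ID.
	err := connectPeer("http://ipfs-0.ipfs:5001", "/ip4/10.0.1.23/tcp/4001/p2p/12D3KooWExamplePeerID")
	if err != nil {
		fmt.Println("connect failed:", err)
	}
}
```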
Additionally, there are several ways of approaching relay authorization when the peers in the cluster change:
- Easiest: the relays are configured directly to allow a certain subnet to use them. This means the reconciler should be aware of what the Kubernetes internal LAN for the cluster is, and we can simply configure the relays with it, so that only nodes in that LAN can use the relay services (see the subnet-check sketch after this list).
- Alternative: we would need to expand relay functionality with an API to add/remove allowed peers, and call that API from the sidecar when new ipfs nodes appear or go away.
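To make the "allow a certain subnet" option concrete, this is a sketch of the kind of check the relay side would perform; how exactly it gets wired into the relay (ACL configuration vs. a connection gater) is left open, and the CIDR is only an example:

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowedSubnet would be derived from the cluster's internal LAN/pod CIDR;
// the value here is just an example.
var allowedSubnet = netip.MustParsePrefix("10.0.0.0/16")

// allowed reports whether a connecting peer's IP falls inside the cluster
// subnet; only such peers would be able to use the relay services.
func allowed(remoteIP string) bool {
	addr, err := netip.ParseAddr(remoteIP)
	if err != nil {
		return false
	}
	return allowedSubnet.Contains(addr)
}

func main() {
	fmt.Println(allowed("10.0.1.23"))   // true: inside the cluster LAN
	fmt.Println(allowed("203.0.113.9")) // false: random peer from the public network
}
```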
The sidecar should also help connect/re-connect cluster peers among themselves later, as they have the same issue.
For now, the helper would do something like the following (a rough sketch in Go follows the list):
- Find out the addresses of all kubo containers in the cluster (using the Kubernetes API)
- Call the /api/v0/id endpoint to find their peer IDs
- Trigger swarm/connect calls with the IPs and peer IDs of the peers in the cluster
- (potentially authorize/deauthorize peers in the relays)
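A rough sketch of that loop, leaving out the Kubernetes part (in the real helper the pod IPs would come from listing the kubo pods, e.g. with client-go) and assuming the RPC API on port 5001 and the swarm port on 4001/tcp:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// idResponse is the subset of the /api/v0/id reply we need.
type idResponse struct {
	ID string `json:"ID"`
}

// peerID asks the Kubo RPC API on the given pod IP for its peer ID.
func peerID(podIP string) (string, error) {
	resp, err := http.Post(fmt.Sprintf("http://%s:5001/api/v0/id", podIP), "", nil)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var id idResponse
	if err := json.NewDecoder(resp.Body).Decode(&id); err != nil {
		return "", err
	}
	return id.ID, nil
}

// connectAll makes every peer dial every other peer over the internal pod IPs.
func connectAll(podIPs []string) error {
	ids := make(map[string]string, len(podIPs)) // pod IP -> peer ID
	for _, ip := range podIPs {
		id, err := peerID(ip)
		if err != nil {
			return fmt.Errorf("id %s: %w", ip, err)
		}
		ids[ip] = id
	}
	for _, from := range podIPs {
		for _, to := range podIPs {
			if from == to {
				continue
			}
			maddr := fmt.Sprintf("/ip4/%s/tcp/4001/p2p/%s", to, ids[to])
			u := fmt.Sprintf("http://%s:5001/api/v0/swarm/connect?arg=%s", from, url.QueryEscape(maddr))
			resp, err := http.Post(u, "", nil)
			if err != nil {
				return err
			}
			resp.Body.Close()
		}
	}
	return nil
}

func main() {
	// In the real helper these IPs would come from the Kubernetes API;
	// hard-coded placeholders here.
	if err := connectAll([]string{"10.0.1.23", "10.0.1.24", "10.0.1.25"}); err != nil {
		fmt.Println(err)
	}
}
```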
Open questions
- I am unsure if we need a full separate sidecar, or whether we can program this inside the reconciliation code (it should preferably run regularly even if no changes have been made).
- If we decide to have a sidecar, it should be able to talk to Kubernetes and to our peers, so it is similar to the operator itself in that it needs access to the Kubernetes APIs.
- Per ipfs/kubo#6313, it seems that swarm/connect adds connections with higher priority, but this is probably not the same as "protected peers". Protected peer connections are never terminated and re-establish themselves (I think; this needs to be verified). Thus it may be worth pushing for a more flexible swarm/connect API in Kubo that allows adding protected peers (see the go-libp2p sketch at the end of this issue for what "protected" means in-process).
- If swarm/connect adds a connection with priority, swarm/disconnect should probably remove that priority? I am not sure whether it does, nor whether we should call swarm/disconnect when the cluster scales down.
- We should investigate if the relay-daemon also runs a connection manager, and whether we can protect some connections from it (namely those from our peers).
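For reference, this is roughly what "protecting" a peer looks like at the go-libp2p level, inside a node's own process; as far as I know Kubo does not expose this over its RPC API, which is what a more flexible swarm/connect API would address:

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

func main() {
	// Explicitly configure a connection manager so Protect() has an effect.
	cm, err := connmgr.NewConnManager(100, 400)
	if err != nil {
		panic(err)
	}
	h, err := libp2p.New(libp2p.ConnectionManager(cm))
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// A second throwaway host just to have a real peer ID to protect.
	other, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer other.Close()

	// Protect/Unprotect are tag-based: protected peers are skipped when the
	// connection manager trims connections. Whether anything re-dials them
	// after a drop is a separate question (see above).
	h.ConnManager().Protect(other.ID(), "ipfs-operator")
	fmt.Println("protected:", h.ConnManager().IsProtected(other.ID(), "ipfs-operator"))
}
```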