k4yt3x/wg-meshconf

Bug: `PersistentKeepalive` is added to the wrong side.

rudolfbyker opened this issue · 6 comments

Here is a sample database.csv file:

"Name","Address","Endpoint","AllowedIPs","ListenPort","PersistentKeepalive","FwMark","PrivateKey","DNS","MTU","Table","PreUp","PostUp","PreDown","PostDown","SaveConfig"
"alpha","10.0.0.1","alpha.example.com","","51820","","","aCqUml43tjjw3SBVM+M9IyrApZ5pthzrFBMxWKnLPE0=","","","","","","","",""
"beta","10.0.0.2","beta.example.com","","51820","","","cJE+l3HYxY5eHzxUfvNP7i5nbR7TjaSVYwqxvjW4Hl4=","","","","","","","",""
"gamma","10.0.0.3","","","51820","25","","4B6WZu3OyCXIOWyuN7dFKh/FGsVIFzkSuGsszdMsLlg=","","","","","","","",""

This generates the following config:

alpha:

[Interface]
# Name: alpha
Address = 10.0.0.1
PrivateKey = aCqUml43tjjw3SBVM+M9IyrApZ5pthzrFBMxWKnLPE0=
ListenPort = 51820

[Peer]
# Name: beta
PublicKey = PBFFzDSLOn0CyxF2d0SWo8F0xktozEQFflmIYgYDg14=
Endpoint = beta.example.com:51820
AllowedIPs = 10.0.0.2

[Peer]
# Name: gamma
PublicKey = HV5qryi3YcrhKQd/4A0h6xxWr+ARlncT06K+BY9XsCU=
AllowedIPs = 10.0.0.3
PersistentKeepalive = 25

beta:

[Interface]
# Name: beta
Address = 10.0.0.2
PrivateKey = cJE+l3HYxY5eHzxUfvNP7i5nbR7TjaSVYwqxvjW4Hl4=
ListenPort = 51820

[Peer]
# Name: alpha
PublicKey = KE5NYiNewB3VwoIHGPXBCDxGphf6m3gGUyLyhy7Vd2A=
Endpoint = alpha.example.com:51820
AllowedIPs = 10.0.0.1

[Peer]
# Name: gamma
PublicKey = HV5qryi3YcrhKQd/4A0h6xxWr+ARlncT06K+BY9XsCU=
AllowedIPs = 10.0.0.3
PersistentKeepalive = 25

gamma:

[Interface]
# Name: gamma
Address = 10.0.0.3
PrivateKey = 4B6WZu3OyCXIOWyuN7dFKh/FGsVIFzkSuGsszdMsLlg=
ListenPort = 51820

[Peer]
# Name: alpha
PublicKey = KE5NYiNewB3VwoIHGPXBCDxGphf6m3gGUyLyhy7Vd2A=
Endpoint = alpha.example.com:51820
AllowedIPs = 10.0.0.1

[Peer]
# Name: beta
PublicKey = PBFFzDSLOn0CyxF2d0SWo8F0xktozEQFflmIYgYDg14=
Endpoint = beta.example.com:51820
AllowedIPs = 10.0.0.2

I have read all I could find about PersistentKeepalive (which is not a lot), and it sounds to me like it should only be specified in the config of the node which is behind NAT, which is gamma in this case. But in the above example, it's specified everywhere except in gamma's config.

Is the bug in wg-meshconf, or in my understanding of PersistentKeepalive?

My use case is to have a VPN of VMs, some of which are behind NAT, without public IPs, but most of which have public IPs. The servers without the public IPs should route traffic through one of the servers which DO have public IPs to reach other servers with don't have public IPs.

And yes, I'm trying to replace tinc :)

Long story short, follow this:

PersistentKeepalive should be specified on the node BEHIND the NAT.

You'll need this at all because the nodes that have a public IP can always be reached without an existing NAT connection, where as nodes behind NAT cannot be reached actively without maintaining a NAT connection on the firewall or whatever that's forwarding the connection. Therefore, you use PersistentKeepalive to keep that connection alive so the outside will always be able to launch an active connection to the node behind NAT. It won't work the other way around because the node outside cannot reach the node behind NAT, so the initial connection won't get established.

Is the bug in wg-meshconf, or in my understanding of PersistentKeepalive?

With that said, your understanding seems to be correct. I'll take a look at the code, it might be a bug.

Thanks for the feedback :)

I have done some tests … It seems like PersistenKeepalive sort-of works when specified in the "wrong" (i.e. the public) VM's config file, but only after you manually make the first connection from the "correct" (i.e. private behind NAT) VM. A simple ping -c 1 ... works. After that, the public one keeps it going by pinging back every 25 seconds. But when it breaks down for whatever reason, I need to ping again manually from the private VM to get it working again.

So it would be nice if we could turn this around in the config generator.

Looking into the code, it seems that the problem here is that PersistentKeepAlive is listed as a "peer attribute", which in the genconfig method is treated as something that should be set in the [Peer] section of OTHER nodes. But this specific attribute should be set on that specific peer, since it should call out to other peers to let them know where it is, since it does not have a public endpoint.

So I think we need to distinguish between two types of [Peer] attributes:

  • Those that must be set on the [Peer] section for this peer, in the config of all other nodes.
  • Those that must be set on all [Peer] sections in the config of this peer.

PersistentKeepAlive is implemented as the former, but it should be the latter.

@k4yt3x my PR #33 fixes this bug. Please consider merging.

k4yt3x commented

Hey @rudolfbyker thanks for the PR. Just give me a bit of time to verify that everything works fine and the change makes sense.