Remote endpoint setting gets lost at or after reboot
vchrizz opened this issue · 8 comments
Package version
1.0.20220627-1
Firmware version
v2.0.9-hotfix.4
Device
EdgeRouter X (SFP) - e50
Issue description
Hi,
previously I used 1.0.20211208-1 which worked fine for me.
Then I upgraded to 1.0.20220627-1 and noticed that the endpoint setting gets lost after a reboot.
Setting the endpoint again re-establishes the WireGuard connection.
I upgraded my e50 device from v2.0.9-hotfix.2 to v2.0.9-hotfix.4 but that didn't fix the issue.
On another e50 device (EdgePoint EP-R6) which still has v2.0.9-hotfix.1 and 1.0.20211208-1 everything is fine, even after reboot.
So I suspect that some change in 1.0.20220627-1 causes the endpoint setting to get lost when the e50 device reboots.
Hope you can fix that.
Thanks and kind regards
Configuration and log output
set interfaces wireguard wg0 mtu 1420
set interfaces wireguard wg0 peer 1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU= allowed-ips 172.27.0.0/24
set interfaces wireguard wg0 peer 1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU= allowed-ips 10.5.44.0/24
set interfaces wireguard wg0 peer 1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU= endpoint 'server.mydomain.tld:51820'
set interfaces wireguard wg0 peer 1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU= persistent-keepalive 25
set interfaces wireguard wg0 peer 1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU= preshared-key <redacted>
set interfaces wireguard wg0 private-key <redacted>
set interfaces wireguard wg0 route-allowed-ips true
Just for reference, here is information about the devices where I notice the issue and where I do not.
By the way, which log would be helpful for finding the cause? Running
grep -r wireguard /var/log/*
shows matches only in dpkg.log (package status) and in vyatta/cfg-stdout.log (setting up the peer).
Here the issue exists, and it persists: I tried to downgrade the wireguard package, but that does not seem to be easily possible, whichever way I try (overwriting with the old package, or purging the new one, reinstalling the old one and rebooting):
root@bec2-router:~# dpkg -l wireguard | grep ^ii
ii wireguard 1.0.20220627-1 mipsel fast, modern, secure kernel VPN tunnel
root@bec2-router:~# wg -v
wireguard-tools v1.0.20210914 - https://git.zx2c4.com/wireguard-tools/
root@bec2-router:~# show version | head -n1
Version: v2.0.9-hotfix.1
here also, even after upgrading EdgeOS (from v2.0.9-hotfix.1):
root@hei76-router:~# dpkg -l wireguard | grep ^ii
ii wireguard 1.0.20220627-1 mipsel fast, modern, secure kernel VPN tunnel
root@hei76-router:~# wg -v
wireguard-tools v1.0.20210914 - https://git.zx2c4.com/wireguard-tools/
root@hei76-router:~# show version | head -n1
Version: v2.0.9-hotfix.4
but here there is no such issue:
root@fl49-router:~# dpkg -l wireguard | grep ^ii
ii wireguard 1.0.20210606-2 mipsel fast, modern, secure kernel VPN tunnel
root@fl49-router:~# wg -v
wireguard-tools v1.0.20210914 - https://git.zx2c4.com/wireguard-tools/
root@fl49-router:~# show version | head -n1
Version: v2.0.9-hotfix.1
Update: I could also reproduce this issue with package 1.0.20211208-1, so it seems the issue arose with this version:
root@ketz1-router:~# dpkg -l wireguard | grep ^ii
ii wireguard 1.0.20211208-1 mipsel fast, modern, secure kernel VPN tunnel
root@ketz1-router:~# wg -v
wireguard-tools v1.0.20210914 - https://git.zx2c4.com/wireguard-tools/
root@ketz1-router:~# show version | head -n1
Version: v2.0.9-hotfix.3
Another update: I tried to debug this further and noticed the issue with an older version too.
On a router with a wrong DNS server setting, I could not set the endpoint at all:
root@gurk46-router# set interfaces wireguard wg0 peer "1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU=" endpoint server.mydomain.tld:51820
[edit]
root@gurk46-router# commit
[ interfaces wireguard wg0 peer 1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU= endpoint server.mydomain.tld:51820 ]
Try again: `server.mydomain.tld:51820'. Trying again in 1.00 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 1.20 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 1.44 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 1.73 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 2.07 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 2.49 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 2.99 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 3.58 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 4.30 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 5.16 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 6.19 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 7.43 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 8.92 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 10.70 seconds...
Try again: `server.mydomain.tld:51820'. Trying again in 12.84 seconds...
Try again: `server.mydomain.tld:51820'
Commit failed
[edit]
root@gurk46-router#
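A side note on the retry output above: the delays are consistent with a 1 second start that is multiplied by 1.2 after each failed attempt, over 15 retries. Below is a small sketch that reproduces that schedule and sums it, to show roughly how long a commit can block before it fails. The function name backoff_schedule and its parameters are made up for illustration, and the factor is inferred from the log, not taken from the wireguard-tools source.

```shell
# Print a geometric backoff schedule and its total.
# $1 = number of retries (default 15), $2 = growth factor (default 1.2).
# Both defaults are inferred from the log output in this issue.
backoff_schedule() {
    awk -v n="${1:-15}" -v f="${2:-1.2}" 'BEGIN {
        d = 1                      # first delay in seconds
        for (i = 0; i < n; i++) {
            printf "%.2f\n", d     # delay before the next attempt
            total += d
            d *= f
        }
        printf "total %.2f\n", total
    }'
}

# Example: backoff_schedule 15 1.2
```

With the defaults this prints the same 15 delays as the log, which add up to about 72 seconds of waiting (not counting the time each lookup itself takes).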
This particular case was just a DNS issue, but it got me thinking: maybe the issue has to do with DNS resolution at boot, which does not work until the router has a working Internet connection, so the endpoint does not get saved.
In the other cases I set the endpoint to an IP address instead and then rebooted the router to see whether the endpoint would get lost again, but this time it did not vanish.
So my guess for the root cause is as follows: on boot, wireguard-vyatta-ubnt tries to resolve the host, and if it cannot do so because of either wrong DNS settings (as seen in the example above) or a lack of internet connection (as I notice on different routers), it does not save the endpoint in the Vyatta configuration.
Any ideas how this could be fixed?
I noticed this a couple weeks ago while setting up a new ER-X with the latest Wireguard package. Everything used the latest versions.
Strangely, even though the endpoint line disappeared from the configuration printout, it still connected with the other end! So perhaps this is only a cosmetic issue?
Anyway, I was under extreme time pressure to get this system working, so I configured the endpoint statement with a static IP address to avoid the possibility of losing connectivity, and then the setting remained in the configuration listing.
> Strangely, even though the endpoint line disappeared from the configuration printout, it still connected with the other end! so perhaps this is only a cosmetic issue?
Sorry, I cannot confirm this. As soon as the endpoint is not configured, the tunnel is disconnected.
The tunnel reconnects when the endpoint is configured and the change is committed.
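To tell a cosmetic loss (the line is only missing from the configuration listing) from a real one, the runtime state of the interface can be checked directly. wg show wg0 endpoints prints one public key and endpoint per line, separated by a tab, with (none) where no endpoint is set. A minimal sketch that relies on that output format; the helper name peers_without_endpoint is made up:

```shell
# Print the public key of every peer that has no endpoint at runtime.
# Expects the output of `wg show <interface> endpoints` on stdin:
# one line per peer, "<public key><TAB><endpoint or (none)>".
peers_without_endpoint() {
    awk -F '\t' '$2 == "(none)" { print $1 }'
}

# Example (run as root on the router):
#   wg show wg0 endpoints | peers_without_endpoint
```

If this prints the peer key after a reboot, the kernel really has no endpoint for that peer, and the tunnel cannot initiate a connection from this side.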
maybe an idea for a fix would be one of the following:
- on bootup, wait until the internet connection is established and DNS queries are possible
- do not make DNS queries at configuration time at all; after configuration, continuously try to resolve and connect to the endpoint until the connection is established
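The first idea could probably be approximated with a boot script, without any change to the package. A rough sketch, assuming getent is available on the router and that the script runs late in boot (for example from /config/scripts/post-config.d/); wait_for_dns and the RESOLVER variable are hypothetical names, and the peer key and hostname are the ones from this issue:

```shell
# Wait until a hostname resolves, retrying a fixed number of times.
# $1 = hostname, $2 = max tries (default 15), $3 = seconds between tries (default 1).
# RESOLVER can override the lookup command; the default `getent hosts` is an assumption.
wait_for_dns() {
    host=$1
    max_tries=${2:-15}
    delay=${3:-1}
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Unquoted on purpose so "getent hosts" splits into command and argument.
        if ${RESOLVER:-getent hosts} "$host" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        try=$((try + 1))
    done
    return 1
}

# Example (run as root on the router):
#   if wait_for_dns server.mydomain.tld 30 2; then
#       wg set wg0 peer '1syRMYD1jIVFMUMm5hF/j0MzjMQmuC5mlcT1VVugIkU=' \
#           endpoint server.mydomain.tld:51820
#   fi
```

Note that wg set only changes the runtime state of the interface. It does not put the endpoint line back into the Vyatta configuration, so this is a workaround, not a fix.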
@vchrizz try turning off route-allowed-ips and setting the static routes manually. I had a similar issue and that fixed it. I've read there can be incompatibilities with routing modifications.
aka
set interfaces wireguard wg0 route-allowed-ips false
lmk
Thank you for the pointer, unfortunately that did not fix the issue.
But I finally found out what caused the issue...
[background]
Basically we use OLSR as the routing daemon in our network. For that we have a somewhat "special setup" to get it working on EdgeMAX devices using Vyatta. Typically you set an IP from the network on every interface where OLSRd is listening. To not waste IP addresses, we used a bridge with ebtables -P FORWARD DROP, set the IP on the bridge, and enabled only this interface in OLSRd. Recently we changed that setup to have the IP address on the eth* interfaces. While this is perfectly fine for Linux, Vyatta does not allow the same IP on more than one interface. So we use a bootup script to set the IP on all eth* interfaces and do not configure those interfaces in Vyatta.
Because we made the "mistake" of putting that script in /config/scripts/post-config.d/, the interfaces got their IP later, so it took OLSRd longer to get all routes from the network, which delayed internet connectivity.
[/background]
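For readers unfamiliar with this kind of setup, here is a sketch of what such a bootup script might look like. It is not the actual script from this network: the helper name ip_commands and the address 192.0.2.1/32 are placeholders, and the sketch only prints the ip commands so they can be inspected before being run.

```shell
# Print (do not run) one `ip addr add` command per eth* interface.
# $1 = address to assign (placeholder), interface names are read from stdin.
ip_commands() {
    addr=$1
    while read -r iface; do
        case $iface in
            # Only eth* interfaces get the shared address; everything else is skipped.
            eth*) printf 'ip addr add %s dev %s\n' "$addr" "$iface" ;;
        esac
    done
}

# Example (run as root on the router; pipe to sh to apply):
#   ls /sys/class/net | ip_commands 192.0.2.1/32
```

Where the script lives decides when it runs: from pre-config.d it runs before the configuration is loaded, from post-config.d only afterwards.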
We have now moved that bootup script from /config/scripts/post-config.d/ to /config/scripts/pre-config.d/, so OLSRd gets the routes from the network much earlier and WireGuard is able to resolve the hostname.
So in the end it was the timing issue noted above: WireGuard can only resolve DNS names and establish the connection(s) to the peer(s) once internet connectivity is available.